How to move from single master to multi-master in an AWS kops kubernetes cluster
Having a master in a Kubernetes cluster is all very well and good, but if that single master goes down the cluster can no longer schedule work: existing pods keep running, but new ones cannot be scheduled and any pods that die will not be replaced.
Running multiple masters makes the cluster more resilient, since the remaining masters can take over when one goes down. However, as I found out, setting up multi-master was quite problematic. The guide here only provided partial help, so after trashing both my own cluster and my company's test cluster, I have expanded on the linked guide.
First, add the subnet details for the new availability zone to your cluster definition: the CIDR, the subnet id, and a name you will remember. For simplicity, I called mine eu-west-2c. If you have utility subnets defined (and you will if you use a bastion), make sure you also define a utility subnet for the new AZ.
[code lang=shell] kops edit cluster --state s3://bucket [/code]
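For reference, the subnets section of the cluster spec ends up looking something like this. This is only a sketch for a private topology with utility subnets; the CIDRs and subnet ids are placeholders, not values from my cluster:
[code lang=text]
  subnets:
  - cidr: 172.20.64.0/19
    id: subnet-xxxxxxxx
    name: eu-west-2c
    type: Private
    zone: eu-west-2c
  - cidr: 172.20.96.0/22
    id: subnet-yyyyyyyy
    name: utility-eu-west-2c
    type: Utility
    zone: eu-west-2c
[/code]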
Now create your master instance groups. You need an odd number of masters to get quorum and avoid split brain (I'm not saying prevent: there are edge cases where it can still happen even with quorum). I'm going to add masters in eu-west-2b and eu-west-2c; AWS recently introduced the third London availability zone, so I'm going to use it.
[code lang=shell] kops create instancegroup master-eu-west-2b --subnet eu-west-2b --role Master [/code]
Make this one have a max/min of 1
[code lang=shell] kops create instancegroup master-eu-west-2c --subnet eu-west-2c --role Master [/code]
Make this one have a max/min of 0 (yes, zero) for now
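You can set the sizes when the editor opens as part of kops create instancegroup, or afterwards with kops edit ig; either way it's just the minSize/maxSize fields in the instance group spec. A sketch for master-eu-west-2b (flags mirror the ones used elsewhere in this post):
[code lang=shell]
kops edit ig master-eu-west-2b --name=clustername --state s3://bucket
[/code]
[code lang=text]
spec:
  maxSize: 1
  minSize: 1
  role: Master
  subnets:
  - eu-west-2b
[/code]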
Reference these in your cluster config
[code lang=shell] kops edit cluster --state=s3://bucket [/code]
[code lang=text]
  etcdClusters:
  - etcdMembers:
    - instanceGroup: master-eu-west-2a
      name: a
    - instanceGroup: master-eu-west-2b
      name: b
    - instanceGroup: master-eu-west-2c
      name: c
    name: main
  - etcdMembers:
    - instanceGroup: master-eu-west-2a
      name: a
    - instanceGroup: master-eu-west-2b
      name: b
    - instanceGroup: master-eu-west-2c
      name: c
    name: events
[/code]
Start the new master
[code lang=shell] kops update cluster --state s3://bucket --yes [/code]
Find the names of the etcd and etcd-events pods and put them into this script. Change "clustername" to the name of your cluster, then run it. Confirm that both member lists now contain two members (in my case etcd-a and etcd-b).
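To find the pod names, list the kube-system pods and grep for etcd-server:
[code lang=shell]
kubectl --namespace=kube-system get pods | grep etcd-server
[/code]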
[code lang=shell]
ETCPOD=etcd-server-ip-10-10-10-226.eu-west-2.compute.internal
ETCEVENTSPOD=etcd-server-events-ip-10-10-10-226.eu-west-2.compute.internal
AZ=b
CLUSTER=clustername

kubectl --namespace=kube-system exec $ETCPOD -- etcdctl member add etcd-$AZ http://etcd-$AZ.internal.$CLUSTER:2380
kubectl --namespace=kube-system exec $ETCEVENTSPOD -- etcdctl --endpoint http://127.0.0.1:4002 member add etcd-events-$AZ http://etcd-events-$AZ.internal.$CLUSTER:2381

echo Member Lists
kubectl --namespace=kube-system exec $ETCPOD -- etcdctl member list
kubectl --namespace=kube-system exec $ETCEVENTSPOD -- etcdctl --endpoint http://127.0.0.1:4002 member list
[/code]
(NOTE: the cluster will break at this point, because etcd now expects a second member that has not joined yet)
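If you want to confirm it is the expected kind of broken, etcd's own health check (etcdctl v2, as used above) will report the newly added member as unreachable until the new master joins:
[code lang=shell]
kubectl --namespace=kube-system exec $ETCPOD -- etcdctl cluster-health
[/code]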
Wait for the new master to show as initialised. Find the instance id of the new master and put it into this script (one way to look it up is shown after the script), and change AWSSWITCHES to whatever switches you need to pass to the aws CLI; I specify my profile and region. The script polls the instance's system status and prints it until it shows "ok".
[code lang=shell]
AWSSWITCHES="--profile personal --region eu-west-2"
INSTANCEID=master2instanceid
while [ "$(aws $AWSSWITCHES ec2 describe-instance-status --instance-ids "$INSTANCEID" --output text | grep SYSTEMSTATUS | cut -f 2)" != "ok" ]
do
  sleep 5s
  aws $AWSSWITCHES ec2 describe-instance-status --instance-ids "$INSTANCEID" --output text | grep SYSTEMSTATUS | cut -f 2
done
aws $AWSSWITCHES ec2 describe-instance-status --instance-ids "$INSTANCEID" --output text | grep SYSTEMSTATUS | cut -f 2
[/code]
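If you don't have the instance id to hand, you can look it up by the Name tag on the instance. I believe kops tags the master instances as ig-name.masters.clustername, but check what yours are actually tagged with:
[code lang=shell]
aws $AWSSWITCHES ec2 describe-instances \
  --filters "Name=tag:Name,Values=master-eu-west-2b.masters.clustername" "Name=instance-state-name,Values=running" \
  --query 'Reservations[].Instances[].InstanceId' --output text
[/code]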
ssh into the new master (or via bastion if needed)
[code lang=shell]
sudo -i
systemctl stop kubelet
systemctl stop protokube
[/code]
Edit /etc/kubernetes/manifests/etcd.manifest and /etc/kubernetes/manifests/etcd-events.manifest. In each file, change the ETCD_INITIAL_CLUSTER_STATE value from new to existing, and under ETCD_INITIAL_CLUSTER remove the third master's definition (it hasn't joined yet).
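For illustration, the relevant environment entries in etcd.manifest end up looking roughly like the snippet below. The exact layout and names vary between kops versions, so treat this as a sketch rather than something to paste in:
[code lang=text]
- name: ETCD_INITIAL_CLUSTER_STATE
  value: existing
- name: ETCD_INITIAL_CLUSTER
  value: etcd-a=http://etcd-a.internal.clustername:2380,etcd-b=http://etcd-b.internal.clustername:2380
[/code]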
Stop the etcd docker containers
[code lang=shell]
docker stop $(docker ps | grep "etcd" | awk '{print $1}')
[/code]
Run this a few times, until docker complains that it needs at least one container name (meaning there are no etcd containers left running). There are two volumes mounted under /mnt/master-vol-xxxxxxxx: one holds /var/etcd/data-events/member/ and the other holds /var/etcd/data/member/, but which is which varies because the volume ids differ.
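A quick listing shows which mount holds which data directory:
[code lang=shell]
ls -d /mnt/master-vol-*/var/etcd/data*/member
[/code]
Then delete the member directory from each volume: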
[code lang=shell]
rm -r /mnt/master-vol-xxxxxx/var/etcd/data-events/member/
rm -r /mnt/master-vol-xxxxxx/var/etcd/data/member/
[/code]
Now start kubelet
[code lang=shell] systemctl start kubelet [/code]
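You can watch for the new master with kops validate, run from your workstation rather than the master:
[code lang=shell]
kops validate cluster --state s3://bucket
[/code]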
Wait until the master shows on the validate list then start protokube
[code lang=shell] systemctl start protokube [/code]
Now do the same with the third master
Edit the third master's instance group and set its min/max to 1
[code lang=shell] kops edit ig master-eu-west-2c --name=clustername --state s3://bucket [/code]
Add it to the etcd clusters (the etcd pods should still be running)
[code lang=shell]
ETCPOD=etcd-server-ip-10-10-10-226.eu-west-2.compute.internal
ETCEVENTSPOD=etcd-server-events-ip-10-10-10-226.eu-west-2.compute.internal
AZ=c
CLUSTER=clustername

kubectl --namespace=kube-system exec $ETCPOD -- etcdctl member add etcd-$AZ http://etcd-$AZ.internal.$CLUSTER:2380
kubectl --namespace=kube-system exec $ETCEVENTSPOD -- etcdctl --endpoint http://127.0.0.1:4002 member add etcd-events-$AZ http://etcd-events-$AZ.internal.$CLUSTER:2381

echo Member Lists
kubectl --namespace=kube-system exec $ETCPOD -- etcdctl member list
kubectl --namespace=kube-system exec $ETCEVENTSPOD -- etcdctl --endpoint http://127.0.0.1:4002 member list
[/code]
Start the third master
[code lang=shell] kops update cluster --name=clustername --state=s3://bucket --yes [/code]
Wait for the new master to show as initialised. Find the instance id of the third master and put it into this script, and change AWSSWITCHES to whatever switches you need to pass to the aws CLI; I specify my profile and region. The script polls the instance's system status and prints it until it shows "ok".
[code lang=shell]
AWSSWITCHES="--profile personal --region eu-west-2"
INSTANCEID=master3instanceid
while [ "$(aws $AWSSWITCHES ec2 describe-instance-status --instance-ids "$INSTANCEID" --output text | grep SYSTEMSTATUS | cut -f 2)" != "ok" ]
do
  sleep 5s
  aws $AWSSWITCHES ec2 describe-instance-status --instance-ids "$INSTANCEID" --output text | grep SYSTEMSTATUS | cut -f 2
done
aws $AWSSWITCHES ec2 describe-instance-status --instance-ids "$INSTANCEID" --output text | grep SYSTEMSTATUS | cut -f 2
[/code]
ssh into the new master (or via bastion if needed)
[code lang=shell]
sudo -i
systemctl stop kubelet
systemctl stop protokube
[/code]
Edit /etc/kubernetes/manifests/etcd.manifest and /etc/kubernetes/manifests/etcd-events.manifest. In each file, change the ETCD_INITIAL_CLUSTER_STATE value from new to existing. This time we DON'T need to remove the third master's definition from ETCD_INITIAL_CLUSTER, since this is the third master.
Stop the etcd docker containers
[code lang=shell]
docker stop $(docker ps | grep "etcd" | awk '{print $1}')
[/code]
Run this a few times, until docker complains that it needs at least one container name (meaning there are no etcd containers left running). As before, there are two volumes mounted under /mnt/master-vol-xxxxxxxx: one holds /var/etcd/data-events/member/ and the other holds /var/etcd/data/member/. Delete the member directory from each volume:
[code lang=shell]
rm -r /mnt/master-vol-xxxxxx/var/etcd/data-events/member/
rm -r /mnt/master-vol-xxxxxx/var/etcd/data/member/
[/code]
Now start kubelet
[code lang=shell] systemctl start kubelet [/code]
Wait until the master shows on the validate list then start protokube
[code lang=shell] systemctl start protokube [/code]
If the cluster validates, do a full respin
[code lang=shell] kops rolling-update cluster --name clustername --state s3://bucket --force --yes [/code]
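Once the rolling update finishes, validate once more and check that all three masters are registered. The label selector below assumes the usual kops master role label, so adjust it if your nodes are labelled differently:
[code lang=shell]
kops validate cluster --state s3://bucket
kubectl get nodes -l kubernetes.io/role=master
[/code]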