Backup Kubernetes – how and why

Backups are important. How many times have you read that you should always back up before doing this or that? How many times did the documentation actually explain how and what you should back up? And if it did, did it also explain how to restore the backup if needed? Unfortunately, backup and restore documentation is often lacking, and this is the case even for serious projects like Kubernetes. Read on to learn more about backups in Kubernetes: how to do them, what to include and why.

If you search for “backup Kubernetes” on Google you will probably find this page. It is probably what you are looking for if you are running the Canonical Distribution of Kubernetes, but for the rest of us it’s quite useless. Unfortunately, the rest of the results are not that helpful on their own. There are pages pointing out that you should do backups, some references to solutions like Heptio Ark and descriptions of how to back up etcd. But it is hard to find anything that puts the pieces together. As of writing, there is actually an open issue regarding this from 2016!

Why backup Kubernetes?

This may seem like a silly question, but it is quite important to know what the backup is for before you decide on how to do it. Some may even wonder if backups are needed at all in Kubernetes. Isn’t it all about stateless applications that you can easily redeploy on any other cluster? Well, state is still quite useful or even necessary. Besides, avoiding downtime (due to migration) is still a thing for stateless applications. There are essentially two reasons for backing up:

  1. To be able to restore a failed master node.
  2. To be able to restore applications (with data).

As you may know, the workload will happily keep running even if the control plane (i.e. the Kubernetes master(s)) goes down. That is, unless the workload needs to talk to the API, of course. But this isn’t very helpful unless you are able to restore the control plane later. In other words, you need backups in order to be able to restore a master node. And this is of course especially important if you run a cluster with just a single master.

The second point in the list is relevant for restoring/migrating the workload to a new cluster or restoring a single failed application. This requires backups of all the resources in the cluster, along with any state stored in persistent volumes. Note that there is a difference here in that these resources should be completely cluster agnostic. In the previous case the backup was heavily tied to a specific cluster exactly because it was supposed to restore that same cluster. But here we are talking about only the workload, which should be able to run on any cluster.

We will focus mostly on the first point in this post: backing up and restoring a master node.

Why not to backup Kubernetes?

You may notice that we didn’t mention backing up worker nodes in the previous section. This is because workers should be interchangeable in Kubernetes. That is, it should not matter which node a pod is running on. As long as there are sufficient resources left in the cluster, you should be able to take down/replace a worker without affecting the workload. Some pods may have to be evicted and rescheduled of course, but if you build your applications correctly this should not be a problem.

If you find yourself needing backups of worker nodes (for example because you are using local storage on the node), you should really consider changing the way you deploy your applications instead. Otherwise you are not really taking advantage of what Kubernetes has to offer.

How to backup Kubernetes

The two reasons for backing up Kubernetes give us (at least) two different backup strategies: one for etcd and the relevant certificates in order to restore the master, and one for the applications running in the cluster. It’s time to take a look at how it can be done!

The documentation on etcd for Kubernetes is quite good on a general level. But as a consequence, etcd is treated like a separate component with few connections to the Kubernetes world. This makes it hard to apply the knowledge. It’s simply unclear what an etcd snapshot has to do with your Kubernetes cluster. Furthermore, there is no information about what else you need to back up. So let’s take a look at what’s needed and how to do it.

Backup a single master

As mentioned previously, we need to back up etcd. In addition to that, we need the certificates and optionally the kubeadm configuration file for easily restoring the master. If you set up your cluster using kubeadm (with no special configuration) you can do it like this:

# Create the backup folder if it does not exist yet
mkdir -p backup
# Backup certificates
sudo cp -r /etc/kubernetes/pki backup/
# Make etcd snapshot
sudo docker run --rm -v "$(pwd)/backup":/backup \
    --network host \
    -v /etc/kubernetes/pki/etcd:/etc/kubernetes/pki/etcd \
    --env ETCDCTL_API=3 \
    k8s.gcr.io/etcd-amd64:3.2.18 \
    etcdctl --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
    --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
    snapshot save /backup/etcd-snapshot-latest.db

# Backup kubeadm-config
sudo cp /etc/kubeadm/kubeadm-config.yaml backup/

Note that the contents of the backup folder should then be stored somewhere safe, where they can survive even if the master is completely destroyed. You may want to use AWS S3 (or similar) for this.

So what is really going on here? There are three commands in the example and all of them should be run on the master node. The first one copies the folder containing all the certificates that kubeadm creates. These certificates are used for secure communications between the various components in a Kubernetes cluster. The final command is optional and only relevant if you use a configuration file for kubeadm. Storing this file makes it easy to initialize the master with the exact same configuration as before when restoring it.
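Coming back to the point about storing the backup folder somewhere safe: before shipping it off the master it can be handy to pack it into a single timestamped archive with a checksum, so you can tell later whether the copy is intact. The following is only a sketch. The function name archive_backup and the archive naming scheme are made up for this example, and it assumes a Linux host with tar and sha256sum available. The upload itself is left to whatever tool your storage provider offers.

```shell
#!/bin/sh
# Hypothetical helper (not part of kubeadm or etcd): pack the backup folder
# into a timestamped archive and write a SHA-256 checksum next to it.
# Usage: archive_backup <backup-dir> <output-dir>
# Note: <output-dir> should not be inside <backup-dir>.
archive_backup() {
    src_dir=$1
    out_dir=$2
    stamp=$(date -u +%Y%m%dT%H%M%SZ)
    name="k8s-master-backup-$stamp.tar.gz"

    mkdir -p "$out_dir" || return 1
    # -C keeps the paths inside the archive relative to the backup folder
    tar -czf "$out_dir/$name" -C "$src_dir" . || return 1
    # Store the checksum with a relative file name so "sha256sum -c" works
    # from any directory the two files are copied to
    (cd "$out_dir" && sha256sum "$name" > "$name.sha256") || return 1
    # Print the archive path so the caller can hand it to an upload tool
    printf '%s\n' "$out_dir/$name"
}
```

Since the copied certificates and keys are owned by root, you will most likely have to run this as root as well. After copying the two files elsewhere, running sha256sum -c on the .sha256 file in the same folder tells you whether the archive arrived unchanged.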

etcd snapshot explanation

The second command needs a bit more explaining. First of all, the idea is to create a snapshot of the etcd database. This is done by communicating with the running etcd instance in Kubernetes and asking it to create a snapshot. The reason for the very long command is basically to avoid messing with etcd running in Kubernetes as much as possible. We are launching a separate container using the same Docker image that kubeadm used for setting up the cluster (k8s.gcr.io/etcd-amd64:3.2.18). But in order to communicate with the etcd pod in Kubernetes, we need to:

  • Use the host network in order to access 127.0.0.1:2379, where etcd is exposed (--network host)
  • Mount the backup folder where we want to save the snapshot (-v $(pwd)/backup:/backup)
  • Mount the folder containing the certificates needed to access etcd (-v /etc/kubernetes/pki/etcd:/etc/kubernetes/pki/etcd)
  • Specify the correct etcd API version as environment variable (--env ETCDCTL_API=3)
  • Run the actual command for creating a snapshot (etcdctl snapshot save /backup/etcd-snapshot-latest.db)
  • Pass some flags to the etcdctl command
    • Specify where to connect to (--endpoints=https://127.0.0.1:2379)
    • Specify certificates to use (--cacert=..., --cert=..., --key=...)

So we start a docker container with the etcdctl tool installed. We tell it to create a snapshot of the etcd instance running in the Kubernetes cluster and store it in a backup folder that we mount from the host.
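Once the snapshot has been written, it is worth looking at it before trusting it. etcdctl has a snapshot status subcommand that prints the hash, revision, number of keys and size of a snapshot file. Below is a small sketch that wraps it in the same container image as above. The function name snapshot_status and the DOCKER override are inventions for this example, not something kubeadm or etcd provides.

```shell
#!/bin/sh
# Hypothetical helper: show hash, revision, key count and size of a snapshot.
# Usage: snapshot_status <path-to-snapshot.db>
# Set DOCKER=echo to print the docker command instead of running it.
snapshot_status() {
    snapshot=$1
    # Refuse to continue if the file is missing or empty
    if [ ! -s "$snapshot" ]; then
        echo "snapshot missing or empty: $snapshot" >&2
        return 1
    fi
    # Docker needs an absolute path for the volume mount
    dir=$(cd "$(dirname "$snapshot")" && pwd) || return 1
    file=$(basename "$snapshot")
    # No endpoints or certificates are needed here: this command only reads
    # the snapshot file and never talks to the running etcd
    "${DOCKER:-docker}" run --rm \
        -v "$dir":/backup \
        --env ETCDCTL_API=3 \
        k8s.gcr.io/etcd-amd64:3.2.18 \
        etcdctl --write-out=table snapshot status "/backup/$file"
}
```

You would call it as snapshot_status backup/etcd-snapshot-latest.db, as root if your Docker setup requires it. A snapshot with zero keys, or a command that fails outright, is a good hint that something went wrong when the snapshot was taken.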

Restore a single master

When the time has come to restore the master, just copy everything back from the backup and initiate the master again. If you want to simulate a master failing you can for example run “kubeadm reset” for a “soft” destruction. But if you really want to make sure you can set it up from zero, you should delete the VM or format the disk. In this case you must remember to do all the prerequisites before initializing it again (e.g. install kubeadm).

The restoration may look something like this:

# Restore certificates
sudo cp -r backup/pki /etc/kubernetes/

# Restore etcd backup
sudo mkdir -p /var/lib/etcd
sudo docker run --rm \
    -v "$(pwd)/backup":/backup \
    -v /var/lib/etcd:/var/lib/etcd \
    --env ETCDCTL_API=3 \
    k8s.gcr.io/etcd-amd64:3.2.18 \
    /bin/sh -c "etcdctl snapshot restore '/backup/etcd-snapshot-latest.db' && mv /default.etcd/member/ /var/lib/etcd/"

# Restore kubeadm-config
sudo mkdir -p /etc/kubeadm
sudo cp backup/kubeadm-config.yaml /etc/kubeadm/

# Initialize the master with backup
sudo kubeadm init --ignore-preflight-errors=DirAvailable--var-lib-etcd \
    --config /etc/kubeadm/kubeadm-config.yaml

This is pretty much a reversal of the previous steps. Certificates and kubeadm configuration file are restored from the backup location simply by copying files and folders back to where they were. For etcd we restore the snapshot and then move the data to /var/lib/etcd, since that is where kubeadm will tell etcd to store its data.

Note that we have to add an extra flag to the kubeadm init command (--ignore-preflight-errors=DirAvailable--var-lib-etcd) to acknowledge that we want to use the pre-existing data.
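A restore that fails halfway through is no fun, so before you start it can help to check that the backup folder actually contains what the steps above expect. Here is a minimal sketch of such a check. The function name check_backup is made up, and the file list assumes a default kubeadm setup with a local etcd, as in the rest of this post. The kubeadm configuration file is not checked since it is optional.

```shell
#!/bin/sh
# Hypothetical pre-restore check: verify that the backup folder holds the
# files the restore steps rely on, and that none of them is empty.
# Usage: check_backup <backup-dir>
check_backup() {
    dir=$1
    status=0
    for f in pki/ca.crt pki/ca.key pki/etcd/ca.crt etcd-snapshot-latest.db; do
        # -s is true only if the file exists and has a size greater than zero
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $dir/$f" >&2
            status=1
        fi
    done
    return "$status"
}
```

A typical use would be check_backup backup && echo "backup looks complete" right before the restore commands. Keep in mind that this only checks that the files are there, not that their contents are valid.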

Automate etcd backups

Doing a single backup manually may be a good first step but you really need to make regular backups for them to be useful. In other words, let’s automate the procedure! The easiest way to do this is probably to take the commands from the example above, create a small script and a cron job that runs the script every now and then. But since we are running Kubernetes anyway, why not use a Kubernetes CronJob? This would allow you to keep track of the backup jobs inside Kubernetes just like you monitor your workloads!

For more details on how to set up the CronJob check this post from consol labs.
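If you go for the simpler script-and-cron route instead, two things quickly become necessary: snapshots need unique names so they do not overwrite each other, and old ones need to be cleaned up. The sketch below shows one way to do that. The function names, the timestamped naming scheme and the script path in the crontab comment are all assumptions made for this example.

```shell
#!/bin/sh
# Hypothetical helpers for a cron-driven backup script.

# Print a snapshot file name containing a UTC timestamp,
# e.g. to use as the target of "etcdctl snapshot save".
new_snapshot_name() {
    printf 'etcd-snapshot-%s.db\n' "$(date -u +%Y%m%dT%H%M%SZ)"
}

# Delete all but the newest <keep> timestamped snapshots in <dir>.
# Usage: prune_snapshots <dir> <keep>
prune_snapshots() {
    dir=$1
    keep=$2
    # Globs expand in sorted order and the timestamp sorts chronologically,
    # so the oldest snapshot comes first. The [0-9] keeps
    # etcd-snapshot-latest.db out of the list.
    set -- "$dir"/etcd-snapshot-[0-9]*.db
    [ -e "$1" ] || return 0    # nothing matched, nothing to prune
    while [ "$#" -gt "$keep" ]; do
        rm -f -- "$1"
        shift
    done
}

# Example crontab entry (hypothetical script path), running every hour:
#   0 * * * * /usr/local/bin/etcd-backup.sh
```

In a backup script you would save the snapshot under the name printed by new_snapshot_name and then call something like prune_snapshots backup 24 to keep roughly one day of hourly snapshots.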

About application data and resources

State is the tricky part here, as is often the case. If your workload is completely stateless, congratulations! You just have to store your YAML manifests somewhere safe and kubectl apply them wherever you want. Unfortunately, everything becomes harder if you have to deal with state. What to back up, and how, depends on how you are running Kubernetes. Heptio Ark may be a good alternative if you are using one of the supported storage providers. Otherwise, you have to investigate how to make snapshots on your chosen provider manually. Databases all have their own backup and restore procedures.

Final words of advice

Remember to test your backup solution, whatever it may be! Testing should include the full cycle: backup, destroy everything and then restore. If you never test your system this way, can you ever really trust your backup?
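One simple way to judge whether such a drill succeeded is to record what was in the cluster before you destroyed it and compare that with what is there after the restore. The sketch below assumes you have saved two plain text listings with one resource name per line, for example the output of kubectl get deployments --all-namespaces -o name taken before and after. The function name missing_after_restore is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper for a restore drill: list resources that were present
# before the drill but are missing after the restore.
# Usage: missing_after_restore <before.txt> <after.txt>
# Returns 0 if nothing is missing, 1 otherwise.
missing_after_restore() {
    before=$1
    after=$2
    work_dir=$(mktemp -d) || return 2
    # comm requires sorted input
    sort -u "$before" > "$work_dir/before"
    sort -u "$after" > "$work_dir/after"
    # -23 keeps only the lines that appear in the first file alone
    missing=$(comm -23 "$work_dir/before" "$work_dir/after")
    rm -rf "$work_dir"
    if [ -n "$missing" ]; then
        printf '%s\n' "$missing"
        return 1
    fi
    return 0
}
```

This only compares names, so it will not notice a resource that came back with the wrong contents. It is still a cheap first check that nothing disappeared entirely.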
