Backups are important. How many times have you read that you should always back up before doing this or that? How many times did the documentation actually explain how and what you should back up? And if it did, did it also explain how to restore the backup if needed? Unfortunately, backup and restore documentation is often lacking, even for serious projects like Kubernetes. Read on to learn more about backups in Kubernetes: how to do them, what to include and why.
If you search for “backup Kubernetes” on Google you will find quite a lot of different solutions. Unfortunately, many of them do not explain the big picture. There are pages pointing out that you should do backups, some references to solutions like Velero, and descriptions of how to back up etcd. But it is hard to find anything that puts the pieces together. When I first wrote this post I referred to this issue about how to do Kubernetes backups and migrations from 2016. It was closed in the summer of 2020 without a fix due to the complexity of the topic.
Why back up Kubernetes?
This may seem like a silly question, but it is quite important to know what the backup is for before you decide how to do it. Some may even wonder if backups are needed at all in Kubernetes. Isn’t it all about stateless applications that you can easily redeploy on any other cluster? Well, state is still quite useful or even necessary. Besides, avoiding downtime (due to migration) still matters for stateless applications. There are essentially two reasons for backing up:
- To be able to restore a failed control plane Node.
- To be able to restore applications (with data).
As you may know, the workload will happily keep running even if the control plane goes down. That is, unless the workload needs to talk to the API, of course. But this isn’t very helpful unless you are able to restore the control plane later. In other words, you need backups so that you can restore a failed control plane Node; otherwise you will be forced to migrate to a new cluster when it fails. This is of course especially important if you run a cluster with just a single control plane Node.
The second point in the list is relevant for restoring/migrating the workload to a new cluster or restoring a single failed application. This requires backups of all the resources in the cluster, along with any state stored in persistent volumes. Note that there is a difference here in that these resources should be completely cluster agnostic. In the previous case the backup was heavily tied to a specific cluster exactly because it was supposed to restore that same cluster. But here we are talking about only the workload, which should be able to run on any (similar) cluster.
We will focus mostly on the first point in this post: backing up and restoring a control plane Node.
Why not to back up Kubernetes?
You may notice that we didn’t mention backing up worker Nodes in the previous section. This is because workers should be interchangeable in Kubernetes. I.e. it should not matter what Node a Pod is running on. As long as there are sufficient resources left in the cluster, you should be able to take down/replace a worker without affecting the workload. Some Pods may have to be evicted and rescheduled of course, but if you build your applications correctly this should not be a problem.
If you find yourself needing backups of worker Nodes (for example because you are using local storage on the Node), you should really consider changing the way you deploy your applications instead. Otherwise you are not really taking advantage of what Kubernetes has to offer.
How to back up Kubernetes
The two reasons for backing up Kubernetes give us (at least) two different backup strategies: one for etcd and the relevant certificates in order to restore the control plane, and one for the applications running in the cluster. It’s time to take a look at how it can be done!
The documentation on etcd for Kubernetes is quite good on a general level. But as a consequence, etcd is treated like a separate component with few connections to the Kubernetes world. This makes it hard to apply the knowledge. It’s simply unclear what an etcd snapshot has to do with your applications running in the Kubernetes cluster. Furthermore, there is no information about what else you need to back up. So let’s take a look at what’s needed and how to do it.
Back up a single control plane Node
As mentioned previously, we need to back up etcd. In addition to that, we need the certificates and, optionally, the kubeadm configuration file for easily restoring the master. If you set up your cluster using kubeadm (with no special configuration), you can do it with something like this:
    # Backup certificates
    sudo cp -r /etc/kubernetes/pki backup/
    # Make etcd snapshot
    sudo docker run --rm -v $(pwd)/backup:/backup \
        --network host \
        -v /etc/kubernetes/pki/etcd:/etc/kubernetes/pki/etcd \
        --env ETCDCTL_API=3 \
        k8s.gcr.io/etcd:3.4.3-0 \
        etcdctl --endpoints=https://127.0.0.1:2379 \
        --cacert=/etc/kubernetes/pki/etcd/ca.crt \
        --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
        --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
        snapshot save /backup/etcd-snapshot-latest.db
    # Backup kubeadm-config
    sudo cp /etc/kubeadm/kubeadm-config.yaml backup/
Note that the contents of the backup folder should then be stored somewhere safe, where it can survive if the control plane is completely destroyed. You perhaps want to use e.g. AWS S3 (or similar) for this.
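As a rough illustration of getting the backup folder off the Node, the sketch below packs it into a timestamped archive and copies it to S3. It assumes the AWS CLI is installed and configured; the function names, the archive naming scheme and the bucket name are hypothetical and not part of the setup described above.

```shell
#!/usr/bin/env bash
# Sketch only: function names, bucket and naming scheme are illustrative.

# archive_backup SRC_DIR DEST_DIR
# Creates DEST_DIR/k8s-backup-<UTC timestamp>.tar.gz and prints its path.
archive_backup() {
    local src="$1" dest="$2" stamp archive
    stamp="$(date -u +%Y%m%dT%H%M%SZ)"
    archive="${dest}/k8s-backup-${stamp}.tar.gz"
    mkdir -p "$dest" || return 1
    # -C keeps the paths inside the archive relative to the backup folder.
    tar -czf "$archive" -C "$src" . || return 1
    printf '%s\n' "$archive"
}

# upload_backup ARCHIVE BUCKET
# Copies the archive to S3 (assumes a configured AWS CLI).
upload_backup() {
    aws s3 cp "$1" "s3://$2/$(basename "$1")"
}

# Usage, run as root because the copied certificates are root-owned
# (the bucket name is a placeholder):
#   archive="$(archive_backup ./backup /var/tmp)" \
#       && upload_backup "$archive" my-backup-bucket
```

Keep in mind that the archive contains the cluster's private keys, so wherever it ends up should be access-restricted and preferably encrypted.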
So what is really going on here? There are three commands in the example and all of them should be run on the control plane Node. The first one copies the folder containing all the certificates that kubeadm creates. These certificates are used for secure communications between the various components in a Kubernetes cluster. The final command is optional and only relevant if you use a configuration file for kubeadm. Storing this file makes it easy to initialize the master with the exact same configuration as before when restoring it.
etcd snapshot explanation
The second command needs a bit more explaining. First of all, the idea is to create a snapshot of the etcd database. This is done by communicating with the running etcd instance in Kubernetes and asking it to create a snapshot. The reason for the very long command is basically to avoid messing with etcd running in Kubernetes as much as possible. We are launching a separate container using the same docker image that kubeadm used for setting up the cluster (k8s.gcr.io/etcd:3.4.3-0). But in order to communicate with the etcd pod in Kubernetes, we need to:
- Use the host network in order to access 127.0.0.1:2379, where etcd is exposed (--network host)
- Mount the backup folder where we want to save the snapshot (-v $(pwd)/backup:/backup)
- Mount the folder containing the certificates needed to access etcd (-v /etc/kubernetes/pki/etcd:/etc/kubernetes/pki/etcd)
- Specify the correct etcd API version as environment variable (--env ETCDCTL_API=3)
- The actual command for creating a snapshot (etcdctl snapshot save /backup/etcd-snapshot-latest.db)
- Some flags for the etcdctl command
  - Specify where to connect to (--endpoints=https://127.0.0.1:2379)
  - Specify certificates to use (--cacert=..., --cert=..., --key=...)
So we start a docker container with the etcdctl tool installed. We tell it to create a snapshot of the etcd instance running in the Kubernetes cluster and store it in a backup folder that we mount from the host.
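Once the snapshot has been written, it is worth checking that the file is actually usable. The sketch below is one possible sanity check, not something the original procedure requires: the function names are hypothetical, and it assumes the same etcd image and snapshot file name as the backup command. `etcdctl snapshot status` prints the hash, revision, key count and size of a snapshot file.

```shell
#!/usr/bin/env bash
# Sketch only: function names are illustrative.

# snapshot_looks_ok FILE MAX_AGE_MINUTES
# Succeeds if FILE exists, is non-empty, and was modified within the
# last MAX_AGE_MINUTES minutes.
snapshot_looks_ok() {
    local file="$1" max_age="$2"
    [ -s "$file" ] || return 1
    # find prints the path only if the modification time is recent enough.
    [ -n "$(find "$file" -mmin "-${max_age}" -print)" ]
}

# snapshot_status BACKUP_DIR
# Runs etcdctl in a throwaway container to inspect the snapshot.
# BACKUP_DIR must be an absolute path, since it is used as a bind mount.
snapshot_status() {
    sudo docker run --rm -v "$1":/backup \
        --env ETCDCTL_API=3 \
        k8s.gcr.io/etcd:3.4.3-0 \
        etcdctl snapshot status /backup/etcd-snapshot-latest.db --write-out=table
}

# Usage:
#   snapshot_looks_ok backup/etcd-snapshot-latest.db 10 \
#       && snapshot_status "$(pwd)/backup"
```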
Restore a single control plane Node
When the time has come to restore the control plane, just copy everything back from the backup and initiate the control plane again. If you want to simulate a control plane Node failing you can for example run “kubeadm reset” for a “soft” destruction. But if you really want to make sure you can set it up from zero, you should delete the VM or format the disk. In this case you must remember to do all the prerequisites before initializing it again (e.g. install kubeadm).
The restoration may look something like this:
    # Restore certificates
    sudo cp -r backup/pki /etc/kubernetes/
    # Restore etcd backup (&& skips the move if the restore fails)
    sudo mkdir -p /var/lib/etcd
    sudo docker run --rm \
        -v $(pwd)/backup:/backup \
        -v /var/lib/etcd:/var/lib/etcd \
        --env ETCDCTL_API=3 \
        k8s.gcr.io/etcd:3.4.3-0 \
        /bin/sh -c "etcdctl snapshot restore '/backup/etcd-snapshot-latest.db' && mv /default.etcd/member/ /var/lib/etcd/"
    # Restore kubeadm-config
    sudo mkdir -p /etc/kubeadm
    sudo cp backup/kubeadm-config.yaml /etc/kubeadm/
    # Initialize the master with backup
    sudo kubeadm init --ignore-preflight-errors=DirAvailable--var-lib-etcd \
        --config /etc/kubeadm/kubeadm-config.yaml
This is pretty much a reversal of the previous steps. Certificates and the kubeadm configuration file are restored from the backup location simply by copying files and folders back to where they were. For etcd we restore the snapshot and then move the data to /var/lib/etcd, since that is where kubeadm will tell etcd to store its data.
Note that we have to add an extra flag to the kubeadm init command (--ignore-preflight-errors=DirAvailable--var-lib-etcd) to acknowledge that we want to use the pre-existing data.
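After kubeadm init returns, it can take a little while before the API server answers and the restored objects show up. A small retry helper is one way to wait for that; the helper below is a generic sketch (the name `retry` and the chosen attempt counts are illustrative), and the kubeconfig path assumes the kubeadm default of /etc/kubernetes/admin.conf.

```shell
#!/usr/bin/env bash
# Sketch only: helper name and retry counts are illustrative.

# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted.
retry() {
    local attempts="$1" delay="$2" i=1
    shift 2
    while [ "$i" -le "$attempts" ]; do
        "$@" && return 0
        i=$((i + 1))
        # Do not sleep after the final failed attempt.
        [ "$i" -le "$attempts" ] && sleep "$delay"
    done
    return 1
}

# Usage on the restored control plane Node:
#   retry 30 10 sudo kubectl --kubeconfig /etc/kubernetes/admin.conf get nodes
#   sudo kubectl --kubeconfig /etc/kubernetes/admin.conf get pods --all-namespaces
```

If the Nodes and Pods listed match what the cluster contained when the snapshot was taken, the etcd data was picked up as intended.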
Automate etcd backups
Doing a single backup manually may be a good first step but you really need to make regular backups for them to be useful. In other words, let’s automate the procedure! The easiest way to do this is probably to take the commands from the example above, create a small script and a cron job that runs the script every now and then. But since we are running Kubernetes anyway, why not use a Kubernetes CronJob? This would allow you to keep track of the backup jobs inside Kubernetes just like you monitor your workloads!
For more details on how to set up the CronJob check this post from consol labs.
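If you go the plain script-and-cron route, the script usually needs two additions over the manual commands: a unique name per snapshot and some form of rotation so old snapshots do not fill the disk. The following is a minimal sketch of those two pieces. The function names, the timestamp format, the script path and the schedule are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: names, timestamp format and schedule are illustrative.

# snapshot_name
# Prints a snapshot file name with a sortable UTC timestamp.
snapshot_name() {
    printf 'etcd-snapshot-%s.db\n' "$(date -u +%Y%m%dT%H%M%SZ)"
}

# prune_snapshots DIR KEEP
# Deletes all but the KEEP newest timestamped snapshots in DIR.
# Files that do not start with a digit after the prefix (for example
# etcd-snapshot-latest.db) are left alone.
prune_snapshots() {
    local dir="$1" keep="$2" f
    for f in "$dir"/etcd-snapshot-[0-9]*.db; do
        [ -e "$f" ] && printf '%s\n' "$f"
    done | sort -r | tail -n +"$((keep + 1))" | while IFS= read -r f; do
        rm -f -- "$f"
    done
}

# In the backup script, pass "/backup/$(snapshot_name)" to
# `etcdctl snapshot save` instead of the fixed file name, then call:
#   prune_snapshots "$(pwd)/backup" 14
#
# Example crontab entry running a hypothetical script every six hours:
#   0 */6 * * * /usr/local/bin/etcd-backup.sh
```

Because the timestamp sorts lexicographically, the newest snapshots can be found by file name alone, without relying on modification times.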
About application data and resources
State is the tricky part here, as is often the case. If your workload is completely stateless, congratulations! You just have to store your YAML manifests somewhere safe and kubectl apply them wherever you want. Unfortunately, everything becomes harder if you have to deal with state. What to back up, and how, depends on how you are running Kubernetes. Velero may be a good alternative if you are using one of the supported storage providers. Otherwise, you have to investigate how to make snapshots on your chosen provider manually. Databases all have their own backup and restore procedures.
Final words of advice
Remember to test your backup solution, whatever it may be! Testing should include the full cycle: backup, destroy everything and then restore. If you never test your system this way, can you ever really trust your backup?