Elastisys operates security-hardened Kubernetes platforms on EU cloud infrastructure. Doing so has taught us many lessons, and we are giving them away to you, so you can learn how to operate a Kubernetes platform yourself.
This is the fourth blog post in a series on operating a secure Kubernetes platform. The whole series is found via this tag on our blog. The topic of this post is maintenance.
In the context of Kubernetes cluster management, maintenance can refer to upgrading software, cleaning out old files from worker Nodes, and a slew of other topics. In this blog post, we focus mainly on all things related to software upgrades, since they are so important for the overall security of your cluster.
Read the entire blog post to see why continuous maintenance via software upgrades is important, how it’s applied in practice, and how it fits into the greater context.
Ask anyone and they will tell you that maintenance is a complete waste of time … until a bridge collapses. You might think that platform engineering fares better than civil engineering, and you would be wrong. Datadog’s Container Report 2020 found that the most popular Kubernetes version at that time had already reached End-of-Life (EOL). In other words, not only were those Kubernetes clusters likely vulnerable, they didn’t even receive security patches anymore.
Unfortunately, this issue is all too prevalent among organizations running Kubernetes. Even if you are not pressured by tight data protection regulations, your customers' data must be protected by process and not luck.
So let us look closer at how to do maintenance.
In the context of this guide, maintenance refers to applying security patches and updates. Let us zoom in a bit on these concepts.
At the end of the day, your Kubernetes platform is a dependency in a larger software system. To avoid "dependency hell" and teams stepping on each other’s toes, Semantic Versioning argues for the following approach. First, you need to declare a scope. Then, depending on how your platform changes towards your application developers, your team distinguishes between:
- Security patches, which are the minimal changes to a piece of software required to address a vulnerability.
- Minor updates that add functionality in a backwards-compatible manner, i.e., application developers won’t notice.
- Major updates that add functionality in a potentially backwards-incompatible way, i.e., application developers will notice.
At the very least, you must apply security patches, which are small and low-risk. However, your Kubernetes version will eventually reach End-of-Life (EOL), after which it will no longer receive security patches. Hence, you should plan for minor and major updates as well.
Maintenance can be done "just in time" (e.g., when a security patch is available) or periodically. We recommend periodic maintenance, with one exception: critical security patches should be applied immediately. By "critical" it is usually understood that a working exploit is known to exist and can be used to compromise platform security. (Generally, such "drop everything" situations should be avoided by investing in defense in depth.) Setting a monthly maintenance window is beneficial for a few reasons:
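To make the three categories concrete, here is a minimal sketch that classifies a version change the way Semantic Versioning does. The function names are ours, for illustration only. The `supported_minors=3` default is an assumption modelled on the upstream Kubernetes practice of maintaining the most recent three minor releases; check the policy of your own distribution before relying on it.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Parse '1.27.3' or 'v1.27.3' into (major, minor, patch)."""
    major, minor, patch = version.lstrip("v").split(".")
    return int(major), int(minor), int(patch)


def classify_update(current: str, target: str) -> str:
    """Return 'major', 'minor', 'patch' or 'none' for a version change."""
    cur, tgt = parse_version(current), parse_version(target)
    if tgt < cur:
        raise ValueError(f"{target} is older than {current}")
    if tgt[0] != cur[0]:
        return "major"
    if tgt[1] != cur[1]:
        return "minor"
    if tgt[2] != cur[2]:
        return "patch"
    return "none"


def still_patched(running: str, latest: str, supported_minors: int = 3) -> bool:
    """Assumption: only the newest `supported_minors` minor releases get patches."""
    run, new = parse_version(running), parse_version(latest)
    return run[0] == new[0] and 0 <= new[1] - run[1] < supported_minors
```

For example, going from 1.27.3 to 1.27.5 is classified as a patch, while a cluster on 1.25 when 1.29 is the latest release would be flagged as no longer patched under the assumed three-release window.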
- It creates good habits and avoids the platform team having to take a decision every time. ("Should we do maintenance next week?")
- "Consistency reduces complexity". This quote is used, among others, by the agile manifesto to argue for having a daily stand-up at exactly the same time and location, with exactly the same format. Maintenance is already complex; there is no need to add a layer of complexity by changing when it occurs.
- It avoids bulking together a large number of changes or migration steps, which increases downtime risk.
- It avoids being surprised by End-of-Life (EOL).
- It makes maintenance easy to communicate and agree with external stakeholders, in particular the application team.
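A fixed monthly window is easy to compute, which is part of why it is easy to communicate. The sketch below assumes a hypothetical policy of "the second Wednesday of every month"; the weekday and ordinal are illustrative defaults, not a recommendation from our side.

```python
import datetime as dt


def nth_weekday(year: int, month: int, weekday: int, n: int) -> dt.date:
    """Date of the n-th given weekday (Mon=0 .. Sun=6) in a month."""
    first = dt.date(year, month, 1)
    offset = (weekday - first.weekday()) % 7
    return first + dt.timedelta(days=offset + 7 * (n - 1))


def next_window(today: dt.date, weekday: int = 2, n: int = 2) -> dt.date:
    """Next maintenance day on or after `today` (default: 2nd Wednesday)."""
    candidate = nth_weekday(today.year, today.month, weekday, n)
    if candidate < today:
        # This month's window has passed, so move to next month.
        if today.month == 12:
            year, month = today.year + 1, 1
        else:
            year, month = today.year, today.month + 1
        candidate = nth_weekday(year, month, weekday, n)
    return candidate
```

Because the rule never changes, nobody has to decide anything: the platform team and the application team can both derive the next twelve maintenance dates on their own.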
How to Perform Maintenance?
Search the Internet for "Kubernetes upgrades" and you will likely bump into the latest incarnation of GitOps. And while automation is an important tool to reduce the maintenance burden, we found that the challenges are often around the maintenance and not with the maintenance itself.
First, make sure you agree on a maintenance window with external stakeholders, in particular the application team. Ideally, the maintenance window should be large enough to make room for complex maintenance without needing renegotiation. Again, "consistency reduces complexity".
For major changes, make sure to inform the application team well in advance, at least 1 month, but perhaps even 2 months ahead. If budgets allow and avoiding downtime is a must, provide the application team with two Kubernetes platform environments: staging and production. These should receive maintenance on a skewed maintenance schedule. Ideally, you should give the application team enough time to check their application in an updated staging environment, before updating the production environment.
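The skewed schedule and the notice period can be written down as simple date arithmetic. This is a sketch under stated assumptions: the 14-day skew between staging and production, the 7-day notice for non-major changes, and counting the notice period back from the staging date are all hypothetical choices of ours. Only the "at least 1 month" notice for major changes comes from the advice above.

```python
import datetime as dt
from dataclasses import dataclass


@dataclass(frozen=True)
class RolloutPlan:
    announce_by: dt.date  # latest day to inform the application team
    staging: dt.date      # staging environment is updated
    production: dt.date   # production environment is updated


def plan_rollout(production: dt.date, major: bool, skew_days: int = 14) -> RolloutPlan:
    """Work backwards from the production window to staging and announcement."""
    # Assumption: 30 days of notice for major changes, 7 days otherwise.
    notice_days = 30 if major else 7
    staging = production - dt.timedelta(days=skew_days)
    # Assumption: notice is counted from the first environment that changes.
    announce_by = staging - dt.timedelta(days=notice_days)
    return RolloutPlan(announce_by, staging, production)
```

Working backwards from the production date makes the cost of a major change visible early: with these numbers, a production update in mid-March has to be announced by the end of January.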
At first, maintenance will be a very manual process, so it needs to be properly prepared. Make sure you decide what to do during the maintenance window:
- Should you update Kubernetes? The Operating System (OS) base image? System Pods? All?
- Consider starting with fewer things to update until you realize that you are underutilizing the maintenance window.
- Make sure to have a contingency plan. What will you do in case something goes wrong during maintenance?
- After each maintenance, make sure to hold a retrospective and feed improvements into your quality assurance process.
Hopefully, your platform team will find that maintenance becomes "boring" after some time. Then, it’s time to automate it. Here is how:
- For automating OS updates, we recommend the unattended-upgrades package and kured if you want or have to update Nodes in-place, or Cluster API if you’d rather replace Nodes. Whatever solution you choose here, make sure it safely drains Nodes to reduce application downtime.
- For Kubernetes updates, we recommend using Cluster API.
- For system Pods, e.g., fluentd, we recommend Tekton. The reason why we chose Tekton deserves a post in itself. For now, let’s just say that, although there are plenty of other solutions, we found Tekton to be the most suitable for our needs.
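Before handing Node updates over to a tool, it helps to see what "safely drains Nodes" amounts to. The sketch below is not how kured or Cluster API work internally; it only wraps the standard `kubectl drain` and `kubectl uncordon` commands around a site-specific update step. The function names, the 300-second timeout, and the `update` callback are hypothetical.

```python
import subprocess
from typing import Callable, List


def drain_command(node: str, timeout_seconds: int = 300) -> List[str]:
    """kubectl invocation that cordons the Node and evicts its Pods."""
    return [
        "kubectl", "drain", node,
        "--ignore-daemonsets",       # DaemonSet Pods cannot be evicted away
        "--delete-emptydir-data",    # accept losing emptyDir scratch data
        f"--timeout={timeout_seconds}s",
    ]


def uncordon_command(node: str) -> List[str]:
    """kubectl invocation that makes the Node schedulable again."""
    return ["kubectl", "uncordon", node]


def update_node_in_place(node: str, update: Callable[[str], None],
                         run=subprocess.run) -> None:
    """Drain, update, then uncordon one Node.

    If draining or the update raises, the Node stays cordoned so that
    someone can inspect it before workloads return.
    """
    run(drain_command(node), check=True)
    update(node)  # hypothetical, site-specific step: patch and reboot
    run(uncordon_command(node), check=True)
```

Passing `run` as a parameter keeps the sketch testable without a cluster. In real use the default `subprocess.run` executes the commands and raises if draining fails or times out.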
Make sure not to confuse platform Continuous Delivery (CD) with application CD. The former implies updating functionality offered by the Kubernetes platform itself and might include changing the Kubernetes API via CustomResourceDefinitions (CRDs), Webhooks, etc. The latter should ideally only consist of deploying a handful of namespaced resources, like Deployments, Services and Ingresses. Depending on how you set the scope of the Kubernetes platform, the application CD might be maintained by the platform team, but configured by the application team. All in all, a solution which fulfills application CD requirements might be suboptimal for platform CD. Hence, you should seriously consider choosing one solution for platform CD and another for application CD, even though you might assess that a single solution sufficiently fits both needs.
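One way to keep the two pipelines apart is to check what each one is trying to deploy. The sketch below flags resources in an application release that belong to platform CD. The allow-list is an assumption taken from the three kinds named above; a real platform would agree on its own list with the application team.

```python
from typing import Iterable, List

# Assumption: the only kinds an application pipeline is expected to deploy.
APPLICATION_KINDS = {"Deployment", "Service", "Ingress"}


def platform_only_resources(manifests: Iterable[dict]) -> List[str]:
    """Return 'Kind/name' for every resource outside the application allow-list."""
    flagged = []
    for manifest in manifests:
        kind = manifest.get("kind", "<unknown>")
        name = manifest.get("metadata", {}).get("name", "<unnamed>")
        if kind not in APPLICATION_KINDS:
            flagged.append(f"{kind}/{name}")
    return flagged
```

A CustomResourceDefinition or a webhook configuration showing up in this list is a hint that the change changes the Kubernetes API itself and should go through platform CD instead.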
Maintenance Issues We Encountered
Let us share with you some of the issues we encountered during maintenance.
- Insufficient replicas: An application may be configured with a single replica, which causes downtime when the Node hosting the single application Pod is drained.
- Lack of Pod Topology Spread: An application may be configured with two replicas, but the Kubernetes scheduler placed both Pods on the same Node. Needless to say, this causes a short downtime while the Node hosting both application Pods is drained.
- Draining takes too long: This is a symptom of an application that does not handle SIGTERM properly and needs to be killed non-gracefully. By default, this adds at least 30 seconds per Node. Unfortunately, we discovered Helm Charts which set terminationGracePeriodSeconds as high as 120 seconds which, multiplied by the number of Nodes that need to be drained, greatly impacts the duration of the maintenance.
- Draining is not possible: This is often the case with badly configured PodDisruptionBudgets.
- Insufficient capacity in the cluster: If you update Nodes in-place, you must drain them, which temporarily reduces cluster capacity. Make sure you have extra capacity in the cluster to tolerate a Node leaving.
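Two of the issues above come down to arithmetic that is worth doing before the maintenance window. The sketch below estimates the delay added by Pods that exhaust their grace period, and checks whether the cluster can lose its largest Node. It assumes Nodes are drained one at a time and that capacity is compared on CPU requests only; both function names are ours.

```python
from typing import List


def grace_period_overhead_seconds(node_count: int,
                                  grace_period_seconds: int = 30) -> int:
    """Delay added if, on every Node, some Pod uses its full grace period.

    Assumes Nodes are drained one after the other.
    """
    return node_count * grace_period_seconds


def can_lose_a_node(node_cpu: List[float], requested_cpu: float) -> bool:
    """True if all CPU requests still fit after the largest Node is drained."""
    return sum(node_cpu) - max(node_cpu) >= requested_cpu
```

With 20 Nodes, the default 30 seconds adds 10 minutes to the maintenance, while a 120-second grace period adds 40 minutes. A real capacity check would also look at memory and at per-Node fragmentation, which this sketch ignores.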
To avoid the application team and/or the platform team getting surprised by such maintenance issues, we recommend going through a go-live checklist before going to production.
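Parts of such a checklist can be automated. The following sketch inspects a Deployment manifest (as a parsed dictionary) and an optional PodDisruptionBudget for the issues listed above. It is an illustration, not our actual go-live checklist: the finding texts and the 30-second threshold are hypothetical, and percentage values in the PodDisruptionBudget are only recognised in their two extreme forms.

```python
from typing import List, Optional

MAX_GRACE_SECONDS = 30  # hypothetical threshold; 30 is also the Kubernetes default


def pdb_blocks_drain(pdb: dict, replicas: int) -> bool:
    """True if the PodDisruptionBudget never allows a Pod to be evicted."""
    spec = pdb.get("spec", {})
    max_unavailable = spec.get("maxUnavailable")
    min_available = spec.get("minAvailable")
    if max_unavailable in (0, "0%") or min_available == "100%":
        return True
    # Other percentage strings are not evaluated in this sketch.
    return isinstance(min_available, int) and min_available >= replicas


def go_live_findings(deployment: dict, pdb: Optional[dict] = None) -> List[str]:
    """Return maintenance-related findings for one Deployment."""
    findings = []
    spec = deployment.get("spec", {})
    pod_spec = spec.get("template", {}).get("spec", {})
    replicas = spec.get("replicas", 1)  # Kubernetes defaults to 1 replica
    if replicas < 2:
        findings.append("insufficient replicas")
    # Pod anti-affinity could serve the same purpose; not checked here.
    if not pod_spec.get("topologySpreadConstraints"):
        findings.append("no topology spread constraints")
    if pod_spec.get("terminationGracePeriodSeconds", 30) > MAX_GRACE_SECONDS:
        findings.append("long termination grace period")
    if pdb is not None and pdb_blocks_drain(pdb, replicas):
        findings.append("PodDisruptionBudget blocks draining")
    return findings
```

An empty list means the Deployment passed these particular checks, not that it is ready for production. Whether an application actually handles SIGTERM, for instance, can only be verified by draining a Node in staging.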