The sad state of stateful Pods in Kubernetes

Kubernetes offers great cloud infrastructure abstractions and orchestration features for cloud-native applications. But how well does it deal with stateful Pods, such as databases?

A little bit of history

Kubernetes was initially targeted squarely at running stateless applications. That is not opinion; it is a fact. Support for anything resembling stateful Pods only arrived with StatefulSet, introduced in beta in version 1.5 (as a rename of the PetSet alpha from 1.3), and it lost its beta status only in version 1.9. At the time of writing, 1.11 is the latest version. Now that StatefulSet is regarded as stable, can you trust it?

Regular Pods in Kubernetes are ephemeral: they can be terminated whenever Kubernetes feels like it. The scheduler wants to optimize placement? Terminate some Pods. A node runs short of resources and the lower-QoS Pods have to go? Why not terminate some Pods! And so forth. Pods are assumed to be stateless, so this should not be a problem. Of course, it is still annoying when a Pod is terminated. To get a replacement up and running automatically, you typically do not deploy a bare Pod by itself. Instead, you use a Deployment, which delegates to a ReplicaSet to keep a certain number of replicas alive. Think of them as autoscaling groups, if you will.
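The idea behind a ReplicaSet is a simple control loop: compare the desired replica count with the Pods actually running, then create or delete Pods to close the gap. A minimal sketch of that loop (illustrative only, not the real Kubernetes API):

```python
# Toy reconciliation loop in the style of a ReplicaSet controller.
# Pod names here are made up for illustration.

def reconcile(desired, running):
    """Return (pods_to_create, pods_to_delete) to reach the desired count."""
    if len(running) < desired:
        missing = desired - len(running)
        return ["replica-%d" % i for i in range(missing)], []
    return [], running[desired:]

# One replica out of three was terminated: the loop schedules a replacement.
create, delete = reconcile(3, ["replica-a", "replica-b"])
print(create, delete)  # → ['replica-0'] []
```

This loop runs continuously, which is why a terminated Pod is replaced without anyone doing anything.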

Persistent Volumes

But stateful Pods cannot simply be restarted somewhere else: the new replica has none of the old state, making it useless. Kubernetes addresses this by letting Pods request a Persistent Volume for data storage. Such a volume is attached to a Pod through a matching Persistent Volume Claim. Should the Pod be terminated, Kubernetes re-attaches the Persistent Volume to the replacement Pod when it starts. This works because the replica inherits the Persistent Volume Claim, and Kubernetes figures out the cloud API details of making it all happen. Under the hood, it will use e.g. EBS volumes and attach them to the node where the Pod runs. Problem solved, right?
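The key property is that the claim, not the Pod, owns the binding to the underlying volume. A toy model of that relationship (the class and names are illustrative, not Kubernetes objects):

```python
# Toy model of the Persistent Volume Claim idea: the claim stays bound to
# one underlying volume, and any Pod that inherits the claim gets it back.

class Claim:
    def __init__(self, name, volume_id):
        self.name = name
        self.volume_id = volume_id  # bound once; survives Pod restarts

def attach(pod_name, claim):
    """Attach the claim's bound volume to the node running the given Pod."""
    return {"pod": pod_name, "volume": claim.volume_id}

data = Claim("pg-data", "ebs-vol-123")
before = attach("db-0", data)   # original Pod
after = attach("db-0", data)    # replacement Pod inheriting the same claim
print(before["volume"] == after["volume"])  # → True
```

The replacement Pod sees the same data because it sees the same volume.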

It all sounds well and good, but there are many catches. What if data was actively being written when the Pod was terminated? What if the terminated Pod's node was the last one in its availability zone, so no other node can attach the EBS volume (they are AZ-specific)? Sometimes the claim does not get inherited properly. And then, what if…

StatefulSet for deploying stateful Pods

StatefulSet is the abstraction that was supposed to solve all these issues. It gives each Pod in the set a stable, unique identity, and the StatefulSet documentation says the following about the intended use:

StatefulSets are valuable for applications that require one or more of the following.

  • Stable, unique network identifiers.
  • Stable, persistent storage.
  • Ordered, graceful deployment and scaling.
  • Ordered, graceful deletion and termination.
  • Ordered, automated rolling updates.

Now we’re talking! Stable network and storage! Even ordering guarantees for how the Pods are manipulated, so a clustered database should not fall over when an upgrade is rolled out. That should be all you need, right?

Well, no.

The problem is that StatefulSet does not understand anything about what is going on inside the stateful Pods. It is an abstraction layer, and abstractions, by definition, hide details rather than deal with them.
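The ordering guarantees from the list above are at least concrete: a StatefulSet named `db` creates Pods `db-0`, `db-1`, … and scales up in order and down in reverse. A rough sketch of that rule (illustrative, not the real controller):

```python
# Sketch of StatefulSet ordinal naming and ordered scaling.

def scale_order(name, current, desired):
    """Pod names a StatefulSet starts (in order) or stops (in reverse)."""
    if desired > current:
        return ["%s-%d" % (name, i) for i in range(current, desired)]
    return ["%s-%d" % (name, i) for i in range(current - 1, desired - 1, -1)]

print(scale_order("db", 1, 3))  # → ['db-1', 'db-2']
print(scale_order("db", 3, 1))  # → ['db-2', 'db-1']
```

Each Pod keeps its name (and its claim) across restarts; that is the "stable, unique network identifier" promise. What the controller cannot know is whether `db-1` is currently a replication master that should not be stopped.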

Experience running stateful Pods

Sadly, many have experienced that StatefulSet does not fix all problems. It is a useful building block, for sure. But the fact is that many database systems (the primary example of stateful components) cannot work reliably if they can be terminated at will. Especially without prior warning.

Example

Consider, for example, a rolling upgrade of the nodes in the Kubernetes cluster. This needs to be done from time to time for security reasons. Say that we have a replicated master-slave database, such as PostgreSQL, running in a StatefulSet. The node hosting the database master is upgraded and reboots. The master was busy processing transactions. Those are likely lost: some may have been replicated correctly to the slaves, some may not have been.

The loss of the master will trigger the election of a new master among the slaves. Note that the only way to offer the stability cited above is to re-create the old master Pod with its old identity and storage. Because of that, the old master rejoins a cluster that has since elected some other Pod as the new master. And as soon as the rebooted node reports itself ready, Kubernetes immediately starts rolling the upgrade out to the next node. Perhaps it hits the new master next. What is the state of the database cluster now? Which transactions have been committed? How happy was the database cluster about being disturbed this way?

This is not a made-up example, by the way. If your heart started racing and your palms got a little sweaty reading it, you know how true it is.

Objection!

But wait: would a rolling OS upgrade not cause issues for a manually managed database cluster, too? Yes. But it would be carried out by domain experts under much more controlled conditions, rather than just following Kubernetes’ basic view of when a node is ready again. Can we package and automate that domain expertise, somehow?

Operators to the rescue?

The CoreOS team (now part of Red Hat) developed the concept of Kubernetes Operators. An Operator implements common operational tasks in code. These tasks run either manually, when an API is invoked, or automatically, when required or on a schedule. Such tasks could be “back up the database” or “create a new read replica”. As such, Operators can reduce the administrative burden even for complex systems.
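In code, an Operator is just another reconciliation loop, pointed at a custom resource instead of the built-in objects. A toy sketch of the pattern (the spec and status fields are made up for illustration):

```python
# Toy Operator reconcile function: compare the declared spec of a database
# custom resource with its observed status and decide which tasks to run.

def reconcile_database(spec, status):
    """Return the operational tasks an Operator would run right now."""
    actions = []
    if status.get("read_replicas", 0) < spec.get("read_replicas", 0):
        actions.append("create a new read replica")  # the easy kind of task
    if spec.get("backup") and not status.get("last_backup"):
        actions.append("back up database")
    return actions

print(reconcile_database({"read_replicas": 2, "backup": True},
                         {"read_replicas": 1}))
# → ['create a new read replica', 'back up database']
```

The hard part is not this loop; it is everything a real Operator must know to carry out each action safely.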

However, as we all know, automating the easy tasks is easy; automating the hard ones is hard. Adding a read replica may be simple, but repairing a database’s write-ahead log after a failing file system corrupted it is not. The engineering effort that goes into Operators is therefore considerable. The etcd Operator is one of the most mature ones, and it currently weighs in at about 9,000 lines of code. And counting.

Sadly, it is unlikely that any Kubernetes Operator can cover all operational aspects of even a single complex stateful data store. They definitely make certain tasks easier. But if they could cover all the error cases and recover automatically, why would that functionality not already be in the code of the stateful data store to begin with?

Reasons for the sad state of affairs

There are two main reasons why the state of affairs is so sad. One is technical, the other relates to financial incentives.

  • From a technical point of view: managing stateful data is hard. There is just no way around it. Much research has been done, and approaches such as redundancy, replication, and high availability have been developed. The fact remains that computers and their surrounding systems fail. With diligence and skill, domain experts can mitigate many of these failures.
  • Regarding incentives, we should all remember that every major cloud provider sells hosted databases as a service. They have little interest in offering, for free and for anyone to use, something that competes with that value offering. Especially if it works just as well on a competitor’s infrastructure.

The battle between the cloud giants is fierce. Indeed, it can be argued that Kubernetes itself stems from Google’s attempt to dethrone AWS as the predominant cloud provider. Being able to solve the hard problem of dealing with stateful data stores is part of what makes them money. Their open source offerings are seldom pure goodwill, but rather tactical ways of moving customers away from the competition.

Summary and suggested solution

We have discussed the sad state of stateful Pods in Kubernetes. We covered the history of the relevant Kubernetes abstractions, looked at the modern approach, and saw where it falls short. Kubernetes Operators, while able to solve parts of the puzzle, are not going to replace domain expertise any time soon. We also saw that the huge players in this field shy away from the topic, both because of its difficulty and because of lacking financial incentives.

You should ask yourself this: is your ability to manage databases (or other stateful data stores) what makes your business unique? No? Then get a hosted database service from your cloud provider, and spend your time and effort on what does make your business unique. And on the off chance that you answered “Yes!”, go out and find everybody who answered “No”. There are many of them out there!

Incidentally, that brings us to our second suggestion. If you deploy on-premises, where no hosted cloud solution is available, find experts who can help you.

But no matter what you do, keep stateful data stores out of Kubernetes. At least, for now.
