Deploying Kubernetes on a private cloud environment like VMware vSphere or OpenStack is great for developer velocity, security, and compliance. But be careful: changing the password of the user that deployed the cluster may have unexpected consequences!
Suddenly, your Kubernetes Deployments stop deploying Pods, as the ReplicaSets no longer seem to work. Your API server responds to all your read (kubectl get pods) and delete (kubectl delete pod xyz) operations. You can even delete and create new Deployments, except neither you via kubectl nor your controllers can create new Pods. It is as if the master is simply not responding to those API calls. And no (seemingly relevant) Events are created to help you troubleshoot either!
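A session on an affected cluster might look something like this (the Deployment name and image are just examples):

```
$ kubectl get pods                            # reads still work
$ kubectl delete pod xyz                      # deletes still work
$ kubectl create deployment web --image=nginx # Deployment is created...
$ kubectl get replicaset                      # ...the ReplicaSet exists...
$ kubectl get pods                            # ...but no Pods ever appear
```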
In desperation, you may try to reboot your master(s). When the server comes back online, all you find is that the kubelet process is constantly restarting and never reaches the point where it can listen for incoming requests.
Locating the root cause
With a Kubernetes master that is not coming up, all you can do is SSH into the server and start poking around. Your first step is of course to check whether the kubelet is running. It might show up in ps aux, but only for a short instant, since it restarts very quickly. Your best bet is therefore to view its logs, which on a systemd-based OS means journalctl -u kubelet (add -b to only show the entries since the last reboot). The logs show that the service is restarting constantly…
…and that there seems to be an issue with the credentials.
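The inspection steps above boil down to a couple of commands on the master (shown here as a terminal session, not a script):

```
$ ps aux | grep kubelet                        # shows up briefly, then is gone again
$ journalctl -u kubelet -b --no-pager | tail -n 50
```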
Credentials for your private cloud
When you deploy Kubernetes using e.g. kubespray, the credentials you use to interact with your private cloud provider are stored on the master nodes. This is because Kubernetes needs to use them to request Persistent Volumes, Load Balancers, and so forth.
So it is obvious that private cloud credentials are needed. And, should you change your credentials due to a password rotation policy, it is equally obvious that the masters cannot make API calls using the old credentials anymore.
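For OpenStack, the stored credentials look roughly like this. This is a sketch of the in-tree cloud provider's config format; all values are made up, and the exact keys depend on your provider and kubespray version:

```
# /etc/kubernetes/cloud_config (kubespray's default path), OpenStack example
[Global]
auth-url=https://keystone.example.com:5000/v3
username=deploy-user
password=the-password-that-was-rotated
tenant-id=0123456789abcdef
domain-name=Default
region=RegionOne
```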
What is less obvious is why the master failed in the way it did. Further investigation is needed!
Interesting questions regarding credentials
Some interesting avenues to explore would be:
- Since creating and deleting Deployments worked, does that mean only etcd access is needed to carry out those tasks? Pod deployment failed, so does it require interacting with the private cloud for some reason? Does the scheduler need to determine some status about the nodes, and use the private cloud API to do so?
- Why did the API server process not crash hard at runtime when the credentials became wrong due to the change? The API server refuses to start up if the credentials are wrong, so should the entire process not fail just as loudly when errors of this type appear at runtime?
- Where, if anywhere, are the Events that one could use to determine that this error has occurred? The Deployments did not report any error Event.
Mitigation and lessons learned
To fix this issue, the mitigation is simple: locate the file that holds the credentials for your private cloud, and update it (most likely called /etc/kubernetes/cloud_config if you used kubespray).
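A minimal sketch of the fix, assuming the key=value format shown earlier. The filename and password values here are made up, and the script operates on a local example copy; on a real master you would edit /etc/kubernetes/cloud_config as root and then restart the kubelet:

```shell
# Sketch: rotate the password in a (copy of a) cloud provider config file.
CONFIG=cloud_config.example

# Example content, mimicking the key=value format of the real config file
cat > "$CONFIG" <<'EOF'
[Global]
username=svc-k8s-deploy
password=old-rotated-out-secret
EOF

# Replace the old password line with the new password
sed -i 's/^password=.*/password=new-secret/' "$CONFIG"

# Verify the change took effect
grep '^password=' "$CONFIG"   # prints: password=new-secret

# On the real master, finish by restarting the kubelet:
# sudo systemctl restart kubelet
```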
When we install private cloud Compliant Kubernetes clusters for our clients, we always ask that they make a service account for deploying the cluster. Service accounts are not personal accounts, and their permissions can therefore be highly limited. This also makes it possible for our clients to use their existing LDAP to manage permissions. Thankfully, the error described here in this blog post was discovered during internal testing, where one of our engineers had used their own credentials for a personal sandbox cluster.
The lessons to learn here are:
- Service accounts should always be used, even for sandbox testing.
- System administrators must be aware that Kubernetes can start to behave strangely if the password of a service account is rotated: they need to log in to the masters, update the file where the credentials are stored, and restart the kubelet process.
- Kubernetes masters apparently use the private cloud credentials in unexpected ways and for tasks that we did not foresee. As it seems the developers did not foresee them either, we are happy that we employ so many skilled engineers who know how to troubleshoot in a Kubernetes-based environment! (As an aside: getting your team Certified Kubernetes Administrator (CKA) training is worth it in the long run!)
So, with all that out of the way, who wants to take a deep dive into the Kubernetes source code to find out why the system behaved like it did, and answer the questions above?