The Elastisys Tech Blog

Secure Kubernetes Operational Practices: Cross-provider Disaster Recovery

Lars Larsson
February 8, 2024
2:57 pm

Recent ransomware-related events have raised awareness of how securely IT suppliers operate. Elastisys provides fully managed Kubernetes platforms with an unusual responsibility model: we take responsibility not just for the Kubernetes control plane but for the entire platform. This implies two things: customers put a lot of faith in our ability to securely operate their platforms, and we hold ourselves to very high standards to earn that faith.

In this article, we describe how we set up not just “off-site” but “off-cloud provider” backups and disaster recovery. This is why our customers can sleep well at night: they know that Elastisys is able to restore their Kubernetes platforms to another cloud infrastructure provider, even if the original provider is hit by ransomware and unable to recover its operations, and to do so within the four hours that our terms of service state for critical incidents.

How Elastisys is able to restore Kubernetes to another cloud provider

The worst possible scenario with regard to the underlying cloud infrastructure provider is that its entire software stack has been compromised in a way that affects all its regions. This is not a very common scenario, but the effects would be devastating. Recovery from such a scenario cannot rely on failing over to another region of the same provider; you have to be able to restore to an entirely different provider.

Background and production readiness

Welkin by Elastisys is developed to be infrastructure provider-agnostic. And because we already know how to offer it on many different cloud infrastructure providers, we know that we have the technical expertise to work across providers.

Our extensive backups can be used to restore an entire Kubernetes platform from scratch. This is something that we practice with our Premium customers (and Standard customers can order this for a fee) before they go into production, as per our go-live checklist. Therefore, both we and our customers know that for their production workloads, this is not just a theoretical possibility, but something that has been proven in practice.

Contractual considerations

But just because it works on a technical level does not mean that we can just go add a new subprocessor to our customer’s list of subprocessors. That wouldn’t be GDPR-compliant, and we care a lot about compliance.

When a small Danish cloud provider went out of business in the fall of 2023 after being hit by a severe ransomware attack, in which even all their backups were encrypted, it sent shockwaves through the community. We did not offer our services there, but it was a wake-up call for everyone. As soon as we could, we started asking customers to allow us to add a new subprocessor to their list of subprocessors. Nobody should have to be in the unfortunate situation that this cloud provider, or their customers, were in.

Technical implementation

With both the technical ability proven and the contractual issues sorted, here is how we implement this cross-provider safety net.

Let’s call the two cloud infrastructure providers in play the “original” and “backup” providers. The original one is where the Kubernetes platform runs normally.

All backups, which include the files needed for PostgreSQL point-in-time recovery, are stored at the original provider. They are also stored at the backup provider, but client-side encrypted, so that the backup provider sees them only as opaque data blobs that it cannot decrypt. At the backup provider, Elastisys will use the “object locking” features of the provider’s object storage service (if available), which prevent all deletion operations on these files (except via object lifecycle rules determined ahead of time).
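The combination described above, encrypting before upload and locking the uploaded object against deletion, can be sketched as follows. This is a minimal illustration and not Elastisys’s actual tooling: it assumes an S3-compatible object store, the function name build_locked_upload and its arguments are hypothetical, and the encrypt callable stands in for whatever client-side encryption is really used. The dictionary keys follow the parameter names of the S3 PutObject API; whether a given backup provider supports them needs to be verified.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable, Optional


def build_locked_upload(
    bucket: str,
    key: str,
    plaintext: bytes,
    encrypt: Callable[[bytes], bytes],  # hypothetical: supplied by the caller
    retain_days: int,
    now: Optional[datetime] = None,
) -> dict:
    """Encrypt locally, then describe an upload that is locked against deletion.

    Returns S3-style PutObject parameters; nothing is sent over the network.
    """
    if retain_days <= 0:
        raise ValueError("retain_days must be positive")
    now = now or datetime.now(timezone.utc)
    # Encryption happens before any data leaves our side, so the backup
    # provider only ever receives an opaque blob.
    ciphertext = encrypt(plaintext)
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": ciphertext,
        # Compliance mode: the object version cannot be deleted or
        # overwritten until the retention date has passed.
        "ObjectLockMode": "COMPLIANCE",
        "ObjectLockRetainUntilDate": now + timedelta(days=retain_days),
    }
```

With an S3 client library, a dictionary like this could be passed as keyword arguments to the upload call, against a bucket that was created with object locking enabled. The key used by encrypt must of course be kept somewhere other than at the original provider, or the backups cannot be decrypted when they are needed.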

If the original provider becomes unavailable, we will decrypt the needed files at the backup provider and quickly set up an entire Kubernetes platform there, including the customer’s application. This is possible because our backups include all Kubernetes resources, such as Pods and Deployments, as well as the data they store. DNS entries will be pointed to the new location, and the customer application will become available there instead.
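One step of such a restore is choosing which database backup to start from. PostgreSQL point-in-time recovery restores a base backup taken before the desired recovery target and then replays archived write-ahead log up to that target. The sketch below shows only that selection step; the BaseBackup type and pick_base_backup function are hypothetical names used for illustration, not part of any Elastisys or PostgreSQL tooling.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable


@dataclass(frozen=True)
class BaseBackup:
    name: str              # hypothetical identifier, e.g. an object key
    finished_at: datetime  # when the base backup completed


def pick_base_backup(backups: Iterable[BaseBackup], target: datetime) -> BaseBackup:
    """Return the newest base backup that finished at or before the target time."""
    candidates = [b for b in backups if b.finished_at <= target]
    if not candidates:
        raise LookupError("no base backup precedes the recovery target")
    return max(candidates, key=lambda b: b.finished_at)
```

Picking the newest eligible base backup keeps the amount of log to replay small, which matters when the whole restore has to fit within a fixed time window.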

Summary

Elastisys has a best-in-class capability to restore Kubernetes platforms not just across cloud regions, but also across cloud providers. This proven capability is how we earn the trust to manage Kubernetes platforms even for society’s most critical entities, as defined by NIS2. This article has described how we do this, and the contractual as well as technical work that is needed to achieve it.

In addition to all we do internally, we also continuously audit our cloud infrastructure providers, to not just trust, but also verify that they meet our high standards.

If you want us to manage your Kubernetes platform, get in touch. And if you want to work with a company whose mission is to deliver high-end, premium products and services to the market in the areas of open source, cloud native operations, security, and compliance… we’re hiring.

Blog post by Lars Larsson

I’m Lars, Field CTO at Elastisys. I have been working with cloud technology since 2008 across all levels of the tech stack. Over the years, I’ve provided guidance to numerous companies during their organizational transitions, helping them achieve cost efficiencies through the adoption of cloud-native technology. My posts are primarily intended for technical decision-makers, senior engineers, and architects, offering deep insights into Kubernetes. You can find all my posts here on the Elastisys blog. Additionally, feel free to follow me on LinkedIn, where I share content on Kubernetes, DevOps, and compassionate leadership.

devops, Kubernetes, security
