The Challenges of Availability Zone Outages

AWS recently had an outage in one of the availability zones of its US East region. The outage took down parts of popular services such as Slack and Hulu. The effects were felt worldwide: during the outage, we were unable to start new Slack chats between individuals, for instance.

In this article, we discuss what a zone outage means for your software architecture and deployment, and what you can reasonably do to be resilient to zone outage effects.

What is a zone outage and how does it affect your application?

AWS, like other major cloud providers, divides its huge cloud regions into independent availability zones. An availability zone is essentially one or more data centers that are separate from, but geographically close to, the other zones in the region. It has its own redundant power connections, Internet connections, storage arrays, and so on. Because the zones are so close to each other, latency between them is low enough that they can be treated as parts of a single region.

The main difference between an availability zone and a whole new region is that certain cloud management functionality spans the availability zones in a region. For instance, you can tell Amazon S3 to store your objects in a region, and it will spread the data across the availability zones in that region to keep it highly available.

In contrast, cloud regions are completely separated from each other, management-wise.

Because cloud management is shared within a region, you can even define an AWS Auto Scaling group that deploys your servers across the region's availability zones. That should go a long way toward handling zone outages automatically, since the region's cloud management software can start replacement servers in the zones that are still healthy.

So why does a zone outage affect applications badly?

The devil is, as usual, in the details. While it is definitely true that the regional cloud management system can start servers for you in different availability zones, those servers might be useless without their data. Your data likely resides on Amazon Elastic Block Store (EBS) volumes, and that service is zonal, not regional. So data stored in one availability zone is not immediately available in another zone.

This affects AWS managed services, too, such as Amazon Relational Database Service (RDS), because its database instances are also backed by EBS volumes. Databases are good at replication, though, so you can tell the service to run a replicated deployment across multiple availability zones. But unless you specifically asked for replication, your data can wind up just sitting there in the zone that experiences an outage. Untouchable.
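
One way to find databases that have not been set up for cross-zone replication is to list the RDS instances whose Multi-AZ flag is off. The following is a minimal sketch, assuming that `rds` behaves like a boto3 RDS client (for example `boto3.client("rds")`). The function name is ours, and pagination and Aurora clusters, which handle replication differently, are left out.

```python
from typing import Any, Dict, List


def single_az_databases(rds: Any) -> List[str]:
    """Return identifiers of RDS instances that are not Multi-AZ.

    `rds` is assumed to behave like a boto3 RDS client. Only the first
    page of results is inspected, so large accounts would need a
    paginator instead.
    """
    response: Dict[str, Any] = rds.describe_db_instances()
    return [
        db["DBInstanceIdentifier"]
        for db in response.get("DBInstances", [])
        if not db.get("MultiAZ", False)
    ]
```

Any identifier this returns is a database whose data lives in a single zone and would be unreachable during an outage of that zone.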

This affects services running on Kubernetes clusters, too. On AWS, a Pod that requests a Persistent Volume will typically have that storage provided by EBS, which ties the Pod to a particular availability zone. In spite of its great automation, nothing Kubernetes can do will make the data automatically appear in a new availability zone in case of an outage.

Availability zone outages are potentially very problematic for Kubernetes itself, too. You need to have the control plane deployed across multiple zones to ensure availability. If you do not, a zone outage can take down your non-redundant control plane. And even if you do, a control plane spread over the typical three zones can survive only a single zone outage. If more than one zone goes down, the etcd database backing it loses the majority of its members and will be unable to function.
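
To see why a multi-zone control plane tolerates one zone outage but not two, recall that etcd needs a majority of its members (a quorum) to be reachable. The sketch below is plain arithmetic on a hypothetical layout with one etcd member in each of three zones. The zone names and function names are made up for illustration.

```python
from typing import Dict, Set


def quorum(members: int) -> int:
    """Smallest majority of a cluster with the given number of members."""
    return members // 2 + 1


def survives_outage(members_per_zone: Dict[str, int], failed_zones: Set[str]) -> bool:
    """True if the members outside the failed zones still form a majority."""
    total = sum(members_per_zone.values())
    alive = sum(
        count
        for zone, count in members_per_zone.items()
        if zone not in failed_zones
    )
    return alive >= quorum(total)


# Hypothetical layout: one etcd member per zone.
layout = {"zone-a": 1, "zone-b": 1, "zone-c": 1}
print(survives_outage(layout, {"zone-a"}))            # True: 2 of 3 remain
print(survives_outage(layout, {"zone-a", "zone-b"}))  # False: 1 of 3 is no majority
```

The same arithmetic shows that placing two of three members in one zone is worse than it looks: losing that one zone already breaks the quorum.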

How can you handle availability zone outages?

First of all, let’s note that the problems outlined above all relate to stateful components. With stateless components, that is, ones that do not rely on zone-specific services such as EBS volumes, you can make smart automation choices. Use regional cloud management features such as Auto Scaling groups to get the capacity you need. And for Kubernetes, the Cluster Autoscaler can make sure that a sudden loss of worker nodes is swiftly compensated for.
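
As an illustration of the Auto Scaling group approach, the sketch below builds the arguments for boto3's `create_auto_scaling_group` call and refuses to do so unless the chosen subnets cover at least two zones. The helper name, the group and launch template names, and the subnet-to-zone mapping are all hypothetical inputs you would supply yourself.

```python
from typing import Any, Dict


def multi_zone_group_request(
    name: str,
    launch_template: str,
    subnet_zones: Dict[str, str],
    min_size: int,
    max_size: int,
) -> Dict[str, Any]:
    """Build keyword arguments for an Auto Scaling group spanning several zones.

    `subnet_zones` maps subnet IDs to the availability zone each one is in.
    """
    zones = set(subnet_zones.values())
    if len(zones) < 2:
        raise ValueError("subnets must span at least two availability zones")
    return {
        "AutoScalingGroupName": name,
        "LaunchTemplate": {
            "LaunchTemplateName": launch_template,
            "Version": "$Latest",
        },
        "MinSize": min_size,
        "MaxSize": max_size,
        # The API expects the subnet IDs as one comma-separated string.
        "VPCZoneIdentifier": ",".join(sorted(subnet_zones)),
    }
```

With boto3 installed and credentials configured, the result could be passed along as `boto3.client("autoscaling").create_auto_scaling_group(**request)`.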

Facing zone outage with stateful components is, as discussed, more difficult. Enable cross-zone replication for services that support it. For those that do not, you need to have a disaster recovery plan. Let’s zero in on EBS volumes, since they are at the very center of all this.

You can take incremental snapshots of EBS volumes, which AWS stores in S3. Remember that S3, unlike EBS, is a regional service and not a zone-specific one, so the snapshots remain available if a single zone fails. Incremental means that only the blocks that changed since the previous snapshot are backed up, which makes snapshots, relatively speaking, cheap.
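
Taking such snapshots is easy to script. Here is a minimal sketch, assuming `ec2` behaves like a boto3 EC2 client (for example `boto3.client("ec2")`). The function name and the description text are our own choices, and scheduling and retention are left to whatever job runner you already use.

```python
from typing import Any, Iterable, List


def snapshot_volumes(
    ec2: Any,
    volume_ids: Iterable[str],
    reason: str = "scheduled backup",
) -> List[str]:
    """Start a snapshot of each EBS volume and return the new snapshot IDs.

    `ec2` is assumed to behave like a boto3 EC2 client. The snapshots are
    only started here; they complete asynchronously.
    """
    snapshot_ids = []
    for volume_id in volume_ids:
        response = ec2.create_snapshot(
            VolumeId=volume_id,
            Description=f"{reason} of {volume_id}",
        )
        snapshot_ids.append(response["SnapshotId"])
    return snapshot_ids
```

How often you run something like this determines how much data you can lose: a restore can only ever be as recent as the last completed snapshot.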

Say that we had an EBS volume in zone A, and that we snapshotted it frequently. Also say that zone A fails. You can then spin up a new EC2 instance in zone B and restore the latest snapshot into a new EBS volume in zone B. Attach this new EBS volume to your instance, and you should have an as-recent-as-your-snapshot copy of your data.
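
The restore steps just described could look roughly like this. It is a sketch under the assumption that `ec2` behaves like a boto3 EC2 client. The function name is ours, and the volume ID and target zone are whatever applies in your account.

```python
from typing import Any


def restore_latest_snapshot(ec2: Any, volume_id: str, target_zone: str) -> str:
    """Create a volume in `target_zone` from the newest completed snapshot.

    `ec2` is assumed to behave like a boto3 EC2 client. Returns the ID of
    the new volume, which still has to be attached to an instance.
    """
    response = ec2.describe_snapshots(
        OwnerIds=["self"],
        Filters=[
            {"Name": "volume-id", "Values": [volume_id]},
            {"Name": "status", "Values": ["completed"]},
        ],
    )
    snapshots = response.get("Snapshots", [])
    if not snapshots:
        raise LookupError(f"no completed snapshot found for {volume_id}")
    # The snapshot with the latest start time holds the most recent data.
    latest = max(snapshots, key=lambda snapshot: snapshot["StartTime"])
    volume = ec2.create_volume(
        SnapshotId=latest["SnapshotId"],
        AvailabilityZone=target_zone,
    )
    return volume["VolumeId"]
```

Once the new volume is available, the client's `attach_volume` call can attach it to an instance in the same zone.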

This sounds promising, but also laborious to do by hand.

Thankfully, the Kubernetes Container Storage Interface (CSI) driver for Amazon EBS supports working with snapshots directly, so you can automate a lot of this tedious work.
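
In Kubernetes terms, this means creating a VolumeSnapshot object for a claim and, when needed, a new PersistentVolumeClaim that names the snapshot as its data source. The sketch below only builds the two manifests as Python dictionaries. It assumes the snapshot CRDs and snapshot controller are installed in the cluster, and every object name, class name, and size passed in is a placeholder of your choosing.

```python
import json
from typing import Any, Dict


def volume_snapshot(name: str, pvc_name: str, snapshot_class: str) -> Dict[str, Any]:
    """Manifest for a VolumeSnapshot of an existing PersistentVolumeClaim."""
    return {
        "apiVersion": "snapshot.storage.k8s.io/v1",
        "kind": "VolumeSnapshot",
        "metadata": {"name": name},
        "spec": {
            "volumeSnapshotClassName": snapshot_class,
            "source": {"persistentVolumeClaimName": pvc_name},
        },
    }


def claim_from_snapshot(
    name: str, snapshot_name: str, storage_class: str, size: str
) -> Dict[str, Any]:
    """Manifest for a new PersistentVolumeClaim restored from a snapshot."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "storageClassName": storage_class,
            "dataSource": {
                "name": snapshot_name,
                "kind": "VolumeSnapshot",
                "apiGroup": "snapshot.storage.k8s.io",
            },
            "accessModes": ["ReadWriteOnce"],
            "resources": {"requests": {"storage": size}},
        },
    }


# Placeholder names; kubectl accepts JSON manifests as well as YAML.
print(json.dumps(volume_snapshot("data-snap", "data", "ebs-snapclass"), indent=2))
```

If the storage class delays volume binding until a Pod is scheduled, the restored volume should be created in whichever zone that Pod lands in, which is what you want after a zone failure. Verify this behavior against your own storage class settings before relying on it.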

Your disaster recovery plan must take the possibility of availability zone outage into account.

What about regional failure, then? That can be the stuff of nightmares, but if you want to learn how to be resilient to those, too, we have blogged about that before.

Summary

In this article, we discussed what a zone outage means for your software architecture and deployment, and what you can reasonably do to be resilient to zone outage effects. Users of cloud services need to be aware of the services' limitations: which ones are confined to a single availability zone, and which are regional. When it comes to recovering from zone outages, the devil is in exactly those details, in particular because block storage services such as EBS are zonal rather than regional. We also presented how to handle such outages using incremental snapshots and automation.

Lars Larsson

Lars holds a PhD in Computer Science, is a senior cloud architect at Elastisys, and a DevOps expert engineer.