This article will explore building geo-distributed applications, with a particular focus on using NewSQL databases.
If you can endure some theory, you will be richly awarded in the end with sample code to run your WordPress-powered blog or CMS across three (!) Kubernetes clusters in three (!) different geographical regions.
The database pendulum
For many years, running an application that requires traditional ACID-style database semantics has translated into running a single fat database or, at best, a clustered solution where the replicas are within close proximity of each-other. Essentially, relying on an “SQL database” has prevented you from achieving the so highly desired scalability and availability that NoSQL databases (such as Cassandra) offer. This caused a movement towards NoSQL databases which, as we’ll discuss later, come with a different set of limitations. It appears as though the pendulum has started to swing back.
In recent years, a lot of advances have been made in the so-called NewSQL segment of databases. Advances that appear to contradict the CAP theorem by promising both strong consistency and high availability at the same time (more on this later). If these claims are true, the road towards geo-distributed applications for the masses appears to lie wide open. In this post, we’ll examine how to build a geo-distributed application for a non-trivial application. Namely, WordPress!
Okay, so you may (rightfully) ask why it’s important to have a blog/website engine achieve serious levels of availability. The average reader of your relative’s cat blog may not be too upset by some service downtime. However if you, like many others, use WordPress to service the customers of your e-commerce store, then every second of downtime means lost revenue. WordPress also serves as a good use case for our little experiment, since it is well-used (serving the content of 30 percent of all websites [source], and has non-trivivial state needs (more on that later).
But let’s start with some theory!
High availability – just more of the same really
What does it mean for a system to be highly available anyway? It’s a relative concept, since in real life one cannot guarantee 100% availability. Instead, availability is measured in nines. For example, 99.9% availability means that you will experience no more than about 9 hours of service downtime during a full year.
So, how does one achieve high availability? A good start is to make your system redundant. This typically means using multiple copies of each component, preferably scattered over different “zones”. Okay, this requires some explanation. Most cloud operators have several “zones” in each geographical region they service. Each such zone is essentially a data center with separate network cables and power supplies and, therefore, independent failure domain from the other zones. So, the recommendation is often to run your application in several of these zones, so that your application can tolerate a zone outage (which is rare, but does happen). Since zones are located rather nearby (in the same cloud region), network latency is low.
However, to make your system even more robust, you should try to spread your application over a larger geographical area, spanning several regions, not just zones within a single region. Region outages do happen, for example due to catastrophic events or poor failure-isolation on part of your cloud provider. Therefore, to achieve those extra nines of availability we should make our
Distributed State – beautiful but hard. Damn hard.
The good news is that high availability is really easy to achieve. Well, unless state is involved. Then it gets really hard. Why is this so?
Distributed state really boils down to solving the consensus problem. That is, having a group of network-distanced processes agree on a (set of) value(s) in the presence of unreliable components (networks, hosts, processes) and the lack of a synchronized global clock. Notably, there is no way for a process to distinguish a failed peer from one that is merely delayed by a slow network/processor/etc. Neither is it possible to use the system clock to order events between processes, since there is no upper limit on the amount of clock skew between two hosts (it can, not uncommonly, be around 250ms).
In the particular case of distributed databases, these circumstances make it really hard for a collection of database replicas to determine which transctions to accept, and in what order.
There are many clever algorithms for building distributed state stores, but they are all forced by the CAP theorem to make a fundamental design trade-off between availability and consistency in the event of a network partition. It has commonly been expressed as a pick-two-out-of-three type of trade-off with the following properties being the ones to “choose from”:
- (C)onsistency: “every read sees the most recent write”
- (A)vailability: “reads and writes are always accepted”
- (P)artition tolerance: “the system continues to operate despite message loss due to network/node failure”
This perspective is flawed. The choice, as shown in the image below, is really between consistency and availability and only when a network partition happens; at other times no trade-off has to be made.
- A system that chooses to be AP will continue accepting writes on both sides of the partition and may therefore drift apart (state-wise), but with a conflict resolution protocol in place on both sides of the partition, the members will eventually reach a common state after the partition has healed (some values that were written by clients on one side of the partition may no longer be reflected in the state though).
- A system that chooses CP will only accept writes on the majority side. This can be made to guarantee 100% consistency.
There are several important observations to be made here (as described in much greater detail here):
- The CAP theorem is asymmetrical: CP systems can guarantee consistency, whereas AP systems do not guarantee 100% availability. 100% availability is a theoretical construct. In reality availability is only measured in nines. So there is really only one side of the theorem (CP) that provides a useful guarantee.
- Choosing CP does not force us to give up on the guarantee of A, since there never was such a guarantee to begin with.
- Guaranteeing C causes a reduction in (already imperfect) A. By how much? In practice, very little. It only happens on network partitions where clients are “trapped” within the minority partition, which is rare. In fact, in practice a system can guarantee C and achieve HA at the same time, such that it is “technically CP, but effectively [also] AP” [ref].
- Generally speaking choosing A buys very little in effective AP, but the loss of consistency pushes a great deal of complexity into application code.
- The only really good reason for choosing an “A”-system is to reduce write-latency.
In fact there is a PACELC theorem that offers an extension to the CAP theorem, which captures that there are two trade-offs to be made — one betweem availability and consistency during partitions, and another one, in the absence of partitions, between consistency and latency.
So how can, so called, NewSQL databases like Spanner, CockroachDB, TiDB claim to be both consistent and highly available? Now we have the answer. They are technically CP but, for all practical purposes, also AP.
Now that we know how geo-distributed databases can achieve the best of two worlds (consistency and availability), we should feel reassured to start using them to empower our applications with even higher degrees of availability.
For this exercise, we’ll use WordPress and run it across three Kubernetes clusters located in three different Google Cloud regions, and fronted by a global cloud load-balancer.
WordPress is a simple PHP web application which stores its state in two places:
- a MySQL database
- a file system
Most of the data, such as posts and pages are stored in the database, but a few assets (such as uploaded media, themes, and plugins) are stored on disk. So in order to run WordPress over several regions, and survive one region going down, we need to ensure that these state stores are replicated across the regions.
For the database, we can use TiDB, a MySQL-compatible NewSQL database that scales horizontally and supports ACID transactions. As mentioned, it is a CP-style system that uses the Raft protocol and a distributed key-value store to store data.
The file system is a bit trickier. However, since the file-system is arguably less important to keep in perfect sync (we can accept a few seconds delay before an image gets replicated across all sites), we’ll use an asynchronous replication approach with the help of SyncThing, which can listen for filesystem notifications (inotify) to send file-system updates to its peers. Think of it as a bidirectional rsync.
This basic setup is illustrated in this image:
The following image shows, in greater detail, what the setup looks like (click for bigger version).
If you’d like to read more about the details and/or try out this example yourself, follow the instructions in the demo github repo.
The code is far from perfect and has a long ways to go before being ready to put into production (did someone say security?), however it does show the feasibility of this type of highly fault-tolerant and geographically distributed setup.
It won’t give you 100% uptime (this is unattainable, remember?). In particular, you may see some occasional 502 (Bad Gateway) errors if you “take down” a region (for example, by stopping all instances in a region), before the load-balancer health-checks have discovered the unhealthy endpoints and taken it out of rotation. To reduce the risk of this happening, you can reduce health check timeouts/intervals manually for the Cloud load-balancer (it is not yet supported in the Terraform module, Issue #28).
- Be careful before you decide on an AP state store. In most cases, trading off consistency for lower latency is a poor deal, that you will pay back with interest when you try to deal with inconsistencies in code.
- NewSQL open up interesting opportunities for geographically distributing applications with traditional (ACID) database needs.