Over-capacity is not a Safety Net

“Auto-scaling? No, we always run a few extra servers as spare capacity, just in case” and “We use auto-scaling, and have some capacity to spare, in case of emergencies” are two sentiments we find online from time to time. In this post, we will look at why over-capacity alone is not a safety net or a guarantee against under-provisioning emergencies.

Over-capacity Without Auto-scaling

How did companies deal with uncertain resource demands before the cloud? They would buy a bunch of servers — perhaps adding a few extra to be sure. If a usage peak came, users would face a service that would be slow to respond or even become unavailable. Those days are behind us, with more than 70% of x86 servers being virtualized (Gartner, 2014) and many of those deployed in the cloud. The practice, however, is not.

We know how difficult it can be to leave both the technology and the mindset behind (and we offer professional services to help companies overcome these difficulties). There is no magic wand that turns legacy applications into horizontally scalable cloud-ready applications. If the problem is not caused directly by legacy software, it is typically due to a legacy mindset. Luckily, that can more easily be changed. If scaling up or down requires a large amount of system administrator effort and time, it is obvious that the workflow is suboptimal. Incidentally, we can help there, too.

It is clear that relying only on a few extra servers achieves two things:

  • needlessly high operational expenditures (OPEX), and
  • a false sense of sufficient capacity, since the application will still buckle under unexpectedly high load.

The word unexpectedly is emphasised because no amount of up-front planning can deal with the unexpected. Predictions based on monitored data can, but plans involving humans and meetings cannot.

Over-capacity With Auto-scaling

What if an application makes use of reactive auto-scaling, then? The technology is ready, the mindset is in place. So why do we claim that always having, say, 20% additional capacity is not enough? The problem is not the safety margin itself; it is how the margin is calculated in the face of demand changes.

To see why, let’s use an analogy. Consider a bird flying across a landscape. There are rocks of various sizes that the bird needs to avoid. It tries to avoid hitting a rock by always staying 3 feet (about a meter) above whatever ground or rocks it encounters. It always looks straight down to make sure it keeps its safe distance. Deciding to fly higher or lower takes some time, as does the altitude adjustment itself. This approach works reasonably well as long as the rocks are small. Sadly for the bird, if it encounters a rock that is higher than the 3-foot safety margin, the drawbacks of the approach become brutally apparent.

Figure: The bird analogy for cloud application capacity.

This figure first appeared on Lars’ LinkedIn post on the same subject.

As we have blogged about before, it takes time before capacity adjustments manage to converge with resource demand. A much smarter bird looks ahead, and when it sees a big rock in front of it, it starts climbing before it flies into trouble. Striving to have a 3-foot safety margin might still be prudent (servers tend to exhibit strange behavior when they are close to overloaded), but only in combination with actually looking forward.
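The downward-looking bird can be sketched as a small simulation. This is only an illustration under stated assumptions: the function names, the demand figures, and the three-step provisioning delay are all hypothetical, and capacity is simply set to the demand observed `delay` steps earlier plus a 20% margin.

```python
def reactive_capacity(demand, headroom=0.20, delay=3):
    """Capacity of a purely reactive scaler.

    Assumption: a decision based on the demand seen at step t only takes
    effect at step t + delay (boot time, warm-up, and so on).
    """
    capacity = []
    for t in range(len(demand)):
        observed = demand[max(0, t - delay)]  # the bird looks straight down
        capacity.append(observed * (1 + headroom))
    return capacity


def shortfalls(demand, capacity):
    """Steps at which demand exceeds the capacity that is actually in place."""
    return [t for t, (d, c) in enumerate(zip(demand, capacity)) if d > c]


# Small rocks: demand never grows by more than 20% within the delay.
small_rocks = [100, 104, 108, 112, 116, 120, 118, 115]
# A big rock: demand doubles from one step to the next.
big_rock = [100, 100, 100, 200, 200, 200, 200, 200]

print(shortfalls(small_rocks, reactive_capacity(small_rocks)))  # []
print(shortfalls(big_rock, reactive_capacity(big_rock)))        # [3, 4, 5]
```

With gentle growth the margin absorbs the lag. When demand doubles, the application is under-provisioned for the whole delay, no matter that the 20% margin was maintained at every earlier step.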

How to Look Ahead

Intuitively, this all makes sense. Of course one should look ahead. There is no reasonable counter-argument. What does it mean in practice, though? How do we actually do it?

First of all, we need good metrics. We have blogged about how metric choice impacts how much information we actually get to reason about. Say that we have a good metric: the current number of requests per second. Compared against the request rate our current servers can handle, that tells us by how much we are over- or under-provisioned, unless our current capacity happens to be perfect for the demand we are seeing. If we plot this data, perhaps with some smoothing, we can see that there are trends in how our users actually use our service. Nights are typically less busy than days, weekends less busy than workdays (or the other way around, depending on application), and so forth. Scaling just based on that would allow us to avoid over-provisioning, saving money in the process.
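As a minimal sketch of that reasoning, the snippet below smooths a requests-per-second series and expresses the result as a number of spare or missing servers. The names and the `rps_per_server` figure are assumptions for illustration; in practice that figure would come from load testing your own application.

```python
import math


def moving_average(samples, window=3):
    """Trailing moving average; early points average over what is available."""
    smoothed = []
    for i in range(len(samples)):
        chunk = samples[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def provisioning_gap(rps, servers, rps_per_server):
    """Positive result: spare servers. Negative result: servers missing.

    rps_per_server is a hypothetical, measured per-server capacity.
    """
    needed = math.ceil(rps / rps_per_server)
    return servers - needed


rps_samples = [100, 200, 300, 400]
smoothed = moving_average(rps_samples)  # [100.0, 150.0, 200.0, 300.0]

# Assume each server handles 100 requests per second.
print(provisioning_gap(450, servers=6, rps_per_server=100))  # 1 spare server
print(provisioning_gap(450, servers=4, rps_per_server=100))  # -1, one missing
```

Note that smoothing trades noise for lag: the smoothed series trails the raw one, which is one more reason a purely reactive scaler ends up behind the demand.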

The real killer, though, is under-provisioning. We need to avoid having too little capacity, not only because it can wreak havoc among our servers, but because slow or unavailable services cause loss of reputation, sales, etc. To avoid it, we need to not only consider daily patterns, but what is going on right now, and what the current usage trend looks like. If our usage doubles, right now, is that an indication of a new normal state, or was it just a random fluke? We do not want to over-react by scaling up if it is just a fluke, of course. But we would prefer to meet a new normal usage level with confidence and sufficient capacity, as the users arrive.
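One crude way to separate a fluke from a new normal is to require that the elevated level is sustained over several consecutive samples. The sketch below is a deliberately simple heuristic with hypothetical names and thresholds, not a substitute for a proper statistical model.

```python
def is_new_normal(history, recent, factor=1.5, min_sustained=3):
    """Return True if the last `min_sustained` samples all exceed
    `factor` times the historical average.

    history: samples describing the previous normal level
    recent:  the most recent samples, oldest first
    factor and min_sustained are illustrative assumptions to tune.
    """
    if len(recent) < min_sustained:
        return False
    baseline = sum(history) / len(history)
    return all(x > factor * baseline for x in recent[-min_sustained:])


history = [100, 102, 98, 100]

print(is_new_normal(history, [100, 210, 101]))  # False: a single spike
print(is_new_normal(history, [205, 198, 210]))  # True: a sustained shift
```

The weakness is plain: every sample spent waiting for confirmation is time during which users may already be hitting an under-provisioned service. That tension is why a better model of the usage trend is needed.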

To do this, we need to employ some nifty mathematical models, statistics, and machine learning techniques. The maths required is really quite similar to that used for weather forecasting, where:

  • we know what typical September weather is like for a given region,
  • we have learned to identify indicators of certain weather patterns, and
  • we know what the weather has been like the past few days.

Taken together, this provides a pretty good short-term idea of what the weather will be like. As with weather forecasting, accuracy goes down the further into the future we predict. Luckily, for cloud auto-scaling we typically only have to predict about 15-30 minutes ahead, depending on our application. That is enough time to adjust our capacity and be ready for our users, and it gives us a much-needed capacity safety net: the capacity is there when we need it, not 15-30 minutes after we have realized that we should have had it.
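To make the forecasting idea concrete, here is a toy predictor that combines a seasonal baseline (the "typical September weather") with a linear trend fitted to how recent observations deviate from it. It is a sketch under assumptions, not the model used by any particular product: the function names, the slot-based time axis, and the choice of a least-squares line are all illustrative.

```python
def linear_trend(values):
    """Least-squares slope and intercept for values at x = 0, 1, ..., n-1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def forecast_rps(baseline, recent, now, horizon):
    """Predict requests per second `horizon` slots ahead.

    baseline: typical rps per time slot over one cycle (for example, a day)
    recent:   the latest observations, oldest first, ending at slot `now`
    """
    slots = len(baseline)
    start = now - len(recent) + 1
    # How far is reality from the typical pattern, and where is that heading?
    residuals = [obs - baseline[(start + i) % slots]
                 for i, obs in enumerate(recent)]
    slope, intercept = linear_trend(residuals)
    predicted_residual = intercept + slope * (len(recent) - 1 + horizon)
    return max(0.0, baseline[(now + horizon) % slots] + predicted_residual)


# Flat typical load of 100 rps, but the last four slots climbed steadily.
flat_baseline = [100] * 12
print(forecast_rps(flat_baseline, [100, 110, 120, 130], now=5, horizon=3))  # 160.0
```

If each slot is five minutes, a horizon of three slots corresponds to the 15-minute look-ahead discussed above. A linear extrapolation like this degrades quickly over longer horizons, which matches the point about forecast accuracy.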

Summary

Using predictive methods and proactive auto-scaling is the only way to actually create a capacity safety net for one’s cloud application. The Elastisys Cloud Platform offers not only that, but also a number of other very nice features, including multi-cloud capabilities, which we have blogged about.

Has your current auto-scaling solution let you down? Are you wondering how to improve and automate your workflow? Let us hear your stories in the comments below!
