Advanced predictive and pro-active auto-scaling (for City Cloud’s blog)
Elastisys auto-scaling prediction subsystem, © Elastisys AB, all rights reserved.

In our previous post, we briefly introduced the basics of auto-scaling: what it is, why it is important, and how it works, including monitoring, metrics, and all that jazz. In this post, we will dive deeper. We will see how an auto-scaler determines by how much it needs to scale your deployment and what one looks like internally. We use the Elastisys cloud platform for illustrative purposes; we develop it, so it is obviously the one we know best. We also discuss how it uses predictions to pro-actively determine future capacity demand and give you just-in-time capacity. Grab a nice cup of coffee (might we suggest a caffè mocha?), and read on!

Why predictive and pro-active auto-scaling matters

When Elastisys was founded in 2011, cloud was already surrounded by a lot of hype, but how it was actually used in the industry was nothing like today. Today, companies are increasingly making good use of the dynamic nature of the cloud, and continuously move new workloads into it. Increased overall cloud usage, combined with its increasingly dynamic nature, means that companies can make their applications more robust and keep performance high even under unexpected usage patterns, while also saving substantially on deployment costs.

Adrian Cockcroft, previously Cloud Architect at Netflix and now Technology Fellow at Battery Ventures, stated in a keynote at UCC 2014 that, in his experience, predictive auto-scaling saves up to 70% of cloud costs. How did he get that experience? Netflix realized that the simple rule-based reactive auto-scaling solutions offered by cloud providers were unable to offer what they needed. Because they have considerable resources, they developed their own predictive auto-scaler. This work, however, is not available to outsiders.

The cost savings argument is obviously going to be strongest for companies that have large cloud deployments to scale. Companies that do not have massive deployments (and hence have less to gain in terms of savings when usage dips) benefit more from ensuring that their applications are responsive, even when unforeseen numbers of visitors arrive. This can be a tricky task, in particular if one is not used to such spikes. Research shows that 75% of visitors would not return to pages that took more than 4 seconds to load. This can make or break an online company’s success, regardless of size.

Reactive auto-scaling is of course preferable to no auto-scaling solution, but its main drawback is that, by definition, the application is already under stress when the auto-scaler reacts and tries to provision more resources. Provisioning operations take time. If we also use simplistic threshold-based scaling rules offered by most cloud providers, we are very slow to converge to the required capacity levels to meet current demand. Using predictive methods to pro-actively scale, we can get close to immediate convergence, and our users are all served by a service that seems to have infinite capacity.

Now that we know why predictive and pro-active auto-scaling is important, let’s look at how it is achieved.

“Find X!”

Our previous post had a deceptively simple question at the end:

Simplistic auto-scaling works in a reactive fashion: resource availability is modified as a reaction to when a threshold value has been passed […] add X servers […] [a]nd how does the system calculate the correct value of X, anyway?

It is often easy to notice if an application is under-provisioned: load times increase, requests are dropped, and components can even crash. However, to determine by how much a system is under-provisioned, one needs:

  • good metrics (see the previous post, or this more in-depth one) that capture this, and
  • a good understanding of what the application or service component limitations are per resource unit, so we know what impact adding another server will have.

The first point lets us know how big the problem is, the second how big the solution should be.

A smart auto-scaler should take both pieces of information into account. This lets it close the gap between capacity demand and availability (converge) quicker.
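
To make the two pieces of information concrete, here is a minimal sketch of how X could be computed once both are known. The function names and the numbers are our own illustrative assumptions, not part of any particular auto-scaler; the figure of 500 requests per second per server is only an example capacity.

```python
import math


def servers_needed(demand: float, capacity_per_server: float) -> int:
    """Smallest number of servers that can carry the measured demand."""
    if capacity_per_server <= 0:
        raise ValueError("capacity_per_server must be positive")
    return math.ceil(demand / capacity_per_server)


def find_x(demand: float, capacity_per_server: float, current_servers: int) -> int:
    """Servers to add (positive) or remove (negative) to match demand."""
    return servers_needed(demand, capacity_per_server) - current_servers


# Hypothetical numbers: 2300 requests/s measured, 500 requests/s per server,
# 3 servers currently running.
print(find_x(demand=2300, capacity_per_server=500, current_servers=3))  # 2
```

The metric tells us the demand (how big the problem is), and the per-server capacity tells us how many servers that demand translates into (how big the solution should be).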

A not-so-smart auto-scaler would simply scale up and down by a fixed amount when threshold values are passed — often a low number of VMs (possibly a single one). For instance, using a rule such as when this value is higher than Y for so-and-so many minutes, scale up by Z number of instances. If the fixed amount is too small to meet current demand, the application is still under-provisioned once the scaling action finishes. Essentially, it will crawl up a staircase to catch up when a jump of several steps would have been more appropriate.
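
The staircase effect is easy to see in a small simulation. The sketch below is a toy model under our own assumptions (the function name and the server counts are made up for illustration): it counts how many scaling actions are needed before capacity covers demand, either with a fixed step or by adding the whole gap at once.

```python
from typing import Optional


def actions_until_converged(current: int, required: int, step: Optional[int]) -> int:
    """Count scale-up actions until capacity covers demand.

    step is the fixed number of servers added per action; None means
    "add the whole gap at once", as a scaler that knows the gap would.
    """
    if step is not None and step <= 0:
        raise ValueError("step must be positive")
    actions = 0
    while current < required:
        current += step if step is not None else required - current
        actions += 1
    return actions


# Hypothetical deployment: 2 servers running, 10 required.
print(actions_until_converged(2, 10, step=1))     # 8
print(actions_until_converged(2, 10, step=None))  # 1
```

Since every action has to wait for a threshold to be breached and for instances to boot, eight actions instead of one means the application stays under-provisioned for much longer.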

Advanced predictive and pro-active auto-scaling

Reacting to changes means, by definition, being at least one step behind. The bad thing has already happened (in the auto-scaling case, a mismatch between capacity demand and availability), and we respond to it (by adjusting capacity). Humans are smart, though. We know that if we can use our experience to predict when bad things are about to happen, we can take action before our predictions come true.

Just like checking the weather forecast to avoid having rain ruin a picnic, server capacity problems can be avoided by predictive methods. There are many advanced mathematical and/or computer science tools available for this, including:

  • identifying recurring patterns in our data,
  • statistically analyzing current demand variations, and
  • employing machine learning techniques.

A very smart auto-scaler would use all of these to make sure that resources are already available as demand arises. Does demand always rise during Monday morning, and drop off significantly once business hours are over? Pattern recognition handles that. Is a current spike in usage relatively small, or reason to scale up by a factor of 5? Statistics knows the answer. What effect did some scaling actions have for a given usage pattern? Have the auto-scaler learn it, and figure out appropriate actions for the future!
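
As a rough illustration of the first two ideas, the sketch below shows a naive recurring-pattern forecast and a simple statistical spike check. These are deliberately simplistic stand-ins that we made up for this example (function names and data included); real predictors are considerably more sophisticated.

```python
from statistics import mean, pstdev


def seasonal_forecast(history: list, period: int, steps_ahead: int = 1) -> float:
    """Average of past observations at the same point in the cycle as the target."""
    target = len(history) - 1 + steps_ahead  # index of the point to predict
    same_phase = [history[i] for i in range(target % period, len(history), period)]
    if not same_phase:
        raise ValueError("no observation at the target point in the cycle")
    return mean(same_phase)


def spike_z_score(recent: list, current: float) -> float:
    """How many standard deviations `current` lies from the recent mean."""
    centre, spread = mean(recent), pstdev(recent)
    if spread == 0:
        if current == centre:
            return 0.0
        return float("inf") if current > centre else float("-inf")
    return (current - centre) / spread


# Hypothetical load samples with a cycle of 4 samples, peaking at position 1.
history = [10, 50, 30, 10, 12, 54, 28, 10]
print(seasonal_forecast(history, period=4, steps_ahead=2))  # 52

# A reading of 500 against a calm recent window is clearly a spike.
print(spike_z_score([100, 110, 90, 100], 500) > 3)  # True
```

The forecast answers the "Monday morning" question by looking at what happened at the same point in earlier cycles, and the z-score gives a first indication of whether a spike is noise or a reason to scale substantially.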

The following table summarizes what we have talked about in this section:

Table 1. Level of smartness for auto-scalers.
              Knows extent of problem   Scales to size of problem   Reactive or Pro-active
Very smart    Yes                       Yes                         Pro-active

For reference, and some additional insights, please see our comparison of competing auto-scaling offerings updated for the fall of 2015 over at our blog.

Auto-scaling in the Elastisys cloud platform

Now that we know what features an auto-scaling solution should have, let’s look at how we have gone about implementing one! It all starts, as always, with some solid theory.

Monitor-Analyze-Plan-Execute loop

In autonomic computing, IBM identified the Monitor-Analyze-Plan-Execute (MAPE) loop, which expresses the actions taken by an autonomous system, like the auto-pilot in an aircraft or the cruise control feature in your car.

Monitor-Analyze-Plan-Execute (MAPE) loop, © Lars Larsson 2015, all rights reserved

In the Elastisys cloud platform, we have three separate systems responsible for parts of these tasks:

  • the monitoring subsystem is responsible for accepting monitoring information and storing information for each metric in a time-series database,
  • the prediction subsystem analyzes monitoring information and establishes a cloud deployment resize plan, and
  • the cloud pool subsystem executes the resize plan by interacting with the cloud provider.

Essentially, the monitoring subsystem contains our input and the cloud pool subsystem acts on our output. We believe in modularity, and we can swap out both monitoring databases and cloud pools to be compatible with various technologies. Cloud pools can interact with OpenStack-based clouds such as City Cloud, and with Amazon EC2 through its various APIs (we can work with Spot instances as well as regular on-demand ones, if that’s your cup of tea).
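
The division of labour between the three subsystems can be sketched as one pass through the MAPE loop. This is a conceptual illustration only: the function and parameter names are hypothetical, and the lambdas stand in for real monitoring databases, predictors, and cloud pools.

```python
import math
from typing import Callable, List


def mape_iteration(
    read_metrics: Callable[[], List[float]],     # monitoring subsystem (Monitor)
    predict_size: Callable[[List[float]], int],  # prediction subsystem (Analyze, Plan)
    resize_pool: Callable[[int], None],          # cloud pool subsystem (Execute)
) -> int:
    """Run one pass of the loop and return the desired deployment size."""
    samples = read_metrics()
    desired = predict_size(samples)
    resize_pool(desired)
    return desired


# Stand-ins for the real subsystems, with made-up numbers.
pool_requests: List[int] = []
size = mape_iteration(
    read_metrics=lambda: [1800.0, 2100.0, 2300.0],
    predict_size=lambda samples: math.ceil(max(samples) / 500),
    resize_pool=pool_requests.append,
)
print(size)  # 5
```

Because each subsystem is only known to the loop through a narrow interface, any of them can be swapped out without touching the others, which is the modularity argument made above.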

Prediction subsystem

At the very core of our auto-scaler is the prediction subsystem. It looks as follows:

Elastisys prediction subsystem components, © Elastisys AB 2015, all rights reserved.

A set of predictors takes different metrics into account, and each is asked to emit a predicted value for some time in the future. We call that time the “prediction horizon”, and it should be long enough to allow an instance to be requested and become fully operational. Does that take 15 minutes? Only five? The right answer depends on many factors, including application-specific ones. The cloud you are deploying your application in makes a big difference. City Cloud is very fast in our experience.

Prediction values are converted into compute units via a translation given by the system administrator (e.g. 500 requests per second per server). Compute unit predictions are fed into an aggregation function, so different weights can be assigned for each prediction. This makes it possible to prioritize predictions based on which metrics are most important, for example. Unless there is a good reason not to, simply taking the maximum value is a safe default. The single prediction is then fed into a set of policies, which help avoid too rapid scaling actions — in particular opposing ones.
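
The translation and aggregation steps can be illustrated as follows. This is a minimal sketch under our own assumptions: the function names, the second metric (active connections), and all numbers other than the 500 requests per second per server mentioned above are made up for the example.

```python
import math
from typing import Sequence


def to_compute_units(predicted_value: float, value_per_server: float) -> int:
    """Translate a predicted metric value into a number of servers."""
    return math.ceil(predicted_value / value_per_server)


def aggregate_max(unit_predictions: Sequence[int]) -> int:
    """Safe default: provision for the most demanding prediction."""
    return max(unit_predictions)


def aggregate_weighted(unit_predictions: Sequence[int], weights: Sequence[float]) -> int:
    """Weighted average, rounded up, to prioritize some predictions."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted = sum(p * w for p, w in zip(unit_predictions, weights))
    return math.ceil(weighted / total)


# Two hypothetical predictors: requests/s and active connections.
units = [
    to_compute_units(2000, 500),    # 4 servers
    to_compute_units(16000, 2000),  # 8 servers
]
print(aggregate_max(units))                 # 8
print(aggregate_weighted(units, [3, 1]))    # 5
```

Taking the maximum never under-provisions relative to any single predictor, which is why it is a reasonable default; a weighted scheme trades some of that safety for the ability to favour the metrics that matter most.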

Rapid opposing scaling actions are bad, because they lead to a pendulum effect of growing and shrinking your deployment. This causes needless strain on your application as, for example, members in a pool of servers come and go, possibly requiring updates to some shared registry or state. In control theory this is solved by introducing a deadband zone, which prohibits alterations unless the need for one is high enough. This type of cleverness, and more, is embedded in our scaling policies.

Deadband, courtesy of Krishnavedala (own work), via Wikimedia Commons (CC BY-SA 3.0).
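
A deadband can be expressed in a few lines. The sketch below is a generic illustration of the control-theory idea, not the actual Elastisys policy; the function name and the 10% tolerance are assumptions chosen for the example.

```python
def apply_deadband(current: int, proposed: int, tolerance: float = 0.1) -> int:
    """Keep the current size unless the proposed one differs by more than
    `tolerance`, expressed as a fraction of the current size."""
    if current <= 0:
        return proposed
    if abs(proposed - current) <= tolerance * current:
        return current  # inside the deadband: ignore the small change
    return proposed


# A jittery series of predictions around 10 servers, then a real increase.
size = 10
sizes = []
for prediction in [11, 9, 10, 14, 13]:
    size = apply_deadband(size, prediction)
    sizes.append(size)
print(sizes)  # [10, 10, 10, 14, 14]
```

Without the deadband, the same predictions would have caused five resizes, three of them in opposing directions; with it, the deployment changes size only once.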

Finally, the prediction is bounded via a set of scheduled limits. This gives the administrator full control over both the upper and lower bounds, which ensures a minimum of always-available capacity and that auto-scaling never breaks the budget. These allow the administrator full freedom to express fine-grained rules such as “keep these certain levels for our weekend peaks”, or “allow more capacity during workday evenings”.
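
Scheduled limits amount to clamping the prediction between bounds that depend on the time of day and week. The following sketch shows one way this could look; the class, the field names, the default bounds, and the example schedule are all hypothetical and only serve to illustrate the idea.

```python
from dataclasses import dataclass
from datetime import datetime, time
from typing import Sequence, Tuple


@dataclass(frozen=True)
class ScheduledLimit:
    weekdays: frozenset  # 0 = Monday ... 6 = Sunday
    start: time
    end: time
    min_size: int
    max_size: int

    def active(self, now: datetime) -> bool:
        return now.weekday() in self.weekdays and self.start <= now.time() <= self.end


def bound(prediction: int, now: datetime, schedule: Sequence[ScheduledLimit],
          default: Tuple[int, int] = (1, 10)) -> int:
    """Clamp the prediction to the first active limit, or to the default bounds."""
    low, high = default
    for limit in schedule:
        if limit.active(now):
            low, high = limit.min_size, limit.max_size
            break
    return max(low, min(prediction, high))


schedule = [
    # "keep these certain levels for our weekend peaks"
    ScheduledLimit(frozenset({5, 6}), time.min, time.max, min_size=5, max_size=30),
    # "allow more capacity during workday evenings"
    ScheduledLimit(frozenset(range(5)), time(17, 0), time(22, 0), min_size=2, max_size=20),
]

saturday_afternoon = datetime(2015, 10, 3, 14, 0)
print(bound(1, saturday_afternoon, schedule))   # 5
print(bound(50, saturday_afternoon, schedule))  # 30
```

The lower bound guarantees always-available capacity even if the predictors ask for less, and the upper bound caps spending even if they ask for more.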

Research has shown that no single algorithm can ever truly capture all aspects needed for optimal auto-scaling. Think of weather forecasting: nobody would assume that only a single model or a single metric (e.g. historical weather at this location, amount of cloud cover over the past few days, etc.) is used to accurately predict weather. Similarly, we often need to consider more than one type of input and use more than one type of model to predict server load.

Worldclouds 2009 by NASA Earth Observatory [Public domain], via Wikimedia Commons.

Elastisys offers a range of predictors, each suited for a particular task. Some are good at detecting recurring patterns, some are good at determining current demand variations. Others employ machine learning strategies. They are all there, ready to be activated if desired.

Users of our auto-scaling features do not need to worry about this complexity, though. We ship with good defaults, integrations with common server software, and monitoring agents that make reporting a breeze. Our research team is also working on automatically tuning these systems based on workload characterization. In our upcoming post, we will show how to set up our system on City Cloud, and how easy it is to integrate with the City Cloud load balancer. Stay tuned for updates!

This blog entry first appeared in both Swedish and English over at the City Cloud blog as part of a series on auto-scaling.