Metric Choice Matters for Intelligent Auto-scaling

What metric, or set of metrics, we choose to monitor greatly affects our ability to understand our applications. In this blog post, we shall see that the choice also affects how well we can perform intelligent auto-scaling.

Bad Metrics

Many auto-scaling systems appear to be designed solely with ease of configuration in mind. In a previous blog post, we saw that Microsoft Azure only allows one to scale based on either CPU utilization or queue length, and does not allow one to define custom data at all. Google considers custom data as input to their auto-scaler to be a beta feature. The vast majority of guides aimed at getting the reader started with auto-scaling show how to set up CPU usage-based auto-scaling. It is intuitively easy to understand, but there is a big problem with it.

We think this is limiting, and a sign of how simplistic some auto-scaling offerings truly are. The reason is straightforward:

All that sustained high CPU usage tells you is that the system is suffering, not by how much.

The whole point of auto-scaling is to figure out by how much we need to scale up to meet current demand, and to take the appropriate action. If our choice of metric prevents us from doing so accurately and in a timely fashion, it is no wonder that competing auto-scaling offerings converge so slowly!

To see why this is the case, consider a virtual machine that is asked to handle 4 times more load than it can. It will be overloaded, and its CPU usage will be close to maximum all the time. If we merely note that and scale up by a single additional virtual machine, both will be overloaded once the new one comes up: the load is still twice as high as either of them can deal with. So we scale up again, and again… That takes time, and performance is degraded needlessly while the deployment slowly converges to the size it should be.
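To make the arithmetic concrete, here is a minimal sketch in Python. The numbers are made up for illustration (each virtual machine handles 100 requests per second, demand is 400 requests per second), and the policies are simplified: a "CPU is high" signal only tells us to add one machine at a time, whereas a demand metric lets us compute the required size directly.

import math

# Illustrative assumptions: each virtual machine can serve 100 requests
# per second, and incoming demand is 400 requests per second.
CAPACITY_PER_VM = 100
DEMAND = 400

# CPU-based policy: all we know is "the machines are overloaded", so we
# add one machine per evaluation round and wait for it to boot.
vms = 1
rounds = 0
while DEMAND > vms * CAPACITY_PER_VM:
    vms += 1      # add a single VM, since CPU usage gives no magnitude
    rounds += 1   # each round costs an evaluation period plus boot time
print(f"CPU-based scaling: {rounds} rounds to reach {vms} VMs")

# Demand-based policy: a metric that reflects actual load lets us jump
# straight to the required deployment size in a single step.
needed = math.ceil(DEMAND / CAPACITY_PER_VM)
print(f"Demand-based scaling: 1 round to reach {needed} VMs")

Every extra round costs an evaluation period plus instance boot time, and users see degraded performance for the whole duration.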

The general rule here is that CPU usage, and similar metrics, are bad if they:

  • relate to only a single machine, and/or are limited by the capacity of a single machine, or
  • measure too large a part of the application, one consisting of several components.

An example of the latter is measuring, e.g., a single Apdex value for the entire application. If it is too low, it is completely ambiguous what we should do to fix it. Will adding a new web server help? Is the delay due to the application logic or the database layer? We would not know, because the metric is not fine-grained enough.

Good Metrics

Now that we know what bad metrics look like, what is a good metric? Good metrics are ones that:

  • are related to a single application layer (deployed on a number of instances and consisting of services or microservices), and
  • are not capped by current capacity, but rather depend only on end-user usage levels.

Determining response times from the database layer as a whole, for instance, is useful. If we know our application well enough to say that a single application logic instance should be able to deal with X requests per second, we can use the number of load-balanced requests for the application logic layer as a whole as a good metric. The same goes for the front-end web servers.
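As a sketch of how such a metric could feed a scaling decision, here is an illustrative Python function. The per-instance capacity, the headroom factor, and the numbers are assumptions for the example, not a description of any particular product's algorithm.

import math

def required_instances(total_request_rate, per_instance_capacity, headroom=0.2):
    """Size an application layer from a capacity-independent metric:
    the total request rate seen by the load balancer in front of it."""
    # Keep some headroom so normal traffic fluctuations do not
    # immediately push the layer back into overload.
    return max(1, math.ceil(total_request_rate * (1 + headroom) / per_instance_capacity))

# Illustrative values: the load balancer reports 450 requests/second and
# one application logic instance has been benchmarked at 100 requests/second.
print(required_instances(450, 100))  # -> 6 instances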

We get layer-specific insight from metrics such as those coming from a load balancer in front of an application layer, or from the emptying rate of a work queue. They will accurately show by how much our current cloud deployment size is wrong.
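The same reasoning applies to a queue-backed worker layer. A rough sketch, again with hypothetical numbers: size the worker pool so that it both keeps up with the arrival rate and drains the current backlog within a target time.

import math

def workers_needed(arrival_rate, service_rate_per_worker, backlog, drain_seconds):
    """Size a worker pool from queue metrics: keep up with new work and
    drain the existing backlog within a target time."""
    # arrival_rate and backlog come from the queue itself, so they are
    # not capped by the current number of workers.
    backlog_rate = backlog / drain_seconds
    return max(1, math.ceil((arrival_rate + backlog_rate) / service_rate_per_worker))

# 50 jobs/s arriving, each worker processes 10 jobs/s, 1200 jobs queued,
# and we want the backlog gone within 60 seconds: 7 workers.
print(workers_needed(50, 10, 1200, 60))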

Good metrics like these can be tracked, patterns can be learned, statistics can be calculated, and intelligent auto-scaling can use them to ensure your users get a responsive service. Elastisys Cloud Platform does all this, and more, and we will gladly help you get up to speed. Check out our product features and our professional services, where we can help you gain actionable insight into your cloud application.

What metrics do you typically track and base auto-scaling decisions on? Let us know in the comments below!
