Operational Instances In Spite of Errors

We have blogged about the various ways the Elastisys Cloud Platform keeps your cloud deployment rightsized, and why it is important. But the key to ensuring a good user experience is not just having the right number of cloud instances, but also making sure that they are all fully operational. In this blog post, we look more into what that means and how the Elastisys Cloud Platform can help using its Health Checks, Cruise Control, and Cloud Pools features.

A Wild Error Appears!

Say that your application is deployed in the cloud, and everything is humming along nicely. All instances are operational, end-users are enjoying your application, and heck, the sun is shining, too. Suddenly, an error starts disrupting your application. Perhaps it was a bug introduced by the latest release or a sudden EBS failure. Whatever the reason, the error must be dealt with. Quickly.

Quick Note on Pets vs. Cattle

Cloud computing has let us think of servers as cattle rather than pets (linked is a rather lengthy but interesting presentation by Randy Bias of EMC/Cloudscaling). If a pet server experiences an error, you log in and gently care for it: read a few log files, restart a few services, and so on. If a cattle server stops being operational, you replace it. Since you are a diligent administrator or DevOps expert, of course you also try to understand the cause. The key difference is speed: with cattle, you can be up and running again quickly and investigate later. Perform a post-mortem, if you will. With pets, you investigate while the problem is still affecting your service, because you just can’t kill Fluffy!

Cattle-care 101

Having adopted a modern cattle-centric view of cloud instances, how do we keep our applications operational in the face of errors?

  • Identify errors, so we can act accordingly.
  • Isolate errant instances, to prevent further disruption to the application.
  • Replace errant instances, since isolation means a reduction of your deployment size and your application needs the capacity.
  • Investigate root causes of errors, so we can prevent them in the future. Possibly make the entire application use cached data instead of dynamic data, to prevent further corruption.

Time is critical for the first three tasks. The more we can handle those via automation, the better. A human operator is the only one who can truly perform the last task, and automation of the first three means more time for the operator to work out the solution.

Operational Support via Elastisys Cloud Platform

How can the Elastisys Cloud Platform help an operator identify, isolate, and replace non-operational instances?

Identifying Errors

The Elastisys Cloud Platform includes monitoring capabilities that place events on a time scale in a time series database. This guides the predictive auto-scaling process, but also serves as a great way to record when events, erroneous or otherwise, occurred. By logging data into the database through its simple API, you can trace important events and act upon them.
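As a rough illustration of what placing events on a time scale means, the sketch below keeps timestamped events in memory and retrieves those that fall within a time window. The `Event` and `EventLog` names and their methods are hypothetical, made up for this example; they are not the Elastisys Cloud Platform API, and a real deployment would log to the platform's time series database instead.

```python
import bisect
from dataclasses import dataclass, field


@dataclass(order=True)
class Event:
    timestamp: float                     # seconds since the epoch
    source: str = field(compare=False)   # e.g. an instance id
    message: str = field(compare=False)


class EventLog:
    """Hypothetical in-memory stand-in for a time series store."""

    def __init__(self):
        self._events = []  # kept sorted by timestamp

    def record(self, timestamp, source, message):
        # insort keeps the list ordered even if events arrive out of order.
        bisect.insort(self._events, Event(timestamp, source, message))

    def between(self, start, end):
        """Return events with start <= timestamp <= end, oldest first."""
        return [e for e in self._events if start <= e.timestamp <= end]


log = EventLog()
log.record(30.0, "i-2", "disk error")
log.record(10.0, "i-1", "deployed new release")
log.record(20.0, "i-2", "response time above normal")
recent = log.between(15.0, 30.0)
```

Here `recent` holds the two events reported by `i-2`, in the order they occurred, which is the kind of view an operator needs when working out what led up to an error.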

Some errors are only noticeable when the system is invoked in certain ways. For that, we offer a health checking framework that helps identify such errors by interacting with your application. If errors, such as values outside of the norm, are found, your operations team is notified and automatic management can begin.
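To make the idea of "values outside of the norm" concrete, here is a minimal sketch of a health check that probes an instance and compares the result to an expected range. Everything in it (`HealthCheck`, `find_unhealthy`, the probe function, and the thresholds) is an assumption made for illustration, not the actual health checking framework.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class HealthCheck:
    name: str
    probe: Callable[[str], float]   # takes an instance id, returns a measurement
    low: float                      # lowest value considered normal
    high: float                     # highest value considered normal

    def is_healthy(self, instance_id: str) -> bool:
        try:
            value = self.probe(instance_id)
        except Exception:
            return False            # a probe that fails counts as unhealthy
        return self.low <= value <= self.high


def find_unhealthy(instances: List[str], checks: List[HealthCheck]) -> Dict[str, List[str]]:
    """Map each failing instance id to the names of the checks it failed."""
    failures = {}
    for instance_id in instances:
        failed = [c.name for c in checks if not c.is_healthy(instance_id)]
        if failed:
            failures[instance_id] = failed
    return failures


# Fake measurements standing in for real requests to the application.
latencies = {"i-1": 0.12, "i-2": 4.8}
latency_check = HealthCheck("latency", latencies.__getitem__, low=0.0, high=1.0)
failures = find_unhealthy(["i-1", "i-2", "i-3"], [latency_check])
```

In this example `i-2` fails because its latency is outside the normal range, and `i-3` fails because the probe cannot reach it at all. Treating an unreachable instance as unhealthy is a design choice of the sketch; some applications may prefer to retry first.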

Isolating Errant Instances

Our Cruise Control feature steps in when health checks indicate an error condition. It can isolate the offending instance and, after doing so, mark it for replacement. Isolation strategies include:

  • shutting down the instance,
  • keeping it running but isolating its network connectivity,
  • unmounting elastic storage units (so they can be mounted elsewhere), or
  • a reasonable combination of the above.

Which action is appropriate is application-dependent.
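One way to picture "a reasonable combination" of isolation strategies is as a set of flags that an application picks from. The sketch below is purely illustrative: the `Isolation` flags, the `isolation_steps` function, and the ordering of the steps are assumptions, not how Cruise Control is configured.

```python
from enum import Flag, auto


class Isolation(Flag):
    SHUT_DOWN = auto()
    CUT_NETWORK = auto()
    UNMOUNT_STORAGE = auto()


def isolation_steps(strategy):
    """Return the ordered list of steps for a (possibly combined) strategy."""
    steps = []
    # Assumed ordering: stop traffic first, then release storage, then power off.
    if Isolation.CUT_NETWORK in strategy:
        steps.append("cut network connectivity")
    if Isolation.UNMOUNT_STORAGE in strategy:
        steps.append("unmount storage")
    if Isolation.SHUT_DOWN in strategy:
        steps.append("shut down instance")
    return steps


# Keep the instance running for later inspection, but take it out of service
# and free its storage so it can be mounted elsewhere.
keep_for_inspection = Isolation.CUT_NETWORK | Isolation.UNMOUNT_STORAGE
steps = isolation_steps(keep_for_inspection)
```

A stateless web server might simply use `Isolation.SHUT_DOWN`, while an instance holding valuable data could use the combination shown above.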

Replacing Errant Instances

Cloud pools manage the size of your deployment. Their main function is to be told what deployment size is desired, and to keep it at that level. But, through their API, they can also be told that an instance is in one of the following states:

Instance membership status in cloud pools

                 active     not active
evictable        default    disposable
not evictable    blessed    awaiting service

Cruise Control determines the fate of an instance. Should it be replaced by another, it is marked as disposable. Should it be kept around until an operator has investigated it, it is marked as awaiting service. The cloud pool carries out the command, and keeps the deployment at the desired size, operational in spite of errors.
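The four membership states and the pool's behavior can be sketched as follows. This is a toy model under stated assumptions: the `MembershipStatus` and `CloudPool` classes, the instance ids, and the reading that a disposable instance is terminated while an instance awaiting service is kept but no longer counted are ours for illustration, and the sketch is not the real cloud pool API. It also only handles replacement, not scaling down.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MembershipStatus:
    active: bool      # counts towards the desired size
    evictable: bool   # may be terminated by the pool


DEFAULT = MembershipStatus(active=True, evictable=True)
BLESSED = MembershipStatus(active=True, evictable=False)
DISPOSABLE = MembershipStatus(active=False, evictable=True)
AWAITING_SERVICE = MembershipStatus(active=False, evictable=False)


class CloudPool:
    """Hypothetical toy pool; not the real cloud pool API."""

    def __init__(self, desired_size):
        self.desired_size = desired_size
        self.members = {}      # instance id -> MembershipStatus
        self._next_id = 0
        self.reconcile()

    def set_status(self, instance_id, status):
        self.members[instance_id] = status
        self.reconcile()

    def active_count(self):
        return sum(1 for s in self.members.values() if s.active)

    def reconcile(self):
        # Terminate members that are both inactive and evictable.
        for instance_id, status in list(self.members.items()):
            if not status.active and status.evictable:
                del self.members[instance_id]
        # Launch replacements until enough active members exist.
        while self.active_count() < self.desired_size:
            self._next_id += 1
            self.members[f"i-{self._next_id}"] = DEFAULT


pool = CloudPool(desired_size=3)            # launches i-1, i-2, i-3
pool.set_status("i-1", DISPOSABLE)          # i-1 is terminated and replaced
pool.set_status("i-2", AWAITING_SERVICE)    # i-2 is kept for the operator
```

After these calls the pool still has three active instances. `i-1` is gone, while `i-2` remains available for investigation without serving as part of the deployment.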

Summary

In this post, we have shown how to keep a cloud application operational in the face of errors. We have seen that error-handling in a cattle-centric view of cloud instances is divided into four tasks: identifying, isolating, and replacing errant instances, followed by error investigation. The first three are time-critical tasks, and we have seen how the Elastisys Cloud Platform can help automate them. Thus, operators get more time to investigate and prevent similar errors in the future, while the application is kept operational for end-users.

How do you keep your cloud application operational? Do you view your servers as cattle or as pets? How does that affect your error-handling strategies? Let us know in the comments below!
