The Elastisys Tech Blog

A Fortnight in the Life of the Welkin DevOps Team

March 19, 2021
11:26 am

At Elastisys, we develop our own open source Kubernetes distribution, Welkin, and operate it as a managed service for customers. Both product development and operation of the managed service are handled by the DevOps team. In this blog post, we’ll give you some insight into the daily life of the team, starting top down with the operations part.

Managed Service Operation

Operating our managed service offering is all about being proactive and providing a worry-free experience for the customers. To do this, we combine automated monitoring and alerting with manual inspection of the clusters. The idea is to catch issues early and fix them before they affect the cluster or the workload. Each customer is assigned a Delivery Manager who is responsible for the smooth operation of the cluster.

Way of work

In the DevOps team, we normally assign one on-call engineer each week. The on-call engineer is the first line of contact for the customers and responds to their tickets and ad-hoc questions during on-call hours. He or she also performs the day-2 tasks on the production clusters. If time permits, the on-call engineer also picks up ops tasks from the backlog, for example improving operations tooling or documentation.

Logs

To track the operations lifecycle and simplify compliance audits of each environment, we keep a number of logs. These are:

Decision Log
- This log is owned by the Delivery Manager and contains records of all changes that have been made since the environment was commissioned.
Operator Log
- When an engineer makes a change to an environment, a brief log of what was performed is created. The log contains information about the change, the tooling used, and often a shell dump of actually running the commands.
Uptime stats
- We continuously track the uptime of, for example, the control plane, the monitoring subsystem, the SSO system, any databases, and other components of interest.
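
As an illustration of the uptime tracking described above, here is a minimal sketch of how an uptime percentage could be derived from recorded outage intervals. The function name `uptime_percent` and the data layout are assumptions made for this example, not part of our actual tooling, and the sketch assumes the outage intervals do not overlap.

```python
from datetime import datetime, timedelta


def uptime_percent(window_start, window_end, outages):
    """Return uptime as a percentage of the reporting window.

    `outages` is a list of (start, end) datetime pairs. This is a
    hypothetical layout; overlapping outages are NOT merged here.
    """
    window = (window_end - window_start).total_seconds()
    if window <= 0:
        raise ValueError("window_end must be after window_start")
    down = 0.0
    for start, end in outages:
        # Clip each outage to the reporting window.
        start = max(start, window_start)
        end = min(end, window_end)
        if end > start:
            down += (end - start).total_seconds()
    return 100.0 * (window - down) / window


# Example: one 30-minute outage during a 30-day window.
start = datetime(2021, 3, 1)
end = start + timedelta(days=30)
outages = [(datetime(2021, 3, 10, 12, 0), datetime(2021, 3, 10, 12, 30))]
print(f"{uptime_percent(start, end, outages):.3f}%")  # prints 99.931%
```

The same calculation could be run per component (control plane, monitoring, SSO, and so on) to produce one figure for each.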

Tooling

We use a number of fairly standard tools to make life in the DevOps team as smooth as possible. The most commonly used are:

Jira Service Desk
- Customer and internal tickets
OpsGenie
- On-call scheduling and alerting
GitHub
- Backlog, issues, operator logs, incident reports
Run Book
- Description of processes and practices for the DevOps team regarding operation of the managed service
Google Drive
- Meeting protocols, decision logs

Upgrades and maintenance

For planned upgrades, we schedule maintenance windows for each cluster, normally each month. During the maintenance window we upgrade the underlying OS and Welkin to the latest stable release, including security patches. For unplanned maintenance, a decision has to be made, either internally or in cooperation with the customer. A record of the decision has to be entered in each cluster’s decision log before the change is made.

For both planned and unplanned maintenance, an operator log should be created and uploaded to GitHub.

Issue management

When an issue is detected in a cluster, whether manually by the DevOps team or the customer, or automatically by alerting, it is the responsibility of the on-call engineer to take first action. For most customers, our SLA dictates a certain response time. Most of the time, the on-call engineer handles the issue alone, but sometimes more people get involved, for example if specific expertise is needed or the cause of the problem is not clear. Each problem is different, but we use the same strategy for all of them: Observe-Orient-Decide-Act, i.e. first make sure that you understand the problem before acting on it.

Product Development

Let’s move on to the development side. Our development process is fairly standard and based on the scrum methodology with two-week sprints, but let’s go through it quickly anyway. Our main principle is that all work should be measurable and visible, so we try to create issues for everything we do within the team.

We start by gathering requirements, which can come from a variety of sources, for example:

Architecture decision records
- Recurring architecture meetings led by the architect are held within the team
Customer feature requests
Open Source contributors
Sales or Marketing
- Recurring requirements meetings are held to gather these kinds of requirements
Internal requirements
Bug reports

Once collected, the requirements are converted to issues and fed into a backlog. Each sprint, we groom the backlog by going through the issues, prioritizing them and assigning story points.

Sprint lifecycle

Let’s start with the last working day of a sprint, day 10. During this day we demo new features that have been added during the sprint, reflect on what has been good and bad during the sprint, and plan for the upcoming sprint. During sprint planning, we pick tasks from the “include in next sprint” column, populated during the Task Force meeting, until we have enough story points to match the available development resources for the sprint. Tasks for QA and release creation are also included. We also add Ops tasks, to be picked up primarily by the on-call engineer.
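
The task selection during sprint planning can be thought of as filling a capacity with story points. Below is a minimal sketch of that idea, assuming the candidate tasks are already sorted by priority. The `Task` class, the `plan_sprint` function, and the example task titles are hypothetical names for illustration; in practice the selection is a team discussion rather than an algorithm.

```python
from dataclasses import dataclass


@dataclass
class Task:
    title: str
    points: int  # story points assigned during backlog grooming


def plan_sprint(candidates, capacity):
    """Pick tasks in priority order while they fit the remaining capacity.

    Returns the selected tasks and the unused story points.
    """
    selected = []
    remaining = capacity
    for task in candidates:
        # Skip tasks that are too large for what is left of the sprint.
        if task.points <= remaining:
            selected.append(task)
            remaining -= task.points
    return selected, remaining


# Example with made-up tasks and a capacity of 10 story points.
column = [
    Task("Upgrade ingress controller", 5),
    Task("Rework backup tooling", 8),
    Task("Prepare release and QA", 3),
]
selected, left = plan_sprint(column, capacity=10)
print([t.title for t in selected], left)
```

With these made-up numbers, the first and third tasks fit, the 8-point task is left for a later sprint, and 2 story points remain unallocated.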

Days 1-9 follow a standard scrum pattern: the developers are free to pick and choose from the tasks planned for the current sprint, and each morning we hold a stand-up. We try to keep the stand-ups short and move any longer discussions to a team meeting, a smaller meeting, or an ad-hoc conversation on, for example, Slack. Sometimes issues need to be escalated to, for example, the architect, the product owner, or management.

Pull requests

When development is finished for a task, a Pull Request is created in the corresponding repo on GitHub. A pull request lets other team members know what changes the developer has made and wants to merge into the main branch. The team can then discuss and comment on the change, and normally some additional modifications are made before the PR is approved. For most Elastisys products, two reviewers need to approve a change before it can be merged. A CI/CD pipeline that sets up a cluster with the changes and runs unit tests and other checks is also run on each PR, and it must pass before merging is possible.
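
The merge rule above boils down to two conditions: enough distinct approvals and a passing pipeline. The sketch below expresses that rule as a small function. It is an illustration only: `can_merge` and its parameters are hypothetical names, it does not call the GitHub API, and on GitHub such a rule would normally be enforced by repository settings rather than custom code.

```python
def can_merge(approvers, ci_passed, author=None, required_approvals=2):
    """Return True if a PR satisfies the review and CI requirements.

    `approvers` is a list of reviewer names; duplicates and the PR
    author (if given) do not count towards the required approvals.
    """
    unique_reviewers = set(approvers)
    unique_reviewers.discard(author)
    return ci_passed and len(unique_reviewers) >= required_approvals


# Two distinct reviewers and a green pipeline: merge is allowed.
print(can_merge(["alice", "bob"], ci_passed=True, author="carol"))   # True
# The same reviewer approving twice only counts once.
print(can_merge(["alice", "alice"], ci_passed=True, author="carol"))  # False
```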

QA and release

At the end of every other sprint, a new version of Welkin is released. Normally, two engineers work on producing the release. A release branch is created and the changelog is reset. After that, the engineers set up two clusters: one installed from scratch and one upgraded from the previous release. QA is then performed on both clusters, verifying that everything works as expected, from the installation process to the end-user functionality and experience. If bugs are encountered, fixes are normally developed and cherry-picked onto the release branch. When all tests pass, the new version is released.

This brings us up to day 10 again, and the cycle restarts.
