In addition to offering a fully managed Kubernetes platform as a service based on Welkin, Elastisys also offers DevOps consultancy services. With permission from our client, this blog post describes the work our DevOps engineers implemented.
Key Results
- Single-click on-demand deployment
- 💰 Significant operational cost reductions
- 🤑 25% reduction in overall infrastructure costs
- ❤️ Increased reliability
Background
A common problem with trunk-based development against a shared environment is that the trunk and the environment break frequently. Integration testing often happens only after new code has been merged, allowing environment-breaking bugs to slip through. Where integration tests are slow, some teams use merge trains, which can obscure which merge actually broke the environment. This interrupts other developers and makes it difficult to know whether a task is actually completed, or whether it is unfinished and introduces bugs into the application.
In order to stabilise the trunk and the shared environment, the idea of temporary feature environments was born. Could the new features be deployed and tested before merging to trunk?
The Solution
Arriving at the current solution was a multi-step process, and the result of leveraging previous IaC and automation improvements. The pieces needed are:
- An autoscaling Kubernetes cluster in a cloud environment, such as GKE on Google Cloud.
- Helm chart(s) or manifests for deploying required databases, message queues and other dependencies
- Flexible application Helm charts and application configuration – everything environment-specific should be parameterized
- A CI/CD pipeline set up for deploying to Kubernetes (push-based GitOps)
- A wildcard DNS record, such as *.preview.mycompany.com, pointing to the load balancer of the cluster
- A matching wildcard certificate
Because we were already using Google Cloud for other environments, we decided to go with a GKE cluster set up with an autoscaling node-pool. This cluster is managed in an IaC fashion, but a proof-of-concept cluster can easily be created with a few clicks in the Google Cloud Console.
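For a quick experiment, a comparable autoscaling cluster can also be created from the command line; here is a sketch in which the cluster name, zone and node counts are placeholders, not our actual setup:

# Hypothetical example: cluster name, zone and node counts are placeholders.
gcloud container clusters create preview-cluster \
  --zone europe-north1-a \
  --num-nodes 1 \
  --enable-autoscaling --min-nodes 1 --max-nodes 10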
The Goal of Empowering Developers with Self-service Environments
Now that we had a Kubernetes cluster, the next step was to decide how to handle deployments. At one point, the idea of an optimal solution was phrased as “a developer should be able to create and deploy to a preview environment with the push of one button”. This turned out to be a very useful goal to aim for during the design of the feature.
The first question that popped up was where to put this button. Since one of the main use cases was to test a merge request before merging, the CI system turned out to be the most natural place for it. We already used the CI/CD pipelines to deploy to Kubernetes. Here we use GitLab CI/CD, but most CI/CD tools should support similar functionality. We ended up implementing the button as a manually triggered job, like this (.gitlab-ci.yml):
"deploy:preview":
stage: deploy
rules:
- if: '$CI_COMMIT_BRANCH == "master"'
when: never
- when: manual
allow_failure: true
script:
- bash deploy-preview.sh
This adds an optional job to the deploy stage, which is triggered manually from the GitLab UI. The job is part of the pipeline of a backend repository and deploys the version that corresponds to the commit.
We already had a job for building images, but it was only triggered on merge, so we added a rule to trigger it manually as well, using the same when: manual rule as above.
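For illustration, such a build job could look roughly like this; the job name, the docker commands and the use of the predefined $CI_REGISTRY_IMAGE and $CI_COMMIT_SHORT_SHA variables are our assumptions, not necessarily the project's exact setup:

"build:image":
  stage: build
  rules:
    - if: '$CI_COMMIT_BRANCH == "master"'   # build automatically on trunk
    - when: manual                          # allow manual builds on feature branches
      allow_failure: true
  script:
    - docker build -t "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA" .
    - docker push "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA"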
Figuring Out the Details
Now that we had our deploy button, we needed to solve the remaining problems, namely how to:
- Refer to the different deployment parts as a whole
- Seed the database with test data
- Handle access to the environment – DNS and HTTPS/TLS
- Manage cleanup and minimize costs
To solve the first problem, we use a unique identifier based on the feature branch name, with any incompatible characters removed (a slug). It is used to refer to the different deployed components, and in the URL used to access the environment: https://<branch-slug>.preview.mycompany.com. Kubernetes namespaces use the format <branch-slug>-backend, <branch-slug>-frontend, and so on.
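Conveniently, GitLab CI/CD already provides such a slug as the predefined variable $CI_COMMIT_REF_SLUG, so the deploy jobs can derive all names from it. A minimal sketch, with illustrative variable names:

# $CI_COMMIT_REF_SLUG is the branch name lowercased, truncated,
# and with unsafe characters replaced by '-' (set by GitLab CI).
SLUG="$CI_COMMIT_REF_SLUG"
BACKEND_NAMESPACE="${SLUG}-backend"
FRONTEND_NAMESPACE="${SLUG}-frontend"
PREVIEW_HOST="${SLUG}.preview.mycompany.com"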
For the actual deployment, we use Helm, but Kustomize should work just as well as long as all of the resources are created inside the unique namespaces.
The deploy script does more than just configure namespaces and run helm upgrade --install. Once the deployed database pod has spun up, the deploy script connects to it and inserts initial data. This dataset is configurable and can be anything from a small dataset kept in the Git repository to a huge database dump from a storage bucket.
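As an illustration only, deploy-preview.sh might look something like this sketch; the chart path, label selector, database user and seed file are all placeholders rather than the actual implementation:

#!/usr/bin/env bash
set -euo pipefail

SLUG="$CI_COMMIT_REF_SLUG"
NAMESPACE="${SLUG}-backend"

# Create the namespace if it does not already exist.
kubectl create namespace "$NAMESPACE" --dry-run=client -o yaml \
  | kubectl apply -f -

# Deploy the application and its database into the unique namespace.
helm upgrade --install backend ./charts/backend \
  --namespace "$NAMESPACE" \
  --set hostname="${SLUG}.preview.mycompany.com"

# Wait for the database pod, then seed it with the configured dataset.
kubectl wait pod --selector app=database \
  --for=condition=ready --namespace "$NAMESPACE" --timeout=300s
kubectl exec --namespace "$NAMESPACE" --stdin deploy/database -- \
  psql --username app < seed/initial-data.sql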
With the new preview environment up and running, we needed a way of quickly making it accessible to all users, for all use cases. We solved this problem by using a wildcard DNS record in combination with a wildcard certificate. With this solution, *.preview.mycompany.com points to the load balancer of the cluster, and routing is done based on hostname by the ingress controller. For example, an Ingress for the preview environment myfeature.preview.mycompany.com would look something like this:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-application
spec:
  rules:
    - host: {{ .Values.hostname | quote }}
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-service
                port:
                  number: 8080
  tls:
    - hosts:
        - {{ .Values.hostname | quote }}
For HTTPS, we generate a wildcard certificate for *.preview.mycompany.com using the DNS01 challenge. DNS01 requires DNS credentials to prove domain ownership, so this step is performed in a safe environment to avoid exposing them unnecessarily. Both certbot and cert-manager support using webhooks to automate this validation for any DNS provider.
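With cert-manager, for example, the wildcard certificate can be declared as a Certificate resource. In this sketch the namespace, secret name and issuer name are placeholders for whatever your ingress controller and DNS01-enabled ClusterIssuer use:

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: preview-wildcard
  namespace: ingress-nginx        # wherever the ingress controller looks up TLS secrets
spec:
  secretName: preview-wildcard-tls
  dnsNames:
    - "*.preview.mycompany.com"
  issuerRef:
    name: letsencrypt-dns01       # a ClusterIssuer configured for the DNS01 challenge
    kind: ClusterIssuer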
Lessons Learned: Automatic Cleanup
One thing we learned from a previous approach is that making the users of a temporary environment responsible for deleting it when it is no longer needed does not work. It frequently results in unused cloud infrastructure going unnoticed for long periods of time, incurring unnecessary costs. To deal with this problem, we inverted the responsibility: instead of being responsible for deletion, users are responsible for preventing deletion of their environment. This was implemented in the following fashion:
- Each Kubernetes namespace that is part of a preview environment is given an annotation, deletion-date, set to the current date:
kubectl annotate namespace $NAMESPACE deletion-date="$(date -Idate)" --overwrite
- A Kubernetes CronJob runs in the cluster every evening, deleting the resources contained in namespaces whose deletion-date matches the current date or has passed. A simplified version of the cleanup script follows, with a sketch of the CronJob wrapping it after that:
# Simplified example
DATE=$(date -Idate)
NAMESPACES=$(kubectl get namespace \
  -o jsonpath='{.items[*].metadata.name}')
for ns in $NAMESPACES; do
  deletion_date=$(kubectl get namespace "$ns" \
    -o jsonpath='{.metadata.annotations.deletion-date}' \
    || true)
  # ISO dates compare correctly as strings:
  # delete when deletion-date <= today.
  if [ -n "$deletion_date" ] \
    && [ ! "$deletion_date" \> "$DATE" ]; then
    kubectl delete namespace "$ns"
  fi
done
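The script runs in the cluster as a CronJob; a trimmed-down manifest might look like the following, where the image, schedule and script mounting are assumptions, and the service account needs RBAC permissions to get, annotate and delete namespaces:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: preview-cleanup
spec:
  schedule: "0 20 * * *"                      # every evening at 20:00
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: preview-cleanup # RBAC to list/delete namespaces
          restartPolicy: OnFailure
          containers:
            - name: cleanup
              image: bitnami/kubectl          # any image with kubectl and a shell
              # The cleanup script above, mounted from a ConfigMap (volume omitted).
              command: ["bash", "/scripts/cleanup.sh"]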
- An additional CI job (button) in the GitLab UI sets the deletion-date of the corresponding environment to a few days in the future, as sketched below.
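As an illustration, with the job name, the three-day extension and the single backend namespace all being placeholder choices:

"extend:preview":
  stage: deploy
  rules:
    - if: '$CI_COMMIT_BRANCH == "master"'
      when: never
    - when: manual
      allow_failure: true
  script:
    # Push the deletion date a few days into the future (GNU date syntax).
    - kubectl annotate namespace "$CI_COMMIT_REF_SLUG-backend" deletion-date="$(date -Idate --date='+3 days')" --overwrite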
To anyone who has seen Lost, this setup should be familiar. Every few days, someone needs to push the button in the GitLab UI to extend the deletion date, or the associated preview environment will be wiped. Combined with scaling all pods down nightly while the environments sit unused, this turned out to be surprisingly effective.
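The nightly scale-down amounts to a one-liner per namespace, assuming (for this sketch) that all workloads are Deployments:

# Scale every Deployment in a preview namespace down to zero replicas.
kubectl scale deployment --all --replicas=0 --namespace "$ns"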
Note: GitLab CI/CD does offer some built-in functionality for managing temporary environments per merge request, such as deleting them when the branch is merged. However, it did not offer the flexibility we needed.
The Result
- Single click on-demand deployment of the entire environment in a predictable manner, including database population, facilitating full-stack QA and integration testing of new features.
- Significant operational cost reductions, since application developers no longer need to request environments from their administrators.
- 25% decrease in overall cloud infrastructure costs, thanks to the on-demand nature of these environments.
- Increased reliability, because environments could be used to perform upgrades on platform components and easily assess the impact of such changes.
Our objective was to enable earlier testing and integration between frontend and backend, in order to keep the trunk branch healthy. A common goal in trunk-based development is that any commit on the trunk should be deployable. Before we used preview environments, a broken dev environment (due to the trunk being automatically deployed to it) and a broken trunk branch were not uncommon. After a few months of using preview environments, along with other improvements to automated testing, a broken dev environment was almost as rare as issues in production.
We had definitely achieved our goal, but it turned out that the feature had multiple other benefits. The preview environments were effective not only for testing application code, but also for testing configuration changes that require a complete environment. Version upgrades of infrastructure components, such as databases and message queues, could be tested in isolation.
One thing that surprised us was how stable these environments were. We had expected them to break down regularly, be incompatible with new changes, and require manual attention. However, this almost never happened. Deployment mistakes rarely happen when deploying is just a button push in a UI. Due to the short lifespan of the environments, configuration drift was not an issue. As a result, we were able to greatly cut down the hours spent on maintaining environments and assisting developers with deployments.
Another unexpected benefit was infrastructure cost reduction. Many types of environments, such as dev, demo or test environments, are only used during certain parts of the day. On-prem that may not matter much, but on a cloud platform you pay for 100% of the time even if the environment is only used during office hours (roughly a third of the day). By replacing seldom-used or semi-temporary environments with preview environments that are automatically removed or scaled down during off-hours, we were able to reduce cloud infrastructure costs by 25% in the following months. At the same time, usage of these environments increased, because they were so easy to create.