At Civo we believe strongly in Sprint Zero - this means that when we kick off a project, before we start writing the actual functionality we consider (and often implement) how it will be tested and how it will be hosted. One of the key requirements for us is a zero-downtime deploy. We want to deploy at all times of the day and, yes, even on Fridays, so being able to do this without the website or API going into maintenance mode was critical.

First iteration - Debian packages

The first version of Civo, before we really invited any members of the public to join the beta, was hosted on regular Ubuntu 14.04 instances using Debian packages. We used a custom-written build system that utilised the Pkgr gem. This gem takes a folder and bundles it up, using Heroku's open-source buildpacks, into a Debian package that can be installed on the Ubuntu servers like any other package. Our system also used gitreceive to receive a Git push and pass it to Pkgr to be built, and then acted as a Debian apt server to serve the resulting package to our internal servers.
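As a rough sketch of that pipeline (the paths, version and package name here are illustrative, not our exact configuration):

```shell
# On the build server: package the app folder with Pkgr, which wraps
# Heroku's open-source buildpacks to produce a .deb.
# (Illustrative sketch - paths, names and version are hypothetical.)
gem install pkgr
pkgr package /srv/builds/civo-dotcom --name civo-dotcom --version 1.2.3

# The resulting .deb is published via the internal apt repository, so each
# Ubuntu instance installs and upgrades it like any other package:
sudo apt update && sudo apt install civo-dotcom
```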

Initially we ran on a single instance for ease, but we quickly knew we wanted to host it in a highly available way. We launched a second instance very soon after, but realised that deploying to both instances was a bit of a pain. Nothing too bad - it was easily fixed by using csshX to run apt update ; apt install civo-dotcom on both instances at once to get the latest version of the software installed.
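In practice (with hypothetical hostnames) that looked like:

```shell
# Open a synchronised SSH session to both web instances so commands are
# typed once and mirrored to each (hostnames are hypothetical).
csshX root@web1.internal root@web2.internal

# Then, in the mirrored session, upgrade to the latest package:
apt update ; apt install civo-dotcom
```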

However, this still felt very old-school.

Second iteration - Flynn

We had played a little with Dokku at this stage, deciding it's a really useful tool for Heroku-like deployments on a single server. Indeed, we still use it to this day for small internal apps that don't need scaling/HA and aren't business critical. We wanted something similar to Dokku/Heroku, but running on multiple machines and without the cost of Heroku (using that would be particularly bad given that we're a cloud-hosting company with effectively free server resources).
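For those small internal apps, a Dokku deployment is roughly as simple as this (the app name and host are made up for illustration):

```shell
# On the Dokku host: create the app (the name is made up for illustration).
dokku apps:create internal-tool

# On a developer machine: add the Dokku host as a git remote and push to deploy.
git remote add dokku dokku@dokku.internal:internal-tool
git push dokku master
```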

The first thing we played with was Deis, but it never seemed quite stable enough for what we needed - running our business-critical website and API. We took an active interest in Rancher before it was as tightly integrated with Kubernetes, but decided that we'd prefer to stick with a regular Heroku-style PaaS - the simple "git push to deploy" was too attractive to give up.

We installed Flynn and gave it thorough testing - deploying apps to it over and over again, rebooting the underlying hardware - and it was always fine. So we switched to it, and it has continued to run both www.civo.com and api.civo.com for most of their lives. However, over time things started to break (it now thinks we don't have any MySQL databases, even though they're still accessible) and we lost confidence in it after trying to get help with the issues in IRC.
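For context, Flynn kept exactly the git-push workflow we were after; a deploy looked roughly like this (the app name is illustrative):

```shell
# Inside the app's repository: register it with the Flynn cluster
# (the app name is illustrative) and push to deploy.
flynn create www-civo-com
git push flynn master

# Scaling web processes across the cluster is a one-liner.
flynn scale web=3
```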

The straw that broke the camel's back

Our Flynn cluster suddenly decided to stop accepting new deployments. Fortunately, since that point we've had no critical issues that needed deploying. The only real problem was that our TLS/SSL certificate was imminently going to expire and we suddenly had no way of putting the new certificate live. We worked around this for a week or so by putting a simple but high-performance TLS-terminating proxy in front of the platform, but this didn't play 100% nicely with Flynn - one side effect being that Flynn thought all requests were coming from a single IP (that of the proxy). Some users noticed this, but there was literally nothing we could do about it until we got to our third iteration.

Third iteration - Kubernetes

This brings us to now. We've been playing with Kubernetes for a few months and have been impressed by its stability and feature set. I'm personally a big fan of Google's Site Reliability Engineering book by Betsy Beyer, Jennifer Petoff, Chris Jones and Niall Richard Murphy, which describes Google's early adventures with and usage of a similar platform to Kubernetes (really a forerunner of it, called Borg).

Using this platform requires more input than the single "git push" based deployment we've been used to, but that also gave us the opportunity to build better processes around Continuous Delivery. Using GitLab's CI/CD we built the ability for our source control server to store the changes to our source code, build them into Docker containers, deploy them automatically to a staging environment and, with a single click, deploy them to production.
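As a rough sketch, the commands run by the pipeline stages look something like this (the registry, image and deployment names are placeholders rather than our exact configuration):

```shell
# Build stage: build an image tagged with the commit SHA and push it to
# the registry (names are placeholders, not our real configuration).
docker build -t registry.example.com/civo/www:$CI_COMMIT_SHA .
docker push registry.example.com/civo/www:$CI_COMMIT_SHA

# Deploy stage: roll the new image out to the staging deployment...
kubectl --context staging set image deployment/www www=registry.example.com/civo/www:$CI_COMMIT_SHA

# ...and the same command runs against production, behind a manual
# "click to deploy" gate in GitLab.
kubectl --context production set image deployment/www www=registry.example.com/civo/www:$CI_COMMIT_SHA
```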

Our experiments with Kubernetes generally went flawlessly. We had one problem where we intentionally deleted one of the Kubernetes nodes from the cluster and created a replacement to insert, but it wasn't created properly (meaning it couldn't resolve DNS correctly and therefore had flaky communications with the rest of the cluster). After we deleted that node and recreated it again, it's been back to flawless.
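The node replacement and the DNS check that exposed the problem were along these lines (the node name and test pod are illustrative):

```shell
# Remove the misbehaving node from the cluster (the node name is illustrative).
kubectl delete node node-3

# After reprovisioning and rejoining it, confirm it registers as Ready.
kubectl get nodes

# Quick in-cluster DNS sanity check from a throwaway pod - this is the
# part that failed on the badly created node.
kubectl run -it --rm dns-test --image=busybox --restart=Never -- nslookup kubernetes.default
```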

So, in the biggest Friday deploy in our history, we put Civo.com live on Kubernetes last Friday evening. From our point of view it's now a dream to manage the cluster, and it's easy for all of our team to automatically see their changes in a production-like environment and then deploy them to production. Hopefully, going forward, this turns out to be a great platform choice for us.

And then...

After keeping an eye on the platform all weekend and resolving a few minor blips that then kept coming back, by Monday morning all hell had broken loose. There were loads of failed jobs in our background processing queuing system, and the Kubernetes apiserver processes were restarting every minute or two (then sometimes "backing off" to give problems a chance to resolve, rather than continually restarting). We determined that it seemed to be one node causing the problem - but the cluster had got into a weird state from when we added it (to test the procedure of expanding the cluster before we went live). After trying to fix it up until about 2pm, we made the decision that it would be quicker just to wipe the cluster, reinstall and redeploy. This took a little longer than expected, but by about 4pm the cluster was back up, freshly installed and all working perfectly again. This time there's no flapping of processes and everything seems really stable.
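The sort of commands we leant on while diagnosing it looked like this (the apiserver pod name is illustrative; kube-system is where these components normally live):

```shell
# Watch the control-plane pods restart and back off.
kubectl get pods -n kube-system --watch

# Pull the logs from the previous, crashed instance of the apiserver pod
# (the pod name is illustrative).
kubectl logs -n kube-system kube-apiserver-master-1 --previous

# Check recent cluster events for clues about which node is misbehaving.
kubectl get events --all-namespaces
```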

So we have some learning to do around operating this platform, but that's what developers love, right?! Learning new technology, scaling systems, etc. Glad to have you along with us on the journey...