"A long time ago in a data centre far, far away....
Our first episode takes us back to the early days of launching Civo, where we had a fledgling Ceph cluster with a handful of nodes, each with a bunch of SSDs. Those were darker times, in the early days of OpenStack.
As time went by the cluster grew, and new, bigger nodes were added along with bigger disks. We learnt about the dangers of CRUSH maps and how everything had to be set up so that Ceph didn’t try writing the 501st gigabyte to a 500 gigabyte drive purely because another one somewhere else had 1024 gigabytes.
Newer versions of OpenStack and Ceph came along and were lovingly squeezed onto our cluster (with some lube and a sledgehammer). At some point along the way, a new little setting appeared that, by default, says “frequently rebalance all the data so that all the disks are the same percentage full”. Of course, we weren’t installing a cluster from scratch, so our cluster was missing that setting :(
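For anyone who hits the same gap: we believe the setting in question corresponds to Ceph's balancer manager module (available from Luminous onwards), which isn't switched on automatically for clusters upgraded from older releases. A minimal sketch of checking and enabling it, assuming a Luminous-or-later cluster:

```shell
# Check whether automatic rebalancing is currently active
ceph balancer status

# Enable the balancer in 'upmap' mode, which remaps individual
# placement groups so OSD utilisation converges on the same
# percentage full across disks of different sizes
ceph balancer mode upmap
ceph balancer on

# Verify how evenly data is now spread across OSDs
ceph osd df
```

These commands need a live cluster with admin access, so treat them as a starting point rather than a recipe - and as ever with Ceph, watch the rebalance traffic it generates on a busy cluster.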
Fast-forward to the present day, and we had a few disks failing (they were all the same age and had all seen roughly the same number of writes).
Then, while the data from those disks was being replicated to other disks, some of those other disks suddenly either filled up or died themselves.
Enter a world of pain: days and nights of rebalancing, removing disks, and rebalancing again… The good news is that about a year before this we’d introduced a new storage platform to our OpenStack cluster, so there weren’t many customer instances left on Ceph to experience any slowdowns.
However, that brings us on to…
So a while ago we’d added a new enterprise storage solution, and now have a bunch of those nodes in our cluster. We’d switched the default over to the new storage system, so no new users were creating instances on the old Ceph cluster - and we have to say, it was performing amazingly!
Great control over performance, reliable, solid! We launched our KUBE100 beta and our customer base has been growing rapidly ever since. Our new storage solution grew with us and kept performing like a dream. But as with all great success stories, suddenly we hit a road block.
Back in August, due to the huge demand for the closed beta, cluster usage suddenly jumped from “we’re fine” levels to “Eeeek! Nearly at error status”.
On this platform, when a cluster is in a green state, all is well. When it’s in warning state it’s still fine. When it hits “error”, it will immediately stop allowing any new volume creation (these include root volumes for instances and Kubernetes nodes as well as attachable volumes).
Existing volumes can be written to (even though they are thinly provisioned), because like a car’s fuel tank - when it says it’s empty, there’s actually a little bit left. So we scaled back some non-essential services, deleted any instances and clusters that we weren’t actually using internally, and then had to reach out to our community to do the same.
We promptly managed to get some new storage nodes into our cluster and expanded our capacity.
Phew! After a stressful two weeks, it finally felt like we could take a few evenings, and maybe a weekend, to relax after the long hours and the stress the team had been absorbing.
If only things had gone that way… Due to a bug that occurred when the cluster went into error state, we found ourselves in a situation where existing volumes were fine, but each of the master Linux distribution volumes that we use to clone customer instances had broken.
No problem, we have simple steps and even some automation to rebuild them. No sweat, no big deal.
Then a couple of hours later, customers started reporting that instances and clusters were failing to launch again. Strange: the volumes were no longer there - OpenStack’s Cinder thought they still existed, but our storage vendor thought they were gone!
Odd, but let’s just create them again. We’d never seen that before (and we hadn’t changed any OpenStack component in a few months, or touched any of Civo’s code that calls OpenStack APIs in this area in even longer).
A few hours later, once again, they’d fallen into the black hole! We recreated them again and realised we needed to find a way of stopping this from happening.
So we left each volume attached to a holding instance - volumes attached to customer instances weren’t affected, and Cinder will still let you clone a volume while it’s attached, to be used for customer instances.
We tested this in Cinder and, exactly as expected, attached volumes couldn’t be deleted. Yet somehow it happened again a few hours later. At this point we were convinced it was an issue with the storage platform, but we were struggling to get to the bottom of it.
Another option we thought of was to take a snapshot of each volume - again, from our experiments, we couldn’t delete a volume while it had snapshots. So once again we set about recreating the master volumes, took snapshots of each, and re-enabled the system so customers could launch instances and clusters.
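As a sketch of the kind of protection we mean - the volume and instance names below are illustrative placeholders, not our real ones - the OpenStack CLI equivalent looks something like:

```shell
# Attach the master volume to a holding instance: Cinder refuses
# to delete an in-use volume, but will still let you clone it
openstack server add volume holding-instance ubuntu-master-volume

# Belt and braces: a volume can't be deleted while it has
# snapshots, until those snapshots are removed first
openstack volume snapshot create --volume ubuntu-master-volume ubuntu-master-snap

# Cloning the protected volume to use as a customer root volume
openstack volume create --source ubuntu-master-volume --size 50 customer-root-volume
```

These commands assume a working OpenStack credential environment, so they're a sketch of the approach rather than a drop-in script.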
Just when we thought it was safe...
On Saturday 5th of September, an outage that should never have happened did. We have four redundant network switches, so if one of them ever has a fault, the others should handle everything. Unfortunately, on that fateful day, one of our switches developed a hardware RAM fault and got stuck. Even though the switches are in redundant pairs, for some strange reason this took out networking entirely. Without knowing it was a RAM fault, we gave it a reboot and it came back into service – problem solved. For 10 minutes. Then it died again.
A nugget of information from the back of one of our engineers’ heads said that if one of these particular switches gets stuck, both need to be rebooted as a pair to get them working again. As these are hardware switches, and we were already in the middle of major network connectivity issues, a quick reboot of the pair seemed the fastest way to resolve it. So we rebooted both and, phew, the world was good. For 10 minutes...
Imagine the scene: people racing to the DC, others in Slack channels or on calls - how can something so plug-and-play bring down our entire network? Well, we got there and found it was a faulty switch. No problem, we have a spare - old one out, new one in… and nothing? Reboots and console checking ensued. In a cruel twist of fate, it turned out the unused spare was dead too! This wasn’t shaping up to be a great weekend.
However, prepared as we always are, we had a spare-spare, which worked fine. We’re still investigating how one switch failure took out our entire network (our upcoming KUBE100 platform definitely isn’t vulnerable to the same sort of failure - we’ve been testing that!), but after a stressful Saturday, normal service resumed.
In the midst of all this, part of the team has still been working on the all-new #KUBE100 platform, which we’ve referenced in our Slack channel and at recent community meetups.
For the new platform, we’re partnering with StorageOS to offer native in-Kubernetes volumes that we attach to instances. The team have already managed to get instance launch times down to an incredibly snappy 30 seconds - and that’s with standard OS images!
We want to get our k3s cluster launch times down to sub-30 seconds, but that will be with custom, dramatically cut down images. So it definitely feels like we’re going down the right route with the new platform.
StorageOS have been very receptive to our input into their upcoming roadmap, and have been awesome at working with us to debug issues and tweak things as we go.
Our partnership feels very much one built for the 21st Century rather than the typical vendor/customer relationship.
So there you have it: the trials and tribulations of running a beta cloud platform. We’ve learned an awful lot so far, and as the team grows, we’re closer than ever to realising the full potential of Civo.
Stay tuned for more info about the new #KUBE100 platform, watch this space.