It's finally time to come out of beta and launch Civo into the world as a production platform! Exciting times, but they come with some big challenges, major changes, upgrades and knock-on effects for all users. Please read on, as this affects you.

Introduction and overview

As valued users of Civo, you will (hopefully) be aware that we have been in beta for some time. This means that while we have been treating it as a fully production-ready platform, there has been a very real risk of outages, errors and other issues, including data loss. Thankfully, outages have been minimal and we've had no data loss over the last few years, even whilst in beta. Now we're getting close to the time when we remove the beta flag and open Civo to the public.

We'd firstly like to thank you all for helping us to improve Civo; without your input and testing we wouldn't have been able to get where we are today. We are now preparing to come out of beta and launch Civo as a full production platform, ready to accept production usage. There is, however, going to be some significant work carried out on the platform to get it to the required standard, and this will affect everyone to some degree.

We have been working hard over the last few months testing the various updates and changes that need to be applied to the underlying OpenStack platform before we come out of beta. We've run through the process many times on our testing hardware and are confident in it. However, with so many moving parts, there are inevitably going to be unforeseen differences between environments that may cause issues of varying severity. When I said "significant" earlier, it was somewhat of an understatement. We essentially need to rebuild the entire platform from underneath itself whilst maintaining the current level of service, with minimal outages and zero data loss. This is by far our biggest challenge to date.

What are we doing?

Operating system upgrade

Firstly, we are going to be upgrading the operating system of every machine in the OpenStack platform - physical, virtual and container - from Ubuntu 14.04 Trusty to Ubuntu 16.04 Xenial. 14.04 goes end-of-life (EOL) in April 2019, so we need to upgrade to the latest long-term support (LTS) release. More pressingly, the versions of OpenStack that we need to upgrade to are only supported on 16.04, so before we can upgrade OpenStack, we need to upgrade Ubuntu.

This is not an easy task and there is no "built-in" upgrade path for it in OpenStack Ansible. The move to Xenial brings a number of changes, the most noteworthy of which from a platform perspective is the switch to systemd, which affects essentially every component of OpenStack. We will therefore be taking services down, upgrading them and redeploying bit by bit, including all of the OpenStack APIs, databases, queues and so on.

OpenStack upgrade

Our OpenStack platform is currently on "Newton" and we need to get to "Queens", which means going through "Ocata" and "Pike". That is a total of three upgrades (Newton -> Ocata -> Pike -> Queens), which will leave us on the latest LTS version of OpenStack.

For more information on OpenStack versions, see: https://releases.openstack.org/

Ceph upgrade

Ceph is currently on "Jewel", which went EOL in July, so we now need to upgrade to "Luminous", the next LTS release.

For more information see http://docs.ceph.com/docs/master/releases/schedule/

How will our users be affected?

During these upgrades our users will be affected in a number of ways. The majority of changes will not affect the runtime of your instances, i.e. your ability to access and use them. You may, however, experience some performance degradation, and there are likely to be periods when you will not be able to modify your instances (create, update, destroy etc).

IO wait time

During the Ceph upgrades, each node in the cluster will have to be taken offline, and this will cause the cluster to go into a warning state while it rebalances its data. This more often than not causes an increase in iowait, so disk reads and writes may be slow. It needs to be done on only a few nodes at a time to prevent the cluster going into an error state and blocking reads and writes altogether. This process means that disk IO will be slow at various times over about a week (although not on all instances at once). You will still be able to use Civo and have full access to your instances during this period; they may just be slow, and as mentioned you are likely to see increased iowait in top or similar tools.
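
If you want to keep an eye on this yourself, top, iostat or vmstat will show the iowait figure directly. Purely as a rough sketch, assuming the third-party psutil package is installed on a Linux instance, you could also watch it from Python like this:

    import time
    import psutil  # third-party package, e.g. "pip install psutil"

    # Sample CPU time percentages and report when the proportion of time
    # spent waiting on disk IO looks unusually high. The threshold is an
    # arbitrary value purely for illustration.
    IOWAIT_WARN_PERCENT = 20.0

    while True:
        cpu = psutil.cpu_times_percent(interval=1.0)
        iowait = getattr(cpu, "iowait", 0.0)  # this field only exists on Linux
        if iowait > IOWAIT_WARN_PERCENT:
            print("High iowait: %.1f%% of CPU time spent waiting on disk" % iowait)
        time.sleep(4)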

API and Dotcom unavailability

Part of the upgrade involves upgrading OpenStack itself and the operating system it is deployed on, which means we need to take down its APIs and internal services. These services are configured to be highly available and resilient, so we should be able to take them down and upgrade them without any loss of service. There may, however, be times during the upgrade when services fail over between one another, causing brief outages of the Civo website and API. There are also likely to be times (which we'll try to keep outside UK office hours) when we need to stop all reads and writes to the APIs while we perform major upgrades to all of the databases in order to preserve data integrity.
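
If you have scripts or tooling that call the Civo API, it is worth making them tolerant of these brief failovers. The snippet below is a rough sketch of one way to do that in Python using the requests library with automatic retries; the endpoint and API key shown are placeholders, so substitute whichever calls you actually make:

    import requests
    from requests.adapters import HTTPAdapter
    from urllib3.util.retry import Retry

    # Retry transient failures (e.g. during an HA failover) with exponential
    # backoff instead of failing on the first error.
    retries = Retry(total=5, backoff_factor=1,
                    status_forcelist=[500, 502, 503, 504])

    session = requests.Session()
    session.mount("https://", HTTPAdapter(max_retries=retries))

    # Placeholder request only - swap in the endpoint and API key you use.
    response = session.get("https://api.civo.com/v1/instances",
                           headers={"Authorization": "Bearer YOUR_API_KEY"},
                           timeout=10)
    response.raise_for_status()
    print(response.json())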

Network unavailability

The networking side of OpenStack is also configured to be highly available and resilient; however, in our experience, when a network node goes down there is a noticeable slowdown in performance and sometimes outages, especially during busy times when there is a lot of network traffic. These changes will therefore be done outside core UK office hours to minimise disruption. This should be the only time during the upgrades that there are any outages on your actual instances, i.e. times when you will not be able to connect to your instances or they cannot serve responses. This will not affect all instances and may have no impact whatsoever on you; however, if your instance is on a busy network node when it is taken down and fails over, you may experience some loss of connectivity while the HA failover kicks in.
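
If you would like to verify from the outside whether one of your instances stays reachable during this window, a simple TCP check against a port you know is open (SSH on port 22, for example) is usually enough. Below is a minimal sketch in Python; the hostname and port are placeholders for your own instance:

    import socket
    import time

    HOST = "your-instance.example.com"  # placeholder: your instance's IP or hostname
    PORT = 22                           # SSH; use 80 or 443 for a web service

    # Attempt a TCP connection every 30 seconds and log the result, which
    # gives a rough picture of any connectivity blips during a failover.
    while True:
        try:
            with socket.create_connection((HOST, PORT), timeout=5):
                print(time.strftime("%H:%M:%S"), "reachable")
        except OSError as exc:
            print(time.strftime("%H:%M:%S"), "unreachable:", exc)
        time.sleep(30)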

Why is this happening? Will it happen again?

We are currently a small, self-funded start-up and as such do not have the luxury of lots and lots of spare hardware and capacity. The ideal way to do this would be to move instances between availability zones / clusters / cells, upgrade part of the platform, then move instances back with zero or negligible performance degradation. As we expand after the beta, we will be adding multiple clusters, cells, availability zones and regions, which will give us more flexibility should we need to do something of this scale again. We have reached a difficult milestone that we must get past in order to grow, but getting past it means we should never have to do anything like this again.

When is this going to happen?

Below is a draft schedule of what is going to happen and when. The dates are subject to change as we do not know exactly how long each upgrade will take, and there may be blockers that push dates back. Equally, we have allowed time for problems, so progress could be quicker should everything go smoothly.

17th Sep - 21st Sep

The first part of the upgrade involves upgrading the operating system of Ceph and of OpenStack's management components, which include its APIs, databases and queues. We will also be upgrading Ceph itself. This will involve reboots and, in many cases, completely rebuilding hosts.

Services affected

  • Ceph
  • OpenStack APIs
  • OpenStack databases and queues

Possible issues

  • Disk slowdown and increase in iowait
  • Inability to make changes to Civo resources (create, update, destroy)
  • Snapshots delayed
  • Service outage on Civo API and dotcom (should be confined to out-of-hours: 18:00 - 06:00 GMT)

24th Sep - 28th Sep

During this period we will be upgrading the compute hosts. This should cause minimal disruption, if any at all, as OpenStack provides "live migrations", which allow running instances to be moved between compute hosts. There is always the possibility of disruption, but it is highly unlikely.
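
For the curious, live migration is something we trigger on the OpenStack side, not something you need to do yourself. Purely as an illustration, assuming the openstacksdk Python library and a configured clouds.yaml entry, the operation looks roughly like this (the cloud and server names are placeholders):

    import openstack  # third-party openstacksdk library

    # Connect using a cloud definition from clouds.yaml (placeholder name).
    conn = openstack.connect(cloud="civo-internal")

    # Find the instance and ask Nova to live-migrate it, letting the
    # scheduler pick the destination host. Because the storage is shared
    # (Ceph), block migration isn't needed; the instance keeps running
    # while its memory and state are copied across.
    server = conn.compute.find_server("example-instance")
    conn.compute.live_migrate_server(server, host=None, block_migration=False)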

Services affected

  • Compute

Possible issues

  • Performance degradation (should be minimal if any)
  • Small possibility of instance reboots (although highly unlikely)

1st Oct - 5th Oct

As mentioned earlier, networking is one of the few components whose upgrade is likely to cause an outage for instances. This work will be done out of hours. This step should take a day or two, but there may be complications and, as such, extra time has been allowed.

Services affected

  • Networking

Possible issues

  • Network slowdown on instances, possible loss of all network connectivity to instances: HTTP, TCP, SSH etc. (should be confined to out-of-hours: 18:00 - 06:00 GMT)
  • Service outage on Civo API and dotcom (should be confined to out-of-hours: 18:00 - 06:00 GMT)

8th Oct - 19th Oct

The final part involves upgrading OpenStack itself. This is an automated but time-consuming process. All services are affected. While there shouldn't be any downtime or slowdown, there is a possibility that things may go wrong, which we will have to fix as and when they arise.

Services affected

  • Ceph
  • OpenStack APIs
  • OpenStack databases and queues
  • Compute
  • Networking

Possible issues

  • Disk slowdown and increase in iowait
  • Inability to make changes to Civo resources (create, update, destroy)
  • Snapshots delayed
  • Service outage on Civo API and dotcom (should be confined to out-of-hours: 18:00 - 06:00 GMT)

Status and updates

During the upgrades we will be updating our status page and Twitter with any outages or issues, so you can stay informed about what is happening.

Thoughts, feedback and concerns

If you have any thoughts, feedback or concerns about the upgrade, please get in touch and we will be happy to discuss them.