As most of you will be aware, we planned an OpenStack upgrade from Mitaka to Newton for Monday 10th July that, in the end, had to be postponed. With this blog post we want to provide you with some more detailed information about what happened, including:

  • Why we chose to abandon the deploy
  • Insight into how we deploy
  • Steps we are taking to improve this process

A bit of background

Our OpenStack platform deployment is managed in part by Ansible, or more specifically OpenStack Ansible. We use Ansible and OpenStack Ansible extensively; they allow us to automate the installation, configuration and management steps for a large part of the deployment process, and so far they have worked well for us. This automation lets us deploy far more consistently and with little to no manual intervention, which in turn reduces mistakes introduced by human error. It also allows us to create deployment formulas (playbooks, in Ansible terminology) that we can apply over and over to multiple environments and then test whether the changes were successful.
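
To give a flavour of what these playbooks look like, here is a minimal, hypothetical sketch of the kind of task they contain. The play, inventory group and package names are illustrative only and are not taken from our real deployment repositories.

    # upgrade-identity.yml -- illustrative sketch only; our real playbooks come
    # from the OpenStack Ansible project plus our own wrappers around it.
    - name: Upgrade the identity service across an environment
      hosts: keystone_all            # hypothetical inventory group
      become: true
      tasks:
        - name: Ensure the keystone package is at the target release
          apt:
            name: keystone
            state: latest
          notify: Restart the identity service
      handlers:
        - name: Restart the identity service
          service:
            name: apache2            # keystone is commonly served via Apache
            state: restarted

Because a playbook like this is driven entirely by the inventory it is pointed at, we can run the same formula against staging first and then, unchanged, against production.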

Before we make any changes to our production platform, whether that is patching libraries, OS upgrades, package installation or anything else, we first configure a staging or test environment to be as production-like as possible. As much as we'd like to, we unfortunately cannot make staging exactly the same as production. We are continually iterating on the process and developing tools and techniques to close that gap, i.e. to bring staging ever closer to the live production environment, but with each change there are inevitably areas that diverge. Even slight differences can introduce show-stopping bugs, and with an ecosystem as large and complex as OpenStack, differences are easy to introduce.

What happened on the day?

Prior to the production release we had configured and tested our staging environment to mimic production, performed the upgrade there and fixed any bugs that came up along the way. We were then ready to apply the same changes to our live environment. However, when we came to deploy to production we started getting errors that we had not previously encountered, and after debugging the issue we found that the state files we use during deployment had become corrupt and were missing some data. This meant that at first we were unable to deploy at all, and subsequently we were not confident we could deploy safely without further investigation. Because of the corrupt state we could not guarantee the integrity of the deployment or the safety of user data at the time, and therefore abandoned the upgrade. In any deployment scenario we will not proceed unless we are confident that our users' instances and data are safe and that any potential outage will not last for an extended period (multiple hours or longer). We have since remedied the issue and are now ready to deploy with confidence.

How are we going to mitigate this in the future?

Every deployment is a learning experience for us. Even after a successful deployment, we take time to consider what went well, what didn't go well and how we can improve going forward; the lessons we learn in each iteration allow us to make the process better for the next. From this deployment we have learned:

  • Deployment state is vulnerable and therefore needs to be checked for integrity before, during and after deployment. We have created tooling to snapshot state into source control and to verify that snapshot against the live cluster at deployment checkpoints (see the sketch after this list).
  • Deployment windows and dates must be kept where possible. We have added further pre-flight checks and dry runs to ensure that deployment tooling is fully operational and ready to go during a given window.
  • Deployment environments are susceptible to corruption and degradation. We will now create automated, repeatable, containerised environments that can be fully tested in and of themselves as a critical part of the deployment process.
  • Test coverage must be automated end to end. We will build fully automated test pipelines covering Tempest as well as functional and non-functional test suites.
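
As a concrete illustration of the first point, the snippet below sketches the kind of integrity check we now run before touching anything. It is a minimal sketch rather than our exact tooling: the snapshot location and play layout are assumptions for the example, and the state file path shown is the conventional OpenStack Ansible one.

    # check-deploy-state.yml -- minimal sketch of a pre-flight state check.
    - name: Verify deployment state matches the committed snapshot
      hosts: localhost               # run from the deployment host
      tasks:
        - name: Checksum the live inventory/state file
          stat:
            path: /etc/openstack_deploy/openstack_inventory.json
            checksum_algorithm: sha256
          register: live_state

        - name: Checksum the snapshot held in source control
          stat:
            path: /opt/deploy-snapshots/openstack_inventory.json   # hypothetical location
            checksum_algorithm: sha256
          register: snapshot_state

        - name: Stop the deployment if state is missing, corrupt or has drifted
          assert:
            that:
              - live_state.stat.exists
              - live_state.stat.checksum == snapshot_state.stat.checksum
            msg: "Live deployment state does not match the committed snapshot"

Running checks like this before, during and after a deployment window, alongside dry runs of the playbooks themselves with Ansible's --check mode, is what lets us either proceed with confidence or stop early when something looks wrong.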

Of course, we'd appreciate your feedback on how we could improve the process (especially with regard to communicating with you), and we are grateful for your understanding.