When I started working for this company back in 2003 we used to upgrade our production servers pretty regularly every six months. A couple of times we decided to go nine or twelve months between upgrades, and in one very particular instance when we had to replace a core data storage system we went a whopping twenty months between releases.
Today we release to production four times per week (Monday through Thursday).
So, what happened? And perhaps as important – why?
The dark ages.
Looking back at where we were between 2003 and 2012, the reality of our six-month development periods between releases was that very little of those six months was actually spent developing.
The releases required downtime, so they would always take place over a weekend. The dates were communicated a long time in advance so customers could plan around them. Because of the impact to customers and the need to plan staffing for the weekend, changing these dates was something we really didn’t want to do. I think we only ever did it once.
With the release date set in stone, we needed our own internal “code-freeze” date to accommodate the test period. All functionality that was to be in the release had to be finished by then. Now, we all know software companies are great at estimating, and combined with the fact that anything not included in a release wouldn’t be out there until half a year later, this did of course result in massive code churn in the days leading up to code-freeze. A lot of that code was not really finished, but hey – we have the entire test period to make sure it actually works.
Based on our experience the code-freeze needed to be seven weeks before the release date. That was how long it would take to get a successful run of what we called our “systemtest”: basically a huge manual regression test that would take an experienced tester at least four days to run if everything worked correctly. Which it certainly never did on the first try. Even on the second run we would often have blocking errors. In addition to the systemtest we had a long list of smaller manual module tests, ranging from a few minutes to many hours to run. With a limited number of testers available, all developers would also be assigned their share of tests, unless they were busy fixing issues already reported or trying to get our automated tests green again after the frantic commits before code-freeze. Typically, towards the end of the seven weeks we would have run three or perhaps four iterations of the systemtest; the last run would either succeed outright, or at least the fixes for the last errors discovered could be verified without another full run. All issues from the module tests would be fixed and verified. And so we released.
Of course, discovering all issues even with rigorous manual testing is practically impossible, so inevitably the first three to four weeks following a release would consist of firefighting and hotfixing.
So take six months, then subtract seven weeks of testing and four weeks of hotfixing. And on top of that, releasing every six months did of course not mean we deployed no new code to production in between. Critical bugs had to be fixed, which meant a very cumbersome manual process of backporting the fix to an old codebase, then building, testing and deploying it – using a different deployment method than the main release, just to make things easy. So yes, our six months of development consisted of precious little new development.
The biggest bang.
We had realized this for some time, but hadn’t dared do much about it, since the test period was in fact the only safety net separating us from proper chaos. The catalyst for change came after our second 2012 release, when we had to replace one of our core data storage systems. The change was so critical to all our functionality that we couldn’t possibly deliver it until we were sure the implementation was complete and working. And with our working and deployment processes the way they were, during this period we couldn’t deliver much more than a few bugfixes without incurring a lot of pain, due to the manual nature of providing between-release changes.
So we went twenty months without significant feature updates. Which is of course bad, but it might also have been a blessing in disguise. Since we wrote a lot of new software this time, as opposed to just poking around in a legacy codebase, this was a golden opportunity to write good automated tests, at least for all the new code. At the same time we also worked on improving the automated tests of the old code. So when this release finally happened in 2014, after one final manual test period, we decided that from then on there would be no more manual test periods. Nor would we backport targeted fixes any longer: if a critical error was discovered, the fix would be delivered along with whatever other new code had been deemed production ready. And with that we were suddenly doing continuous delivery. At first with regular releases once per week – and a few irregular ones for critical issues.
That sounds a bit easy, and perhaps even like a huge risk to take. However, we had been quite busy preparing for it for the last twenty months, so even if it felt like ripping away our safety net in one fell swoop, there was fortunately a new one already in place.
Work, work, work.
We started taking our automated tests seriously. We had gradually started doing so already, going from only running and fixing them during the test period in 2003 to monitoring them closely throughout the year by the time 2012 came around. Still, especially with a long time to go before a new release, it was too easy to get away with letting the tests slip into disarray, and difficult to instill in people’s minds that keeping them green should be a priority. Of course, it didn’t help that at one point we had a test battery that took eight hours to run and was notoriously unstable. Even when we split the tests in two and doubled the number of build machines, it was still a huge pain.
A lot of work has since gone into this. From running tests sequentially on a single build server, where the application was set up locally by the build process, we’re now running tests on twenty or so build agents in parallel against an environment which is scaled to handle that amount of load. That environment is deployed to using the same deployment process we use to eventually deliver our software, giving us a way to continually test that as well. In the past, testing the actual deployment was more of an afterthought once the automated tests had finished – and then we would often discover that the deployment process had not been kept up to date with configuration changes made during the build.
We had a lot of challenges with tests that were not written with parallelism in mind. Many relied on other tests to set up their data, for instance. A lot of rewriting ensued, and the result is that the tests that used to take eight hours now run in 15-20 minutes.
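To illustrate the kind of rewriting involved, here is a minimal, hypothetical Python sketch (our actual stack is different, and names like `create_customer` and `DATASTORE` are invented for illustration). The key idea is that a parallel-safe test creates its own data under a unique key instead of assuming some earlier test already set it up:

```python
import uuid

# Minimal in-memory stand-in for the shared test environment's data store
# (hypothetical; the real system runs against a scaled-out environment).
DATASTORE = {}

def create_customer(name):
    """Create a customer under a unique key, so tests running in
    parallel on different agents never collide on the same data."""
    key = f"customer-{uuid.uuid4()}"
    DATASTORE[key] = {"name": name, "orders": []}
    return key

def test_add_order():
    # Self-contained: the test creates its own data instead of relying
    # on data left behind by a previously run test.
    customer = create_customer("parallel-safe-customer")
    try:
        DATASTORE[customer]["orders"].append("order-1")
        assert len(DATASTORE[customer]["orders"]) == 1
    finally:
        # Clean up, so parallel runs do not accumulate leftover state.
        DATASTORE.pop(customer, None)
```

Because each test owns its data from setup to teardown, any number of them can run at once against the same environment, which is what makes fanning the suite out across many build agents possible in the first place.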
Part of the reason we could end up with such a seemingly unmanageable mass of tests is that our codebase has historically been a large monolith where everything ties into everything else. All new modules we’ve made after 2014 have been created as microservices that can be built, deployed and tested independently of the old monolith. Some code that was suited for it has been ripped out of the monolith into separate microservices. This is a major improvement, especially for those who work only on new code, as building and testing a much smaller piece of code is typically both faster and more stable.
In addition to our automated tests, we also had to do something about our deployment process. What we had was a large Windows installer package that required manual intervention and would make all functionality on a server unavailable for quite some time, no matter how small the change being deployed was. This was a big part of why we couldn’t regularly deploy more than once a week at first. Even at that schedule, I’m pretty sure our Operations team was quite fed up with clicking through installers and manually taking servers in and out of load balancer groups by the end of 2014. We needed something easier with more automation, and we needed to support zero-downtime deployment in normal day-to-day scenarios.
Rather than installing everything in one large package, we went with an approach where we split the system up into logical modules and let each of those modules handle its own installation. We ended up selecting Octopus as a new tool for deployment, which is a very good fit for the On-Demand SaaS sites we control ourselves. It has given us some more challenges when it comes to delivering our software out of the house, as you can see here. In our case Octopus takes NuGet packages, which is what all our different modules produce, unpacks them and runs any PowerShell deployment scripts contained within. Some code was changed to allow existing long-running processes to finish even after a new version is deployed. Combined with an effort to always be backward compatible, this now lets us do zero-downtime deployments at the click of a single button. Except in cases where we have to break backward compatibility, but that is quite rare.
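The overall shape of such a zero-downtime rollout can be sketched roughly as follows. This is a hypothetical Python simulation, not our actual tooling (which drives Octopus and PowerShell); helper names like `drain` and `rolling_deploy`, and the in-memory server state, are invented for illustration:

```python
# Simulated fleet state: server name -> deployed version.
SERVERS = {"web-01": "v1", "web-02": "v1", "web-03": "v1"}
IN_LOAD_BALANCER = set(SERVERS)

def drain(server):
    """Take the server out of rotation; in reality we would also wait
    for in-flight requests and long-running processes to finish."""
    IN_LOAD_BALANCER.discard(server)

def deploy(server, version):
    """Unpack the new package and run its deployment scripts (simulated
    here by just recording the new version)."""
    SERVERS[server] = version

def healthy(server, version):
    """Smoke-check the server before putting it back into rotation."""
    return SERVERS[server] == version

def rolling_deploy(version):
    # One server at a time: drain, upgrade, verify, re-add. The rest of
    # the fleet keeps serving traffic throughout -- which only works
    # because old and new versions stay backward compatible.
    for server in list(SERVERS):
        drain(server)
        deploy(server, version)
        if not healthy(server, version):
            raise RuntimeError(f"{server} failed health check, stopping rollout")
        IN_LOAD_BALANCER.add(server)

rolling_deploy("v2")
```

The design choice worth noting is the health check between deploy and re-add: a bad package takes down at most one drained server instead of the whole site, which is what makes pressing the button on a weekday morning tolerable.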
All this work with tests, deployment and breaking up the monolith wouldn’t be enough without another crucial element: Change of mindset.
All developers now know that when they have committed their code and a release has been made, that piece of code will go into production. Most likely already the next day. Committing something without really knowing whether it works, because it will be tested later, is no longer an option. All steps needed to ensure the quality of a piece of software, be they automated or manual tests, must be performed by the team responsible for it before that software is considered finished and delivered to be released. Everyone must take ownership and follow their own code all the way into production. And this applies not only to your own personal changes: everyone on a team must care about all changes to a module that team is responsible for. This also means that every module must have a defined team that owns it, and that team must possess a combined skillset that lets it take the module all the way from idea to running successfully in production.
Now that we’ve done all this work, what do we get out of it?
For starters, we no longer have that period of close to three months where everyone is first stressed because of an arbitrarily selected code-freeze date, then rushing to get everything tested and fixed during the test period, and then firefighting afterwards.
We no longer have code that lies dormant for months awaiting a test period, only to poke its buggy head up when the developer has forgotten what that code was all about. Developers get to finish what they do when they do it, and clients get the changes they have requested when those changes have been implemented, rather than months later.
If an error starts occurring right after a deployment it is a whole lot easier to figure out which code change has introduced the error when you have only deployed one day’s worth of changes since the last deployment rather than six months of work.
We have a lot more monitoring in place now to quickly spot any issues that happen to creep all the way into production, and the operations team is much more involved in the development process.
Are there no problems at all with this, then? Well, inevitably errors will slip through the cracks. And when they do, and you deploy all the time, it will appear to customers as if you’ve broken something – again – and that the system is less stable than it used to be. Although the number of errors has gone down – and it has – customers are more likely to see the issues as disruptive and annoying than they did when a much higher number of issues happened right after a release, probably because that was largely considered to be expected. So it is not good enough to simply do better; we have to do a lot better. This means it is important to learn from mistakes and at least ensure the same types of errors are not repeated. And to make sure you have the tools in place to discover issues quickly, before most customers do.
Are we there yet?
So after three years of continuous delivery, are we done? Not quite – we have had weekly meetings discussing the state of our delivery and deployment process all the way, and for some reason the task list doesn’t seem to get any shorter.
We still have to work on the stability of our automated tests. Unstable tests not only steal a lot of time by forcing us to re-run them when there wasn’t actually a problem; it is also really important that everyone trusts the tests and cares about their status. The more brittle they are, the easier they become to disregard.
We also have a challenge with our old monolithic structure. As mentioned, the tests alone for this large part of our codebase take upwards of 20 minutes even after a lot of optimization, and on top of that there is compilation and deployment, which means the build and test pipeline for the monolith is much heavier than we’d like. And we still make too many changes to this monolith, even though it is supposed to be mostly legacy code. We need to keep chipping away and get more of the actively maintained code into separate microservices.
Also, why stop at deploying once per day? Well, at the moment we run some tests during the night, outside of the build and test pipeline. This prevents us from delivering until we have manually verified in the morning that those tests ran OK. So really, our continuous deliveries are not continuous, yet. We have a plan to get rid of these nightly tests, though, which will take us closer to true continuous delivery – and one day perhaps even continuous deployment, where operations won’t even have to press the one button. If we keep improving our processes and mindset, then perhaps in just a few more years we’ll have built up enough trust for that.