What causes production application outages? Or better yet, WHO causes production outages? If you follow enough thought leaders on Twitter, you might hear about "contributing factors" or "root causes," although the latter has definitely fallen out of favor: outages are too complicated for any single "root cause" to explain them. Outages stem from many factors, such as decisions made by executives and the business about where to invest (or not invest), architectural choices made by engineers at the company, and even individual decisions made while balancing product needs, business needs, and technical requirements.
Over the past six months at CloudTruth, we have been chatting with numerous companies, from early-stage startups to large established enterprises. We've spent time learning about their outages and what some of the contributing factors have been. The main question we are trying to answer is how often a change to a configuration file causes a production outage. What we learned probably won't shock anyone who has worked in an Operations or DevOps/SRE role. Still, for some companies, 50% or more of their production outages had a "configuration change" as a primary contributing factor.
Why is this the case? Shouldn't making a configuration change be an inherently low-risk change? Obviously, it depends on the change you are making and the system you are making it to. Are you changing memory settings for a database? Will the database need to be restarted to apply those changes? Will your configuration management automatically restart the database when a new configuration change is put in place? You can see how quickly something that seems like a minor "one-line change" can explode into a significant production outage. But why is this the case? How did we end up here? Or have we always been here, and the tools we use today are so advanced that they simply make it easier to go faster and make bad decisions?
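To make that failure mode concrete, here is a minimal Ansible sketch of how a one-line edit to a templated config file can bounce a database on the very next run (the playbook, paths, and service name are hypothetical, not taken from any company we spoke with):

```yaml
# Hypothetical playbook: bumping a memory setting in postgresql.conf.j2 is a
# "one-line change," but it re-renders the template, notifies the handler,
# and restarts the database as part of the same run.
- hosts: db_servers
  become: true
  tasks:
    - name: Render PostgreSQL configuration
      ansible.builtin.template:
        src: postgresql.conf.j2
        dest: /etc/postgresql/14/main/postgresql.conf
      notify: Restart PostgreSQL

  handlers:
    - name: Restart PostgreSQL
      ansible.builtin.service:
        name: postgresql
        state: restarted
```

Whether that restart is a non-event or an outage depends entirely on context the diff itself doesn't show.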
Nearly a decade ago, tools like Chef and Puppet were still in their infancy, and the thought of deploying applications and servers to Amazon Web Services or other cloud providers was entertained only by those willing to gamble on a new way of computing. Back then, the most significant problem most companies had was the time it took to provision individual (often physical) servers. For various reasons – mainly security and software dependency concerns – applications traditionally lived in a "one app to one server" ratio. Of course, this was in no way an efficient way of delivering applications, but it's what we had available to us at the time. Many of the pre-Docker container technologies were too buggy or too complicated for the average operator to have success with.
In those days, many, if not most, applications were deployed as monoliths and scaled vertically by adding more resources like CPU and RAM to your servers. Debugging these applications was, in some ways, more straightforward, and each configuration change to those systems had a limited blast radius.
Ten years later, containers and microservices have taken over as the de facto way to deploy software. Even companies not ready to fully embrace frameworks like Kubernetes can still take advantage of containerizing their applications by running a single app per cloud compute instance. This allows companies to optimize their cloud costs by running exactly the instance type their workload needs. But the downside to this evolution in infrastructure and application deployment is an absolute explosion in the complexity of the systems we as SRE and DevOps folks are tasked with keeping reliable.
One example I like to use when discussing configuration complexity is when Chef moved away from the traditional monolithic Chef repository toward smaller, reusable cookbook components. Early users had stored all of their Chef roles, environments, and cookbooks in a single repository; you sacrificed flexibility for the control that the monolithic repository gave you. At a previous company, I worked to move our single monolithic git repository to a "one cookbook per repo" world, because we wanted to follow what the community was building. But in actual practice, we ended up with hundreds of new repositories. Even with a mature CI and CD pipeline, we struggled to understand the scope of changes because everything was spread across too many repositories for any one human to understand.
Every company I have ever worked at has dreamed of having 100% of its configs stored in a single place. Maybe we started with Puppet and put all the configs there. Then we outgrew Puppet and moved to Chef. We became jaded with Chef and moved everything into Terraform. Now, with Kubernetes, we want to move everything there. The problem is that old configuration management systems never actually die. With a high churn rate among operations engineers, you can end up in a scenario where your company uses ALL of these configuration management tools, and none of the employees currently working there set any of it up; they might be the second or third generation of engineers at that company. So any hope of understanding why "that one application" still needs to be configured with Puppet or Chef, when the rest of the company uses Terraform, has been lost to history.
Provisioning infrastructure doesn't matter anymore; anyone with an Amex can do that in minutes. What really matters is lifecycle management: cascading a configuration change when you provision a new VPC, rolling a database connection string, or rotating credentials everywhere because a laptop was left in a Lyft or Uber. All the things that need to happen from Day 2, or Year 2, into perpetuity. That is where things get weird, and oftentimes folks have a difficult time with it.
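As a small illustration of why "rolling a connection string everywhere" is harder than it sounds, consider a hypothetical service whose credentials live in a Kubernetes Secret, while other copies of the same value may still sit in Terraform variables, Chef data bags, or CI settings that these manifests know nothing about:

```yaml
# Hypothetical manifests: one copy of the connection string lives here as a
# Secret consumed by the Deployment below. Rotating the credential means
# finding and updating every other copy of it across the rest of the stack.
apiVersion: v1
kind: Secret
metadata:
  name: billing-db-credentials
type: Opaque
stringData:
  DATABASE_URL: postgres://billing:CHANGE-ME@db.internal:5432/billing
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: billing-api
spec:
  replicas: 2
  selector:
    matchLabels:
      app: billing-api
  template:
    metadata:
      labels:
        app: billing-api
    spec:
      containers:
        - name: billing-api
          image: registry.example.com/billing-api:1.4.2
          envFrom:
            - secretRef:
                name: billing-db-credentials
```

Nothing in these files tells you where the other copies live; that knowledge usually sits in someone's head.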
Some companies use Consul or etcd as their "single source of configuration truth," but implementing those systems can be an involved process. Many companies may not want to take on their operational burden, or may simply lack the time or technical expertise to succeed with them. Lastly, some companies may foresee the same end state they've seen with other configuration management migrations: a scenario where they never entirely replace the previous system and are now stuck with two (or more) sources of configuration truth.
So, where does this leave us? When it comes to hiring engineers, you can pretty quickly hire someone who can write YAML and use Chef or Ansible. But finding someone who not only knows how to use the tool but also understands how it works and can use it to build what the business needs? That is the critical differentiator between engineers and great engineers. Given what we've heard in these conversations with companies and experts in the space, it's unlikely we'll see any changes in how teams store configuration information. Every team is going to have its own preferences, and who are we as operators to force them to change? This is where CloudTruth can help bring clarity to all the changes happening in your complex distributed systems.
The future of how we manage our configurations isn't the hamster wheel of "rip and replace the old one." It's creating a single record of configuration truth that operators can go to when issues arise; that on-call engineers can use as the first place to understand who did what, when; that lets engineers quickly identify which team or person implemented a recent change and resolve issues without wasting time on troubleshooting; and that lets security and compliance teams ensure only the right people are making appropriate changes. A single record of change is what most companies need to understand what is happening across their systems.