Many organizations invest in high availability and disaster recovery for their key applications. Too many of these organizations, however, forgo the most important aspect of this process: regularly testing the failover process. Whether gripped by the fear of downtime or of dreaded DNS problems, development teams are frequently hesitant to test what they’ve built in the real world. This holds true whether you’re running your workloads in your own data center or in a public cloud with providers such as Microsoft Azure or Amazon Web Services (AWS).
Are you a DBA trying to ensure your databases can stay online during a server failure or OS upgrade? Or a network architect trying to design a WAN to tolerate the loss of an ISP? It can be challenging for IT professionals to think beyond their specific focus area when planning for disasters.
Chaos engineering challenges these fears. The aim of this seemingly destructive practice is to help development teams identify vulnerabilities in their architecture that surface during these generated system failures. The practice brings together a cross-functional team to help companies understand the blast radius of applications within their production environment and how they can use automation to fail over seamlessly when outages happen.
A Brief Chaotic History
As major web companies built larger-scale distributed systems in the late 2000s, fault tolerance demanded more of their attention. The issue became even more pronounced as workloads migrated into the public cloud, which in its early days frequently suffered transient failures.
Chaos engineering was pioneered at Netflix in 2010, where engineers developed a service called Chaos Monkey, which would randomly terminate VM instances or containers in the production environment. The disruptions caused by Chaos Monkey were designed to emulate real-world events, such as a server or data center failure. With this tool, the company aimed to ensure the termination of an Amazon Elastic Compute Cloud (EC2) instance wouldn’t affect the overall service experience. The capers of Chaos Monkey forced engineering teams to build more fault-tolerant solutions.
Chaos Monkey builds on the concepts of site reliability engineering (SRE), which includes planning downtime budgets for the services within your architecture. SRE also involves building fault-tolerant, resilient production systems that keep working when other systems or services fail. A simple example of this kind of resilience would be designing a web server to display a static page in the event the database server is unavailable.
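To make that fallback concrete, here is a minimal sketch, assuming a small Flask application and a placeholder database connection string; it isn’t Netflix’s implementation, just one way to serve a static page when the database is unreachable.

```python
# Minimal sketch of the fallback pattern: serve a pre-rendered static page
# when the database can't be reached. The connection string, template, and
# static file names are placeholders.
from flask import Flask, render_template
from sqlalchemy import create_engine, text
from sqlalchemy.exc import OperationalError

app = Flask(__name__)
engine = create_engine("postgresql://app:secret@db.example.internal/catalog")

def fetch_catalog():
    # Normal path: read the product catalog from the database.
    with engine.connect() as conn:
        rows = conn.execute(text("SELECT name, price FROM products"))
        return [dict(row._mapping) for row in rows]

@app.route("/catalog")
def catalog():
    try:
        return render_template("catalog.html", products=fetch_catalog())
    except OperationalError:
        # Degraded path: the database is down, so fall back to a
        # static page instead of returning an error to the user.
        return app.send_static_file("catalog_fallback.html"), 200
```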
In 2011, Netflix took Chaos Monkey a step further by introducing a toolset called The Simian Army, which could inject more complex failures into applications beyond the loss of a VM or container. The new toolset allowed further improvements to infrastructure and reductions in outages caused by transient failures.
Netflix made Chaos Monkey an open source project in 2012, and in 2014 introduced the role of Chaos Engineer. One of the goals of chaos engineering is to limit the impact of a single service’s failure on other services, otherwise known as the blast radius.
As chaos engineering continues to evolve, more vendors have entered the space, such as Gremlin. AWS also offers a built-in fault injection simulator, allowing customers to assess the impact of outages in their environments. System failures and gremlins, large and small, happen in organizations regularly. The benefit of chaos engineering is you can discover and eliminate single points of failure, build muscle memory around incident response, and improve overall system resilience.
How Does Chaos Engineering Work?
Chaos engineering focuses on designing systems to be more resilient, taking advantage of cloud-native platforms like Kubernetes (which are already fairly robust) and ensuring they have even fewer single points of failure. Chaos engineering also can involve testing distributed computer systems to understand how they tolerate unexpected disruptions. While distributed systems may make you think of large-scale Hadoop clusters, a distributed database or even a basic web server connecting to a database can be considered a distributed system. Testing for failures, even in these basic systems, can allow your organization to build more robust applications.
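As an illustration, here is a rough, Chaos-Monkey-style sketch (not Netflix’s actual tool) that terminates one randomly chosen EC2 instance that has opted in via a hypothetical chaos=enabled tag. It assumes boto3 is installed and AWS credentials are configured; the region and tag names are examples.

```python
# Terminate one random EC2 instance that has opted in to chaos experiments,
# then let your monitoring show whether the service tolerated the loss.
import random
import boto3

REGION = "us-east-1"                      # example region
TAG_KEY, TAG_VALUE = "chaos", "enabled"   # hypothetical opt-in tag

def run_experiment():
    ec2 = boto3.client("ec2", region_name=REGION)
    resp = ec2.describe_instances(
        Filters=[
            {"Name": f"tag:{TAG_KEY}", "Values": [TAG_VALUE]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    candidates = [
        inst["InstanceId"]
        for reservation in resp["Reservations"]
        for inst in reservation["Instances"]
    ]
    if not candidates:
        print("No opted-in instances found; nothing to do.")
        return
    victim = random.choice(candidates)
    print(f"Terminating {victim} to verify the service tolerates losing it")
    ec2.terminate_instances(InstanceIds=[victim])

if __name__ == "__main__":
    run_experiment()
```

Scoping the experiment to an explicit opt-in tag keeps the blast radius limited to workloads whose owners have agreed to participate.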
And in organizations that don’t develop all their applications in-house, understanding your failure mechanisms can significantly reduce the downtime associated with incidents. At the very least, chaos engineering helps you learn the best order in which to recover systems and how many you can recover in parallel. Such simple insights can be of tremendous help.
Basic Principles of Chaos Engineering
While the technology large firms like Netflix and AWS use to implement chaos engineering can be complex and overwhelming, organizations can still embrace two basic principles of chaos engineering to help improve system resiliency:
- No system should ever have a single point of failure. A single point of failure refers to a component or application whose failure could lead to an entire system failing and cause several hours (or more) of downtime.
- Never be 100% confident your systems don’t contain a single point of failure. You need an effective way to ensure you have no single point of failure. (This principle led Netflix to develop Chaos Monkey.)
In the real world, many SysAdmins and DBAs don’t own the source code for all of their applications and don’t necessarily have end-to-end control of their platforms the way Netflix or AWS engineers might. However, this doesn’t mean they can’t embrace chaos testing as part of their engineering process. I’ll explain how.
Chaos Engineering vs. Chaos Testing
In my experience with disaster recovery at firms large and small, implementing chaos engineering often starts with a commitment to chaos-testing what happens in the event of major failures.
While this may seem obvious, even the most rudimentary disaster recovery testing requires additional hardware (or cloud services) and, more importantly, enough IT staff to dedicate some team members to these testing efforts. If your IT team is working at maximum capacity merely keeping the lights on, you won’t have the bandwidth to properly validate your failover process.
However, as you develop complex systems with many dependencies, chaos experiments can help you better understand how your systems will respond to service failures. Typically, you capture operational metrics on the steady state operations of your systems.
As you induce failures, even in well-engineered systems, it’s important to understand how dependent systems can be affected. This means capturing metrics such as the latency involved in failing over to other nodes or possibly to another data center, so you can examine the performance of the entire system holistically.
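For example, a very simple way to capture such a baseline is to probe a service endpoint and record latency and error rate before and during the experiment. The sketch below assumes the requests library and a placeholder health-check URL; in practice, these numbers would feed into your monitoring platform rather than being printed.

```python
# Measure p95 latency and error rate for a service endpoint, first in steady
# state and again while a failure (e.g., a database failover) is induced.
import statistics
import time
import requests

def sample_latency(url, samples=60, interval=1.0, timeout=2.0):
    latencies, errors = [], 0
    for _ in range(samples):
        start = time.monotonic()
        try:
            requests.get(url, timeout=timeout)
            latencies.append(time.monotonic() - start)
        except requests.RequestException:
            errors += 1
        time.sleep(interval)
    # 95th percentile needs a reasonable sample size; otherwise report None.
    p95 = statistics.quantiles(latencies, n=20)[-1] if len(latencies) >= 20 else None
    return {"p95_seconds": p95, "error_rate": errors / samples}

baseline = sample_latency("https://orders.example.internal/health")
print("Steady state:", baseline)
# ...start the chaos experiment here (e.g., fail over the primary database)...
during_failover = sample_latency("https://orders.example.internal/health")
print("During failover:", during_failover)
```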
Chaos Testing Best Practices
While many organizations, especially in regulated industries, have executed disaster recovery testing through the years, it can sometimes get neglected as engineers are in limited supply and mainly focus on their day-to-day tasks. Performing chaos testing requires dedicated engineering resources and a budget for hardware and cloud resources. This is why I’ve found simplifying testing to a five-step process can help engineers more easily introduce chaos into their solutions:
- Ensure your current system is stable and define a steady state level of uptime. You’ll need to have monitoring and tracking in place to address any recurring failures. This may include a temporary freeze on changes to ensure the system's stability.
- Once the steady state is proven, hypothesize that it will continue. This means you assume the system will hold its steady state as you start to introduce chaos into the equation.
- Choose the location for testing. In the early phases of testing, you may choose to build an isolated testing environment to mitigate large risks to your production environment. And as the process matures, inducing chaos into your production environment becomes less of a risk and a better test of your overall environment.
- Start introducing chaos. This process is delicate and dangerous, so as mentioned earlier, you may want to start in an isolated testing environment. You want to induce complete failures, like an instance of an application going away or a database server crashing, as well as more intermittent failures, such as flaky network connections or bad sectors on storage devices (see the sketch after this list). The benefit of doing this in production is the process will test your monitoring and alerting systems and help you fully understand how your environment reacts to various failure scenarios.
- Monitor and repeat. While you’ll learn the most in your first couple of attempts at testing (which will help identify your major failure points), continuous testing will contribute to a more robust overall architecture.
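As an illustration of the intermittent failures mentioned in step four, the following sketch uses Linux tc/netem to add latency and packet loss to a network interface for a couple of minutes and then restores it. The interface name, delay, and loss values are examples; run it only with root privileges on a disposable test host.

```python
# Inject intermittent network trouble (latency + packet loss) with tc/netem,
# observe how the system and your alerting react, then restore the interface.
import subprocess
import time

INTERFACE = "eth0"  # example interface on a disposable test host

def inject_network_chaos(delay="200ms", loss="5%", duration_seconds=120):
    add = ["tc", "qdisc", "add", "dev", INTERFACE, "root", "netem",
           "delay", delay, "loss", loss]
    remove = ["tc", "qdisc", "del", "dev", INTERFACE, "root", "netem"]
    subprocess.run(add, check=True)
    try:
        print(f"Injected {delay} delay / {loss} loss on {INTERFACE}")
        time.sleep(duration_seconds)   # watch dashboards and alerts meanwhile
    finally:
        subprocess.run(remove, check=True)  # always restore the network
        print("Network restored")

if __name__ == "__main__":
    inject_network_chaos()
```

Wrapping the cleanup in a finally block matters as much as the injection itself: a chaos experiment should never leave the degraded state behind by accident.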
This process of building successful chaos engineering models can seem daunting to IT professionals. However, given the importance of infrastructure and systems in the modern world, where downtime almost always results in lost revenue, IT should be treated like the profit (and profit-protecting) center it is.
Benefits and Challenges of Chaos Engineering
The benefits of introducing chaos experiments into your organization are relatively obvious: you can build confidence in your infrastructure and your applications. Whether you use a cloud computing provider like AWS or Microsoft Azure, or run your own data center, you can use chaos to help ensure the resiliency of those systems. It can also lead your development teams to adopt more fault-tolerant runtimes like Kubernetes, which help maintain the desired state of your applications with minimal downtime.
The challenges of chaos engineering are like those of any other significant IT effort: it can require sizeable organizational involvement to implement. And since it profoundly affects every layer of a given IT organization, I’ve found it often needs strong management support at every level and enough funding to accomplish it successfully. During my time as an architect at a large telecommunications firm, I knew it could be challenging to get downtime on highly available production systems. Although we deployed SQL Server failover cluster instances, I still had to negotiate with managers to get thirty seconds of downtime to ensure critical security patches were applied.
Many of the concepts introduced in chaos engineering tie into the broader practice of SRE, which has now been adopted by a large number of hyperscale providers. While SRE could be its own topic, one facet worth mentioning is the idea of a downtime budget. When establishing the service level objective (SLO) for a given application or service, you should always include how much downtime you expect the service to have.
The downtime budget should not be zero. Despite what numerous businesses may say, the uptime of a given application is highly dependent on the resources and budget allocated to it. A common consulting experience I’ve seen: you ask a client how much downtime a given app can tolerate, and the business team responds with some excessively small number, like under thirty seconds a month. The consultant then designs a system to meet this goal (which, of course, costs a fortune), and suddenly, the uptime requirements get much more realistic. Even the most robust applications may still need several minutes of downtime a month so patches and service updates can be applied. Where SRE meets chaos engineering is during those planned downtimes: how do the rest of your applications react when one service goes down?
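As a quick sanity check on these numbers, the following back-of-the-envelope helper converts an availability target into an approximate monthly downtime budget (using a 30-day month).

```python
# Convert an availability target (SLO) into an approximate monthly downtime
# budget. A 30-day month is an approximation for rough planning.
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes

def downtime_budget_minutes(slo_percent):
    return MINUTES_PER_MONTH * (1 - slo_percent / 100)

for slo in (99.0, 99.9, 99.95, 99.99):
    print(f"{slo}% uptime allows ~{downtime_budget_minutes(slo):.1f} minutes/month")

# 99.0%  -> ~432.0 minutes/month
# 99.9%  -> ~43.2 minutes/month
# 99.95% -> ~21.6 minutes/month
# 99.99% -> ~4.3 minutes/month
# "Under thirty seconds a month" works out to roughly 99.999% availability.
```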
While starting chaos engineering can seem challenging, you can leverage some of your existing tools as a great starting point and evaluate new tools to help.
SolarWinds Observability Self-Hosted (formerly known as Hybrid Cloud Observability) is built to provide deep, cross-domain analytics and SLO insights into the performance and uptime of your infrastructure, network, and applications. It can also provide automatic discovery and dependency mapping while centrally storing historical data from your environment, so you can more efficiently view system status over time alongside actionable intelligence.
Chaos Isn’t Always a Bad Thing
As you can see, developing robust chaos testing practices can be crucial to having well-functioning IT and failover processes. And you can’t consider your overall testing complete without a way to determine how your production environment handles various hardware and software failures. Bottom line: although it sounds alarming, adding chaos to your IT operations can help you find flaws you wouldn’t otherwise know exist.