Root cause mentality is the idea that when an issue appears within a system, it’s always possible to find a single, underlying reason why, if you just dig deep enough.
Much of the time, this approach to problem-solving makes sense: it feels intuitive, and it leads to efficient solutions. However, it’s a rationale that doesn’t hold up universally, especially when facing problems of higher complexity. When applied to database monitoring, for instance, this logic might lead people to assume that an “ideal” monitoring product should come equipped with powerful root cause analysis: the ability to seek out, identify, and point to the root cause of an issue within the database system. From there, in theory, crafting an effective solution should be easy. Unfortunately, this approach seriously overlooks the actual complexity of modern-day systems. And that oversight can be costly.
When systems reach a certain degree of interconnectivity, root causes can no longer accurately explain how they behave. In fact, in many contemporary systems, events occur that have no single underlying cause. Such is the case for high-performance databases, as it is for human organizations, managerial structures, and ways of distributing responsibility. As powerful as root cause logic might seem, it depends on reductionist thinking and has all the attendant weaknesses.
So: why don’t good monitoring products provide root cause analysis? It’s not because root cause analysis is hard, or because it’s expensive, or because the companies that produce monitoring products don’t know how to do it. In reality, SolarWinds® Database Performance Monitor (DPM) doesn’t do root cause analysis because, for many important problems, there is no root cause.
Complex Systems, From Tech to Human
John Allspaw wrote an excellent blog post about this several years ago. “For complex socio-technical systems (web engineering and operations) there is a myth that deserves to be busted,” he wrote, “and that is the assumption that for outages and accidents, there is a single unifying event that triggers a chain of events that led to an outage. This is actually a fallacy, because for complex systems: there is no root cause.”
Belief in root causes requires a very linear sort of thinking, one that “ignores surrounding circumstances in favor of a cherry-picked list of events” and “validates hindsight and outcome bias.” Engineers tend to gravitate toward root causes because root causes are neat and logical and seem, by definition, efficient. Allspaw goes on to illustrate why this mentality is particularly seductive for tech organizations as a whole. Consider, for example, a team confronted with a major outage:
“This tendency to look for a single root cause for fundamentally surprising (and usually negative) events like outages is ubiquitous, and hard to shake. When we’re stressed for technical, cultural, or even organizationally political reasons, we can feel pressure to get to resolution on an outage quickly. And when there’s pressure to understand and resolve a (perceived) negative event quickly, we reach for oversimplification.”
Oversimplification is tempting because it leads to quick resolution. But if the system you’re attempting to understand is not simple, you risk missing important details by trying to force simplicity upon it.
Importantly, such reductionist thinking doesn’t just affect your view of your systems; it also affects your view of your company structure and personnel. One of the go-to explanations that organizations use to account for a crisis is “human error.” But this explanation discounts the inherent intricacies of people working closely together in complex social systems with shared and dependent responsibilities. This is something that Baron Schwartz has written about, explaining that when companies “search for a root cause [it’s] usually a witch-hunt in disguise. If you think there is really a single cause, you eventually must identify a single person. If you stop short of that, everyone knows the process was a farce. But blaming a person is also a farce. Everyone knows that someone’s being thrown under the bus and that wasn’t the real problem.”
What might be a possible solution? “Identify the combinations of conditions or dysfunctions that jointly caused the observed problem.”
The fact that SolarWinds DPM doesn’t do root cause analysis reflects our fundamental understanding of how databases function and are architected. But it also reflects our understanding of how organizations function on a human level, and how more nuanced logic benefits an organization across the board, from technical relationships to social ones.
If there’s no such thing as a root cause, can we look for something else instead? Yes: a set of multiple, interacting, compounding elements that together add up to a composite result.
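As a toy illustration of that idea (not a description of how DPM works), the Python sketch below models an incident that only occurs when several conditions coincide. The condition names and the incident rule are hypothetical, invented purely for this example. The search looks for the smallest combinations of conditions that jointly trigger the incident and shows that no single condition does so on its own.

```python
from itertools import combinations

# Hypothetical conditions, named for illustration only.
CONDITIONS = [
    "long_running_backup",
    "cache_cold",
    "traffic_burst",
    "missing_index",
]


def incident_occurs(active: frozenset) -> bool:
    """Toy rule (an assumption): the incident needs conditions to coincide."""
    return (
        {"cache_cold", "traffic_burst", "missing_index"} <= active
        or {"long_running_backup", "traffic_burst"} <= active
    )


def minimal_trigger_sets(conditions, occurs):
    """Return the smallest sets of conditions that jointly cause the incident."""
    minimal = []
    # Check small combinations first so supersets of a known trigger are skipped.
    for size in range(1, len(conditions) + 1):
        for combo in combinations(conditions, size):
            candidate = frozenset(combo)
            if any(known <= candidate for known in minimal):
                continue  # already explained by a smaller combination
            if occurs(candidate):
                minimal.append(candidate)
    return minimal


triggers = minimal_trigger_sets(CONDITIONS, incident_occurs)
single_causes = [t for t in triggers if len(t) == 1]

for trigger in triggers:
    print("jointly sufficient:", sorted(trigger))
print("single root causes found:", len(single_causes))
```

In this toy model, removing any one condition from a triggering combination prevents the incident, so each condition is a contributor, yet none of them qualifies as "the" root cause.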
Finding that kind of complex mix of factors is exactly what DPM was designed to do. By starting from the premise that root cause analysis is the wrong tool (because, again, for many important problems there is no root cause to find), DPM aims to provide a more effective, holistic method of database monitoring. We’ve witnessed firsthand how complicated databases can defy traditional cause-and-effect relationships.
For example, there have been cases where customers made very similar changes to their systems and got different results. In one case, a customer made an adjustment and saw CPU usage drop; another customer made the same adjustment and saw it spike. In both cases, CPU was the symptom, not the cause. In both cases, something more complicated was going on, and applying a single fix produced a more complicated outcome. With a monitoring product like DPM, a user can dig deeper and eventually uncover what’s really going on, even if the answer isn’t simple.
Database monitoring is inherently hard, laced with problems that cannot, by nature, be reduced to something simpler. With that in mind, DPM hasn’t been designed to be an automated miracle cure (such a thing doesn’t exist); instead, it helps smart people do amazing, difficult things. Database management is a genuine, pressing concern for every web-based organization on the planet. To help users solve the truly tough issues, we aim to educate our audience and industry in addition to providing our product (our resources are a big part of that effort).
In our experience, if you give even non-DBAs insightful access to their data, they make important discoveries about their systems. We shouldn’t make the mistake of trying to simplify or reduce those systems, or the people who solve the problems inside them. It’s with a combination of knowledge, powerful tools, and the right understanding of how these systems function that otherwise unsolvable problems can be addressed.