Monitoring, Analytics, Diagnostics, Observability, and Root Cause Analysis

December 3, 2017

Monitoring is a hopelessly overloaded term in tech culture. The term now carries decades of inaccurate and imprecise use. The result is that several people can be engaged in an earnest conversation about monitoring and, despite efforts to get each other to see what they mean, remain on totally different wavelengths. I know, because I’ve seen it happen many times. It’s amazing how many times I’ve seen people frustrated with each other because they mean different things when talking about these words.

I was speaking with some folks recently about monitoring, diagnostics, analytics, observability, and root cause analysis. I draw distinctions among these topics and my audience was not clear on why. They asked me to elaborate, and I told them I’d write it up. This blog post is the result.

TL;DR: Monitoring is the process of observing systems and testing whether they function correctly. Analytics is the process of turning data (usually behavioral data) into insights. Observability is the property of a system that supports analytics. Diagnostics is the process of determining what’s wrong with a system, and also relies on observability. Root cause analysis is corporate mumbo jumbo.

Monitoring

Monitoring is one of the most lamentably mangled terms in tech. It’s so simple, but it’s ended up being stretched to cover a very large scope of things that, if you think about the root meanings of the words, are just not monitoring. Rather than talk about what it’s not, though, I’ll talk about what it is, or what it properly should be if it hadn’t been convenient to market it as something else for the last three decades.

Monitoring is the activity of observing a system in the present tense and testing its behavior against an existing definition of correct or acceptable. If the system’s behavior is within bounds, it is healthy; if it is out of bounds, it’s unhealthy. The activity of monitoring includes at least exposing the system’s status to any person or other system that may wish to observe it, and possibly actively communicating the status.

It may not be obvious, but monitoring doesn’t have to be automated. A human can monitor a system just as well as a monitoring system does. Monitoring and monitoring systems are separate. Monitoring systems are often assumed to do so, so much more: take action to remediate, store historical data, perform capacity planning, manage notifications and escalations, etc. These (and literally a score of other things) are often convenient to have within a single tool, but that does not make them monitoring.

There’s a special place in my heart for the term proactive monitoring. Some people think this means the monitoring system is proactively configured to look for a symptom, e.g. “I set up proactive monitoring to tell me when the disk is 95% full.” (That just sounds like monitoring to me, no adverb needed.) Others think proactive monitoring should mean the ability to detect problems before they happen, so they can act proactively to prevent them. I haven’t seen that work well, and usually a bit of clear thinking about a “proactive monitoring system” shows it isn’t proactive at all.
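To make the definition above concrete, here is a minimal Python sketch of monitoring in that narrow sense: observe the system now, test the observation against a predefined bound, and expose the resulting status. The names `Status`, `observe_disk`, and `check_disk` are hypothetical, and the 95% limit is simply borrowed from the disk example; this is an illustration under those assumptions, not how any particular monitoring tool works.

```python
import shutil
from dataclasses import dataclass


@dataclass
class Status:
    """The exposed result of one check (hypothetical structure)."""
    healthy: bool
    detail: str


def observe_disk(path: str = "/") -> float:
    """Observe the system in the present tense: fraction of disk in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def check_disk(used_fraction: float, limit: float = 0.95) -> Status:
    """Test an observation against an existing definition of acceptable.

    The 0.95 default mirrors the "disk is 95% full" example and is an
    assumption; reaching the limit counts as out of bounds.
    """
    healthy = used_fraction < limit
    detail = f"disk {used_fraction:.0%} full (limit {limit:.0%})"
    return Status(healthy=healthy, detail=detail)
```

Calling `check_disk(observe_disk())` yields a status that a person or another system can read. Note what is absent: no history, no remediation, no escalation. Those may be handy in the same tool, but they are not part of the check itself.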

Analytics

Analytics is the activity of turning data into answers. Interestingly, many analytics systems are ridiculously flexible and capable, and can do nearly arbitrary stuff to data. Think Tableau, Looker, and the like. Monitoring systems often have a very limited subset of analytics capabilities by comparison. For some reason, people I know have objected to applying the term “analytics” to the process of learning about systems; they think analytics should be relegated to user-behavior analytics, such as examining daily active users and the like. I don’t share this view.

Diagnostics and Root Cause Analysis

Diagnostics is the process of figuring out what’s wrong. The word comes from diagnosis, which has Greek roots (you probably already know what gnosis means!). More broadly, diagnostics is the activity of figuring out what caused a situation. Both experience and logic tell me there is always an intersection of several causes, each necessary but only jointly sufficient. There is no root cause. That’s why root cause analysis is a fallacy and I don’t talk much about it.

Should monitoring systems be capable of diagnosis? Your answer to this probably depends on your context. In some contexts, it’s expected that monitoring systems have a lot of data, current as well as historical, and that they are built to support the process of interrogating that data and arriving at answers about causes. But this is not universally true. I don’t think anyone would insist that Pingdom is not a good monitoring system because you can’t perform diagnostic activities with it. The same applies to Nagios.

This is why I draw a sharp distinction between monitoring and diagnostics. Monitoring is a relatively simple, tightly bounded scope of activities and use cases. Tools that can do massively complex things like diagnostics have often been called monitoring systems, but that’s just because words are hard.

Observability

Observability is the new kid on the block. I wrote previously about the difference between monitoring and observability. In short: observability is a property of a system, similar to testability, operability, usability, etc. A system that’s observable is instrumented and emits telemetry, so systems that capture that telemetry can be used to interrogate the system about its behavior and derive answers (i.e. an observable system supports analytics and diagnostics).
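As a rough illustration of observability as a property, the Python sketch below instruments a function so that it emits telemetry, then asks a question of the captured telemetry afterwards. Everything here is hypothetical and chosen for brevity: the in-memory `TELEMETRY` list stands in for whatever system captures the data, and `instrumented`, `divide`, and `error_rate` are made-up names.

```python
import functools
import time

# Hypothetical in-memory sink; a real system would ship events elsewhere.
TELEMETRY = []


def instrumented(op_name):
    """Decorator that emits one telemetry event per call (assumed design)."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            start = time.perf_counter()
            ok = True
            try:
                return fn(*args, **kwargs)
            except Exception:
                ok = False
                raise
            finally:
                # Emit the event whether the call succeeded or failed.
                TELEMETRY.append({
                    "op": op_name,
                    "ok": ok,
                    "seconds": time.perf_counter() - start,
                })
        return inner
    return wrap


@instrumented("divide")
def divide(a, b):
    return a / b


def error_rate(events, op_name):
    """Interrogate captured telemetry: fraction of failed calls for one op."""
    relevant = [e for e in events if e["op"] == op_name]
    if not relevant:
        return 0.0
    return sum(not e["ok"] for e in relevant) / len(relevant)
```

The instrumentation is what makes `divide` observable; `error_rate` is a small piece of analytics that relies on it. Neither one is monitoring, since nothing here tests behavior against a definition of acceptable.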

In Conclusion

This isn’t the first time I’ve written and spoken about these topics. The reality is that all people always have unexamined, unconscious assumptions: they know these words to have specific meanings, and they often don’t know other words at all. (Observability is a good example. Many people have never heard of it.)

Finally: why am I such a curmudgeon about terminology and meaning? Why do I insist on splitting hairs? Because conversations are like computers: garbage in, garbage out. Life’s too short for arguing about who’s on first. I value clear communication, but naturally I know and accept that communication is a messy, imprecise, human activity. It’s the speaker’s job to make sure the audience understands!

Baron Schwartz

Baron is a performance and scalability expert who participates in various database, open-source, and distributed systems communities. He has helped build and scale many large,…