5 Monitoring Myths That Deserve to Be Busted
August 4, 2016
There’s no doubt that effective application performance management can be a daunting goal. As businesses have demanded increasingly complex tasks from their technology, the solutions required to keep their systems in top shape need to be ever more insightful and precise. While things like efficiency and uptime are the bottom-line indicators of an app or database’s performance, there’s a variety of potential methods that can be used to achieve those ideals.
As in any other industry, false information proliferates where people are eager for quick, simple solutions. Naturally, we all want the fastest, most direct ways to fortify our APM. But where ambitious ideas can be powerful and positive, they can also be too good to be true or send the wrong message. Whether these involve, for example, overestimating how widely percentiles apply or misunderstanding root-cause analysis, such expectations are pernicious because they lead people to look in the wrong places or misuse the information they have at hand.
We’ve encountered a few of these misconceptions that stand out from the rest — here are 5 monitoring myths that deserve to be busted.
1. Using Percentiles Will Solve the Problems Presented by Averages
Generally, people we talk to have wised up about using averages: they know that averages are pretty poor indicators of what you should be looking at within your dataset. Averages are lousy for monitoring for two main reasons: averages obscure outliers; and, because of those outliers, what you do see is skewed and unrepresentative of typical system behavior.
That’s no secret. However, as people have abandoned averages, they’ve begun migrating toward percentiles, and that creates a new problem. People we talk to now frequently ask about using the p99 of their metrics. Even with percentiles, averages still play a role, and that still skews the information. As Baron Schwartz wrote,
“There’s a big problem with most time series data and percentiles. The issue is that time series databases are almost always storing aggregate metrics over time ranges, not the full population of events that were originally measured. Time series databases then average these metrics over time in a number of ways.”
As you can imagine, this is still a major hurdle. The final verdict, when using percentiles in this way? “The math is just broken,” Baron said. There are a number of better methods to utilize for communicating the information contained in percentiles, such as heat maps, histograms, banded metrics, and sketches. The most important thing is not to simply assume that percentiles are the easy solution and workaround that many people think they are.
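To see why averaging stored percentiles goes wrong, here is a minimal Python sketch. Everything in it is a made-up assumption for illustration: nine quiet one-minute windows, one busy window containing a burst of slow requests, and a simple nearest-rank percentile helper. It is not how any particular time series database computes its rollups; it only shows that the average of per-window p99 values can land far from the p99 of the underlying events.

```python
import math
import statistics


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of the data at or below it."""
    ordered = sorted(values)
    rank = math.ceil(len(ordered) * p / 100)
    return ordered[max(rank, 1) - 1]


def build_windows():
    """Hypothetical latencies in ms: nine quiet windows and one busy window with slow requests."""
    quiet = [[10.0] * 100 for _ in range(9)]
    busy = [10.0] * 9500 + [1000.0] * 500
    return quiet + [busy]


windows = build_windows()

# What a rollup does: store one p99 per window, then average those stored values.
per_window_p99 = [percentile(window, 99) for window in windows]
averaged_p99 = statistics.mean(per_window_p99)

# What we actually want: the p99 of every event that was measured.
all_events = [latency for window in windows for latency in window]
true_p99 = percentile(all_events, 99)

print(f"average of per-window p99s: {averaged_p99} ms")
print(f"p99 of all events:          {true_p99} ms")
```

In this toy dataset the busy window holds most of the traffic, yet it counts for only one tenth of the averaged figure, which is why the two numbers disagree so badly.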
2. An App-Focused APM Solution Can Do It All
More than providing technical insight, busting this myth is about understanding system architecture and APM products’ marketing positions. As the APM industry continues to grow and mature, it’s important to understand a few things:
- It’s vital for web-based companies to have a good monitoring solution in place. There is an accurate comparison to be made between monitoring and health insurance: it’s foolish to go without either. However, a major difference is that whereas you hope to never need your health insurance, you will need your monitoring solution at some point.
- There are several extremely powerful solutions available, with varying use-cases and focuses. Depending on your specific needs and resources, different APM products, aimed at different parts of your system, will offer different advantages and different limitations.
- Your system is divided among several layers and tiers. Even if you can see one tier clearly, that doesn’t mean nothing is happening elsewhere in your system that you should be aware of.
With these ideas in mind, you can begin to understand how database monitoring fills one of the biggest holes left by app-centric APM products, as it sees activity that only occurs in the polyglot persistence tier. Until you’re able to view your database insightfully, you’ll be operating with critical visibility gaps. App-focused solutions can have a huge positive impact, but it shouldn’t be at the cost of leaving your database unmonitored.
3. Global Metrics Can Give a Full Picture of Your System’s Health
There are plenty of metrics that are nice to look at and can be translated into handsome charts and visuals. But vanity metrics obviously provide little help when addressing real issues and systems with high demands.
The more important message here is how you can define what truly valuable metrics look like: we describe these as “work-centric.” These are readings that go much deeper than status counters or wide-angle-view resource metrics. The problem with those is that they’re all relative. The danger zones or ideal states for a system’s CPU, disk I/O, or cache ratio are totally dependent on the needs of the specific system at hand. What might be disastrous for one setup might be perfectly satisfactory for another.
The philosophical difference when we think about APM in terms of “work” is that we recognize that a system’s purpose is performance. As Baron Schwartz wrote in an article for O’Reilly in 2013 (emphasis from the original), “Workload should be regarded as primary importance. We have servers to do work for us... We measure success by how much work the system can do for us, and how consistently. In other words, we want to know the speed and quality of getting-work-done.”
Our friends at Datadog wrote an article about Monitoring 101, where they outline some key work metrics, things like “throughput,” “success,” “error,” and “performance.” These are all intrinsically extremely granular concepts, and they’re powerful for that reason. We encourage each system and its users to establish a measurable unit that represents work-getting-done in whatever database you use. “Queries that are executing” is a good example, and it’s exactly this kind of approach, paired with advanced statistics and mathematical insights from foundations like queueing theory, that powers and defines our adaptive fault detection.
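One way to make a unit like “queries that are executing” concrete is Little’s law from queueing theory, which relates average concurrency to throughput and latency. The Python sketch below is only an illustration under that assumption. The function name and the sample numbers are hypothetical, and it does not describe how any particular product computes its metrics.

```python
def average_concurrency(completed_queries, total_query_seconds, window_seconds):
    """Little's law: average queries executing = throughput * mean latency."""
    throughput = completed_queries / window_seconds          # queries per second
    mean_latency = total_query_seconds / completed_queries   # seconds per query
    return throughput * mean_latency


# Hypothetical one-minute sample: 6,000 queries that spent 120 seconds executing in total.
concurrency = average_concurrency(
    completed_queries=6000,
    total_query_seconds=120.0,
    window_seconds=60.0,
)
print(f"average queries executing: {concurrency:.2f}")
```

Because the result is expressed in units of work rather than resource usage, it can be compared across time for the same system without first deciding what a "healthy" CPU or cache ratio looks like.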
This position also helps us make a clear distinction between “monitoring” and “performance management,” the latter of which is a much more useful term. Our goal is to be able to provide customers not just with a “framable snapshot” of their systems, but incisive feedback that can lead users toward real, meaningful solutions and proactive performance management.
4. Root-Cause Analysis is the Holy Grail of Monitoring
The debunking of root-cause analysis is actually fairly straightforward. That's because, as John Allspaw wrote in his blog, “There is no root cause.”
The rationale behind this statement and the web of influences that cause people to seek root causes is much more complicated. We recently wrote an extended explanation about how recognizing that systems do not have a root cause translates into a more informed and powerful approach to performance management overall.
When systems reach certain heights of interconnectivity, root causes can no longer accurately explain how they behave. In fact, for many contemporary systems, events occur that have no single underlying cause. Such is the case for high-performance databases, as it is for human organizations, managerial structures, and ways to distribute responsibility. As powerful as root-cause logic might seem to be, it’s dependent on reductionist thinking and has all the attendant weaknesses.
Like other major monitoring myths, the expectation for root-cause analysis comes from the wish to make a complex goal simpler than it can truly ever be, by definition.
5. More Active Alerting Always = Better Alerting
Monitoring should be proactive, and it should point out problems before they evolve into disasters. But does that mean the sensitivity of your alerting should be turned up as high as possible? In short, no. Or, not unless you have a precise calibration of specificity. As Dan Slimmon discusses in his article “Car alarms and smoke alarms,” alerting at a sensitivity much higher than your specificity can lead to a harmful number of false positives. As he continues,
“You’ll often be tempted to favor high sensitivity at the cost of specificity, and sometimes that’s the right choice. Just be careful: avoid the base rate fallacy by remembering that your false-positive rate needs to be much smaller than your failure rate if you want your test to have a decent positive predictive value.”
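The arithmetic behind that warning can be sketched in a few lines of Python using Bayes’ rule. The function name and all of the rates below are hypothetical numbers chosen for illustration, not figures from Slimmon’s article: a check that catches 99% of real failures, a system that is actually failing 0.1% of the time, and two different false-positive rates.

```python
def positive_predictive_value(sensitivity, false_positive_rate, failure_rate):
    """Probability that an alert corresponds to a real failure (Bayes' rule)."""
    true_alerts = sensitivity * failure_rate
    false_alerts = false_positive_rate * (1.0 - failure_rate)
    return true_alerts / (true_alerts + false_alerts)


# Hypothetical check: false-positive rate (1%) is ten times the failure rate (0.1%).
ppv_noisy = positive_predictive_value(
    sensitivity=0.99, false_positive_rate=0.01, failure_rate=0.001
)

# Same check, but with a false-positive rate (0.01%) well below the failure rate.
ppv_quiet = positive_predictive_value(
    sensitivity=0.99, false_positive_rate=0.0001, failure_rate=0.001
)

print(f"PPV with 1% false positives:    {ppv_noisy:.1%}")
print(f"PPV with 0.01% false positives: {ppv_quiet:.1%}")
```

With these assumed rates, roughly nine out of ten alerts from the first check are false alarms even though it sounds accurate, while the second check is right about nine times out of ten.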
When necessary, resist the temptation to over-alert! Likewise, you should almost never alert on thresholds — they just cause too much noise, which leads to your own desensitization. As Baron has written,
“One of the worst things about most monitoring systems is the incredible amount of noise they generate. IT staff members react to this predictably: they filter out as many alerts as they can, and ignore most of the rest. As a result, a system that didn’t really work well to begin with becomes even less useful.” Don’t let that be the case for you.