This post is part of an ongoing series on the best practices for effective and insightful database monitoring. Much of what's covered in these posts is unintuitive, yet vital to understand. Previous posts have covered Why Percentiles Don't Work the Way You Think; how to avoid getting to a point When It's Too Late to Monitor; how to tell If a Query Is Bad; and an explanation of why, when looking at charts, you should understand that A Trendline is a Model.
Alerting when a metric crosses a threshold is one of the worst things you can do.
Why? Because threshold-based monitoring is prone to false positives and false negatives, which correspond to false alarms and missed alarms (a.k.a. useless noise and useless silence).
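To make the failure mode concrete, here's a minimal sketch (the threshold value, metric, and noise parameters are all hypothetical, not from any particular monitoring tool). A static threshold fires on harmless noise spikes and stays silent on sustained-but-subthreshold trouble:

```python
import random

random.seed(42)

THRESHOLD = 90  # hypothetical static alert line, e.g. CPU percent

def threshold_alert(value, threshold=THRESHOLD):
    """Naive Nagios-style check: fires whenever the metric crosses the line."""
    return value > threshold

# Simulate a healthy system whose metric hovers around 70 with normal noise.
# Random spikes still trip the alert (false positives / useless noise),
# while a real problem pinned at a sustained 85 would never fire at all
# (false negative / useless silence).
samples = [random.gauss(70, 12) for _ in range(1000)]
false_alarms = sum(threshold_alert(v) for v in samples)
print(f"{false_alarms} alerts fired on a perfectly healthy system")
```

Run it and you'll see dozens of alerts on a system with nothing wrong, which is exactly the noise that ends up in an email filter.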
One of the worst things about most monitoring systems is the incredible amount of noise they generate. IT staff members react to this predictably: they filter out as many alerts as they can, and ignore most of the rest. As a result, a system that didn’t really work well to begin with becomes even less useful.
This problem is universal. When I speak at conferences, I ask how many people are using Nagios. Usually well over half of the hands go up. I then ask how many people do not have email filters set up to shuffle some Nagios alerts to /dev/null. There's usually only one or two hands remaining, and everyone laughs uncomfortably.
We talk about root cause analysis for systems, but what’s the root cause of the email filters? It’s largely due to thresholds. Simple alive/dead status checks are easy to get right, but a lot of Nagios-style alerting is based on thresholds.
This never works.
In theory it could work in rare cases, but in practice it never does. In my next post I'll explain why.