Why is a threshold-based alert such a disaster? There are two big reasons.
- Thresholds are always wrong. They’re worse than a broken clock, which is at least right twice a day. A threshold is wrong for any given system, because all systems are slightly different, and it’s wrong for any given moment during the day, because systems experience constantly changing load and other circumstances.
- Abnormalities aren’t problems. A threshold is intended to detect when something is abnormal or unacceptable, but it rests on the false premise that abnormality indicates a problem. The truth is that systems move in and out of abnormal states constantly — much of what you’d call abnormal is actually normal and common. As a result, any alert you set on a metric exceeding what you think is a normal threshold is going to fire a lot — way more than you think it will — because of conditions that aren’t even problems. The same goes for standard-deviation-based thresholds, Holt-Winters predictions, and the like; the sketch after this list shows how often such alerts fire on a perfectly healthy system. A monitoring system needs to know the difference between an unusual state and a real problem, and a threshold can’t tell them apart.
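To make that concrete, here’s a minimal simulation (a sketch for illustration, not code from any real monitoring product) of one day of a healthy system’s per-minute CPU utilization. All the numbers are invented: the daily load cycle, the noise, and the harmless bursts. Both a hand-picked static threshold and an adaptive 3-sigma threshold fire repeatedly, and not one of those alerts corresponds to a real problem:

```python
import math
import random
import statistics

random.seed(42)

# One day of per-minute CPU utilization for a healthy system: a daily load
# cycle, Gaussian noise, and occasional benign bursts (cron jobs, cache
# warm-ups). Nothing here is a problem.
samples = []
for minute in range(1440):
    baseline = 50 + 25 * math.sin(2 * math.pi * minute / 1440)
    noise = random.gauss(0, 5)
    burst = 20 if random.random() < 0.02 else 0
    samples.append(baseline + noise + burst)

# Strategy 1: a static threshold. Whatever value you pick is wrong for some
# part of the day, because "normal" itself moves with the load cycle.
static_alerts = sum(1 for x in samples if x > 80)

# Strategy 2: a rolling 3-sigma threshold over the previous hour. It adapts
# to the moving baseline, but it still flags ordinary variation as abnormal.
sigma_alerts = 0
window = 60
for i in range(window, len(samples)):
    recent = samples[i - window:i]
    mean = statistics.mean(recent)
    stdev = statistics.stdev(recent)
    if samples[i] > mean + 3 * stdev:
        sigma_alerts += 1

print(f"static threshold alerts:  {static_alerts}")
print(f"3-sigma threshold alerts: {sigma_alerts}")
# Every one of these alerts is a false positive: the system was healthy
# all day.
```

Tune the static threshold higher and you trade false positives at peak for false negatives off-peak; tighten or loosen the sigma multiplier and you face the same trade-off. The problem isn’t the tuning, it’s the premise.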
That’s why the entire idea of alerting when a metric exceeds a threshold is completely, fundamentally broken. It can’t be fixed. It’s the wrong thing to do.
We’re not the only ones who have noticed this. Just to cite one example, Mathias Meyer alluded to these problems, saying:
> Monitoring still involves a lot of looking at graphs, correlating several different time series after the fact, and figuring out and checking for thresholds to trigger alerts.
>
> … We spend an eternity looking at graphs, right after an alert was triggered because a certain threshold was crossed. Does that alert even mean anything, is it important right now? It’s where a human operator still has to decide if it’s worth the trouble or if they should just ignore the alert.
>
> … I’d like our monitoring system to be built for humans.
Mathias makes a lot of great points in his post, but a thread running between the lines, a little bit unspoken, is that there’s something utterly wrong about the idea of graphs and thresholds to begin with. A big part of what’s wrong is the impossibility of setting correct thresholds. Trying harder to make thresholds work, or trying harder to use them to detect abnormalities, is like the man who dropped his glasses in the gutter but looks for them in the street, where the light is better.
Setting thresholds on metrics isn’t going to help. Doing a more careful job of it isn’t going to help. Detecting things that are out of the ordinary isn’t going to help. You can’t get there (elimination of false positives and false negatives) from here (metrics, thresholds, and graphs).
At SolarWinds, we have developed technology that almost completely eliminates threshold-based alerts. This is one of the Holy Grails of monitoring, and many brave souls have tried to do this before and failed. Our technology is called Adaptive Fault Detection, and it works. I’m tremendously excited about it.