In previous posts, I claimed that thresholds are a root of much evil in monitoring systems (not the root of all evil, but a root of much evil), and that we’ve developed threshold-free fault detection that really works.
You probably know that this is a hard problem (last week, some scientists from Bell Labs said it is equivalent to discovering cold fusion). I would be a little more modest: it’s more akin to discovering Bloom filters, or perhaps B-Trees. Yes, it’s damn cool, but it’s probably not the solution to end all solutions.
But that’s not what I care about the most. The main reason I’m excited about what we’ve discovered is that it has a lot of enabling characteristics that make hard things easy, without adding new problems:
- It has an extremely low rate of false alarms and missed alarms.
- It has no configuration. It is completely adaptive and self-learning.
- It distinguishes between abnormal and problem states. Remember, systems are constantly in abnormal states.
- It distinguishes between problems that happen within the boundaries of the system, and the effects of external problems.
- It is highly context-sensitive. It does not rely on pattern analysis of the previous week’s worth of data, time-shifting, operational envelopes, Big Data, or any other attempt to solve the problem by throwing more signals into the fray.
- It is computationally cheap (a few CPU cycles per observation and a tiny, fixed amount of memory) and thus operates in “real-time.” It detects and diagnoses problems in under two seconds. This lets it catch system faults early and small, before they grow, and long before most humans will ever notice them.
- It is highly dynamic and adapts to rapidly changing conditions while still accurately distinguishing abnormal from bad (hence the name Adaptive Fault Detection).
- Because it has no preconceived notions, it discovers problems no one has even considered. Contrast this to existing monitoring systems, where all of your checks are born of prior experience or conjecture.
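The post doesn’t reveal the underlying math, but the cost claim in the list above (a few CPU cycles and a tiny, fixed amount of memory per observation) is the profile of standard streaming statistics. As a purely illustrative sketch, not the product’s actual method, here is an exponentially weighted moving average and variance that update in O(1) time and space per observation; the class and the `alpha` value are my own assumptions:

```python
class StreamingStat:
    """Exponentially weighted moving mean and variance.

    Keeps a fixed, tiny amount of state per metric and does a few
    arithmetic operations per observation -- the kind of cost profile
    the post describes. Illustrative only; not the actual algorithm.
    """

    def __init__(self, alpha=0.05):
        self.alpha = alpha  # smoothing factor (assumed value)
        self.mean = None    # running EWMA of the metric
        self.var = 0.0      # running EWMA of the squared deviation

    def update(self, x):
        if self.mean is None:
            self.mean = x  # seed with the first observation
            return
        delta = x - self.mean
        self.mean += self.alpha * delta
        self.var = (1 - self.alpha) * (self.var + self.alpha * delta * delta)


s = StreamingStat()
for x in [10, 11, 9, 10, 12, 10]:
    s.update(x)
print(round(s.mean, 2))
```

Because the state never grows, the cost stays constant no matter how long the metric stream runs, which is what makes one-second granularity feasible at scale.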
That’s why we think Adaptive Fault Detection is an improvement over the typical monitoring system’s reliance on thresholds.
How does it work?
There is a lot of subtlety to it, but in simple terms:
- Not all metrics are created equal. We don’t have systems just to keep their CPUs busy. The purpose of a server is to do work. The metrics that measure the work being done are more important than others.
- Therefore, we need to know the meaning of at least some metrics, if not all of them.
- We define a fault as “the system is being asked to perform work and it’s not getting it done.”
- We use a special mathematical sauce.
- We apply a special type of adaptive damping to suppress false positives when there’s no way to know if a fault is real.
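The steps above can be sketched in code. The “mathematical sauce” is not disclosed, so the fault rule here is just the stated definition (work is being requested but not completed), and the damping is a crude stand-in (require the condition to persist for a few consecutive observations). All names and parameters are my own assumptions, not the product’s algorithm:

```python
def detect_faults(samples, confirm=3):
    """Toy detector in the spirit of the list above.

    `samples` is a sequence of (arrivals, completions) pairs, one per
    observation interval. A fault is flagged when the system is being
    asked to perform work but isn't getting it done, and only after
    the condition persists for `confirm` consecutive observations --
    a simple stand-in for the post's adaptive damping.
    """
    streak = 0   # consecutive suspect observations
    faults = []
    for t, (arrivals, completions) in enumerate(samples):
        # Fault definition: work requested, none completed.
        suspect = arrivals > 0 and completions == 0
        streak = streak + 1 if suspect else 0
        # Damping: report once, only when the stall is confirmed,
        # suppressing one-off blips.
        if streak == confirm:
            faults.append(t)
    return faults


# Each tuple is (work requested, work completed) in one interval.
samples = [(5, 5), (6, 6), (7, 0), (7, 0), (8, 0), (8, 0), (5, 5)]
print(detect_faults(samples))  # stall begins at index 2, confirmed at index 4
```

Note that the detector never compares a metric to a fixed threshold; it compares demand against throughput, which is what lets it separate “abnormal” from “not getting the work done.”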
That’s about all there is to it. Of course, this is all nice in theory, but does it work in practice? Yes, it does. First, I have over 10,000 samples of workloads, ranging in length from a day to two weeks, of all kinds of systems, at one-second granularity. I have literally inspected the results of this algorithm on thousands of workloads. Unsurprisingly, it also works great on our alpha customers’ systems in production.
After all that hype, this may sound odd, but I don’t want to make too big a deal out of Adaptive Fault Detection. That’s because current monitoring systems have much more fundamental shortcomings than their use of thresholds. Even if the thresholds-beget-false-alarms problem were solved, monitoring systems like Nagios would still fall far short of what’s really needed. We’re aiming our sights a lot higher than just fixing false alarms.
Sign up for a free trial, and you can kick the tires and see for yourself.