- You install some package of plugins for monitoring MySQL’s replication status.
- One of the checks alerts on how far replication lags behind the master. You can choose warning and critical thresholds for the delay, as with any threshold-based Nagios check.
- You feel sure that the delay behind the master is very important to monitor, so clearly you must choose not one, but two thresholds. What are the right numbers?
- You reason that the server ought to run with no delay in normal circumstances. But just to give it lots of room for an occasional abnormality, you set a warning at 1 minute, and a critical alert at 5 minutes. That seems too lax, but you’re afraid to set the tolerances any tighter.
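
To make the setup concrete, here is a minimal sketch of what a check with these thresholds boils down to. The function name `classify_delay` and the choice to report a missing delay value as UNKNOWN are assumptions made for illustration, not taken from any particular plugin package. The exit codes follow the usual Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
from typing import Optional, Tuple

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify_delay(
    delay_seconds: Optional[float],
    warn: float = 60,    # warning at 1 minute, as in the story
    crit: float = 300,   # critical at 5 minutes, as in the story
) -> Tuple[int, str]:
    """Map a replication delay to a Nagios-style (exit code, message) pair.

    Hypothetical helper: a real plugin would obtain the delay from the
    server itself; here it is simply passed in.
    """
    if delay_seconds is None:
        # Assumption: no measurement available is reported as UNKNOWN.
        return UNKNOWN, "UNKNOWN - replication delay not available"
    if delay_seconds >= crit:
        return CRITICAL, f"CRITICAL - replication delay {delay_seconds}s"
    if delay_seconds >= warn:
        return WARNING, f"WARNING - replication delay {delay_seconds}s"
    return OK, f"OK - replication delay {delay_seconds}s"
```

A plugin built this way would print the message and exit with the code, for example `code, msg = classify_delay(90)` followed by `print(msg)` and `sys.exit(code)`. Note that the check only ever looks at one instantaneous number, so everything depends on the two thresholds you picked.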
A Sure-Fire Recipe For Monitoring Disaster
April 9, 2013
Database
In this post I’ll tell a story that will feel familiar to anyone who’s ever monitored MySQL. Here’s a recipe for a threshold-based alert that will go horribly wrong, beyond a shadow of a doubt.