How NOT to Monitor Your Database
Do you have experience putting out backend database fires? What were some things you wished you had done differently? Proactive database monitoring is more cost efficient, manageable, and sanity-saving than reactive monitoring. We reviewed some of the most common mistakes - too many log messages, metric “melting pots,” retroactive changes, incomplete visibility, undefined KPIs - and put together an action plan on how to prevent them. From our experience, we've listed out the top 5 biggest (and preventable!) database monitoring pitfalls.
There never seem to be enough logging levels to capture the desired granularity and relevance of a log message accurately. Is it
DEBUG? What if it’s
DEBUG but it’s for a condition we should
WARN about? Is there really a linear hierarchy here? If you’re like most people, you’ve seen at least once an extension of those types of standard logging levels on top of a widely available logging system in an attempt to add even more custom levels. There exists a good argument that there should really only be two types of log messages: those useful for writing and debugging the code, and those useful for operating it.
Mixed Status and Configuration Variables
Many systems don’t distinguish between status variables, which signal the system’s state, and configuration variables, which are inputs to the system’s operation. For example, in both MySQL and Redis, the commands to get system status will return mixtures of configuration variables and status metrics. Such a metrics “melting pot” is a very common problem that usually requires custom code or exception lists (blacklist/whitelist) to identify which variables are what.
Breaking Backwards Compatability
If you change the meaning or dimensions of a metric, ideally you should leave the old behavior unchanged and introduce a replacement alongside it. Failure to do this causes a lot of work for other systems. For example, in MySQL, the SHOW STATUS command was changed to include connection-specific counters by default, with the old system-wide global counters accessible via a different query syntax. This change was just a bad decision, and it caused an enormous amount of grief. Likewise, the meaning of MySQL’s “Questions” status variable was changed at one point, and the old behavior was available in a new variable called “Queries.” Essentially, they renamed a variable and then introduced a new, different variable with the same name as the old one. This change also caused a lot of confusion. Don’t do this.
Again, the easiest example of this is in MySQL, which has had a
SHOW VARIABLES command for many years. Most, but not all, of the server’s command line options had identically named variables visible in the output of this command. But some were missing entirely, and others were present but under names that didn’t match.
The list of crucial metrics for finding and diagnosing performance issues isn’t that large. Metrics such as utilization, latency, queue length, and the like can be incredibly valuable, and can be computed from fundamental metrics, if those are available. But you’d be surprised how many systems don’t have any way to inspect these key metrics, because people without much knowledge of good monitoring built the systems. For example, PostgreSQL has a standard performance counter for transactions, but not for statements, so if you want to know how many queries (statements) per second your server is handling, you have to resort to much more complex alternatives. This lack of a basic performance metric (throughput) is quite a serious oversight.