Choosing What to Monitor - Understanding Key Metrics
Monitoring has always been a loosely defined and somewhat controversial term in IT organizations. IT professionals have strong opinions about the tools they use, because monitoring performance metrics and alerting on them are key components of keeping systems online, performing well, and delivering IT’s value proposition to the business. However, as the world has shifted toward a DevOps mindset, especially in larger organizations, the mindset around monitoring systems has shifted as well. You are not going to stop monitoring systems, but you may want to rationalize which metrics you track in order to sharpen your focus.
WHAT TO MONITOR?
While operating systems, hardware, and databases all expose a litany of metrics that can be tracked, trying to watch all of them makes it hard to notice the critical things happening in your systems. That kind of deep-dive analysis is best reserved for troubleshooting one-off problems, not day-to-day monitoring. One approach to consider is classifying systems into broad categories and applying a small set of indicators that let you evaluate the health of each system; a sketch of this mapping follows the categories below.
User-facing systems like websites, e-commerce systems, and of course everyone’s most important system, email, have availability as their most important metric. Latency and throughput are secondary, but for customer-facing systems they can matter nearly as much.
Storage and network infrastructure should emphasize latency, availability, and durability: how long does a read or write take to complete, and how much throughput is a given network connection sustaining?
Database systems, much like front-end systems, should focus on end-to-end latency, but also on throughput: how much data is being processed and how many transactions complete per time period.
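To make these categories concrete, you can encode them as a simple mapping from system category to the indicators you collect. The sketch below uses hypothetical category and indicator names; treat it as a starting template rather than a standard.

```python
# Hypothetical mapping of system categories to their key indicators.
KEY_INDICATORS = {
    "user_facing": ["availability", "latency", "throughput"],
    "storage_and_network": ["latency", "availability", "durability", "throughput"],
    "database": ["end_to_end_latency", "throughput", "transactions_per_second"],
}

def indicators_for(category: str) -> list[str]:
    """Return the indicators to collect for a given category of system."""
    # Fall back to availability alone for categories not yet classified.
    return KEY_INDICATORS.get(category, ["availability"])
```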
It is also important to think about which aspects of each metric you want to alert on (that is, page an on-call operator). I like to approach this with two key rules: any page should be about something actionable (a service down, a hardware failure), and there is always a human cost to paging, so if automation can respond to the condition and fix it with a shell script, that is all the better.
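A minimal sketch of those two rules, assuming a hypothetical service name, remediation command, and paging hook: automation attempts the fix first, and a human is paged only when the condition is actionable.

```python
import subprocess

# Hypothetical alert names and remediation commands.
REMEDIATIONS = {
    "web_service_down": ["systemctl", "restart", "web.service"],
}
ACTIONABLE = {"web_service_down", "disk_failure"}  # conditions a human can act on

def page_on_call(alert: str) -> None:
    print(f"PAGE: {alert}")  # stand-in for a real paging integration

def handle_alert(alert: str) -> None:
    fix = REMEDIATIONS.get(alert)
    if fix is not None:
        try:
            result = subprocess.run(fix, capture_output=True)
            if result.returncode == 0:
                print(f"auto-remediated: {alert}")  # fixed by a script, nobody paged
                return
        except FileNotFoundError:
            pass  # remediation tooling unavailable; fall through to paging logic
    if alert in ACTIONABLE:
        page_on_call(alert)  # a human has to intervene
    else:
        print(f"logged, not paged: {alert}")  # record it, but do not wake anyone
```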
It is important to think about the granularity of your monitoring. For example, the availability of a database system might only need to be checked every 15 seconds or so, but the latency of the same system should be sampled every second or more frequently to capture all query activity. You will want to think about this at each layer of your monitoring. It is a classic tradeoff: a higher volume of collected data in exchange for more detailed information.
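Those example intervals can be expressed as a small polling loop; the probe functions here are stand-ins for real availability and latency checks.

```python
import time

def check_availability() -> None:
    print("availability probe")  # e.g. can we open a connection at all?

def sample_latency() -> None:
    print("latency sample")  # e.g. time a lightweight query

CHECKS = [
    {"interval_s": 15, "check": check_availability, "last_run": 0.0},
    {"interval_s": 1, "check": sample_latency, "last_run": 0.0},
]

def run_checks_forever() -> None:
    while True:
        now = time.monotonic()
        for entry in CHECKS:
            if now - entry["last_run"] >= entry["interval_s"]:
                entry["check"]()
                entry["last_run"] = now
        time.sleep(0.1)  # tick size bounds how fine-grained collection can be
```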
AGGREGATION
In addition to real-time monitoring, it is important to understand how your metrics look over time. This gives you key insights into things like SAN capacity and the long-term health of your systems, lets you identify anomalies and hot spots (for example, end-of-month processing), and helps you plan for peak loads. This leads to another point: you should collect metrics as distributions of data rather than as averages. For example, if most of your storage requests are answered in less than 2 milliseconds but several take over 30 seconds, those anomalies will be masked by an average. By using histograms and percentiles in your aggregation, you can quickly identify out-of-bounds values in an otherwise well-performing system.
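The masking effect is easy to demonstrate with made-up storage latencies: mostly sub-2 ms requests plus a handful of 30-second stragglers. The numbers below are illustrative only.

```python
import statistics

# 990 fast requests (1.5 ms) and 10 very slow ones (31 seconds).
latencies_ms = [1.5] * 990 + [31_000.0] * 10

mean_ms = statistics.mean(latencies_ms)
percentiles = statistics.quantiles(latencies_ms, n=100)

print(f"mean: {mean_ms:.1f} ms")          # ~311 ms: neither typical nor worst case
print(f"p50:  {percentiles[49]:.1f} ms")  # ~1.5 ms: the typical request
print(f"p99:  {percentiles[98]:.1f} ms")  # ~30,700 ms: exposes the stragglers
```

The mean reports a value no real request ever experienced, while the 50th and 99th percentiles describe both the typical case and the outliers.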
STANDARDIZE
Defining a few categories of systems and standardizing your data collection allows for common definitions and drives you toward common service level indicators (SLIs). That, in turn, lets you build a common template for each indicator and a shared goal of higher levels of service.
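One way to express that common template, sketched here with hypothetical field names, is a single indicator structure reused across every category of system.

```python
from dataclasses import dataclass

@dataclass
class ServiceLevelIndicator:
    name: str              # what is measured, e.g. "read_latency"
    unit: str              # e.g. "ms" or "percent"
    check_interval_s: int  # how often the indicator is sampled
    objective: float       # the level of service the indicator should meet

# The same template applied to two different categories of system.
web_availability = ServiceLevelIndicator("availability", "percent", 15, 99.9)
db_read_latency = ServiceLevelIndicator("read_latency", "ms", 1, 2.0)
```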