Monitoring is a lot of work. A monitoring system is a service unto itself, and like other applications it needs care, feeding, configuration, databases, DBAs, patching, and maintenance. Though it often prevents the loss of revenue, it doesn’t generate revenue except in rare circumstances. Last, it requires vigilance and attention to detail, both of which are often overlooked or impossible to sustain in a busy IT environment. So why even bother?
Put simply, we monitor to find and fix problems before our customers and users do, so they’re not tempted to go somewhere else for their purchases, or take our names in vain when trying to get to their email. We also use our monitoring data for business decisions. Should we buy more or less capacity for our services? How much will moving an application to the cloud cost? Is a decision we made affecting availability? Having quality historical monitoring data is invaluable for making decisions.
There are three main ways we find and fix problems. First, we monitor systems and services directly, inferring their availability from tests we create. Second, we collect performance data from all our systems and infrastructure, allowing us to find less obvious problems like busy uplinks and system limits. Last, we collect log data from all our systems and infrastructure and attempt to correlate errors and logging bursts with performance and availability issues.
If you ask any IT staffer what monitoring is, they’re likely to respond first with break-fix or availability testing. This is the sort of monitoring everybody does right away, testing an application or service to see if it’s online and working correctly. We do this testing in two ways. There’s direct testing of applications and services, where we test each component and look for specific responses. This kind of testing is ideal; we know that something works because we test it specifically, looking for exactly what we need. It is also much harder to set up, taking much more time and requiring meticulous attention to detail in order to test each component of a system.
Alternatively, there’s indirect testing. This type of testing infers the availability of applications and services based on other signs. For example, we might infer that a web application is working because the web server process is present, or because the web server responds on TCP port 443. This type of testing is much easier to set up, partly because monitoring tools can do these types of tests out of the box, and partly because you don’t need to get deep into the specifics of an application.
Indirect testing isn’t ideal. Any alarms generated require a lot of human involvement to confirm and diagnose. It’s also very possible that no alarm will be generated for a particular problem. While it’s very easy to start with indirect testing, IT organizations should strive to move toward direct testing of systems and applications, instrumenting applications directly and monitoring every layer of the stack.
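To make the difference concrete, here is a minimal sketch of one indirect check and one direct check, using only the Python standard library. The function names, timeouts, and the idea of searching the page body for an expected string are assumptions made for illustration; a real monitoring tool would have its own way of expressing these tests.

```python
import http.client
import socket
import urllib.request


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Indirect check: does anything accept TCP connections on host:port?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def page_contains(url: str, expected: str, timeout: float = 3.0) -> bool:
    """Direct check: fetch the page and look for a specific string in the body."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
            return response.status == 200 and expected in body
    except (OSError, http.client.HTTPException):
        # Covers refused connections, timeouts, and HTTP error statuses.
        return False
```

A web server that is up but serving a maintenance page would pass port_is_open("www.example.com", 443) and still fail page_contains("https://www.example.com/login", "Sign in"), which is exactly the kind of problem an indirect test never alarms on. The host, URL, and expected string here are placeholders.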
The second main type of monitoring IT does is performance monitoring, gathering data about the infrastructure and applications to determine if everything is working as expected. As with availability monitoring, there are good and bad ways to do this. First, it’s important that every part of the infrastructure be monitored. Any system, uplink, application, or OS that isn’t being polled for performance data is a place where problems like latency can hide. This can be difficult in complicated environments. Management protocols like CIM, SMI-S, and SNMP are far from simple, and often arcane afterthoughts in a vendor ecosystem.
It is very important to collect and keep good data. That may seem obvious, but many performance monitoring systems begin averaging data after a certain amount of time to conserve disk space. This averaging masks peaks and troughs, and effectively makes the data useless for any troubleshooting or serious trending. Disk space is cheap, and folks should find a way to keep the high-resolution data as long as they can.
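A small sketch shows how much averaging can hide. The readings below are invented for illustration (twelve five-minute samples of uplink utilization, in percent), and downsample_mean is a hypothetical helper that mimics the roll-up a monitoring system might apply to older data.

```python
def downsample_mean(samples, bucket_size):
    """Roll up consecutive samples into one average per bucket."""
    averages = []
    for i in range(0, len(samples), bucket_size):
        bucket = samples[i:i + bucket_size]
        averages.append(sum(bucket) / len(bucket))
    return averages


# Hypothetical data: one hour of five-minute readings.
# The sixth reading shows the uplink saturated for five minutes.
five_minute = [20, 22, 21, 19, 23, 100, 20, 21, 22, 20, 24, 24]

# Collapse the hour into a single averaged data point.
hourly = downsample_mean(five_minute, bucket_size=12)

print(max(five_minute))  # 100
print(hourly)            # [28.0]
```

Once only the hourly figure remains, the link appears to have been 28 percent busy all hour, and the five minutes of saturation that users actually felt are gone from the record.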
Last, of course, is the need to present the data in a useful way. In a crisis, when time is short, you should not be struggling to view data. Graphical displays are essential, and graphing systems that allow flexible zooming and data selection are extremely useful. It is also nice to be able to export the data for consumption in other tools like Excel, or even statistical tools like SAS or R. This can be especially handy when looking at trends for business planning and system design.
Log Monitoring and Correlation
The last of the primary types of monitoring is log monitoring and correlation. For years, SysAdmins have been sending their logs to a central repository for security reasons, but recently there has been a lot of good work to take those logs and use them as a source of intelligence. A good log monitoring and correlation system can spot log messages of interest and correlate them with performance and availability issues. For example, a virtualization system that begins complaining about storage path issues can be correlated with an application performance issue. In turn, this helps us humans quickly see both the root cause of a problem and the effect it’s having.
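The storage path example can be sketched in a few lines. This is a toy illustration and not how any particular product works: the log messages, host names, regular expression, and five-minute window are all made up, and correlate is a hypothetical helper that pairs each period of poor application performance with interesting log lines seen shortly before or during it.

```python
import re
from datetime import datetime, timedelta

# Assumed pattern for "log messages of interest"; real messages vary by vendor.
STORAGE_PATH_PATTERN = re.compile(r"path .* (down|degraded)", re.IGNORECASE)


def correlate(log_lines, slow_periods, window=timedelta(minutes=5)):
    """Pair each slow period with matching log lines near it in time.

    log_lines: list of (timestamp, message) tuples
    slow_periods: list of (start, end) tuples
    """
    results = []
    for start, end in slow_periods:
        related = [
            (ts, msg)
            for ts, msg in log_lines
            if STORAGE_PATH_PATTERN.search(msg) and start - window <= ts <= end
        ]
        results.append(((start, end), related))
    return results


# Invented sample data.
logs = [
    (datetime(2024, 1, 15, 9, 58), "hypervisor01: storage path vmhba2 down"),
    (datetime(2024, 1, 15, 10, 30), "hypervisor01: user login"),
    (datetime(2024, 1, 15, 14, 0), "hypervisor02: storage path vmhba1 degraded"),
]
slow = [(datetime(2024, 1, 15, 10, 0), datetime(2024, 1, 15, 10, 20))]

for (start, end), related in correlate(logs, slow):
    print(f"Slow period {start:%H:%M} to {end:%H:%M}")
    for ts, msg in related:
        print(f"  {ts:%H:%M} {msg}")
```

Only the 09:58 message falls inside the window around the slow period, so it is the one surfaced as a likely cause. The 14:00 message matches the pattern but is hours away from any performance problem, and the login message is not of interest at all.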
Monitoring has a distinct ROI, though it may be hidden or overlooked among more glamorous IT projects. Monitoring requires a commitment and a detail-oriented mindset. It’s very easy to make monitoring an afterthought, always saying you’ll get back to it, but never actually doing so. Done right, though, monitoring lets you find and fix availability problems quickly, minimizing disruption to users and to the lives of IT staff. It also lets you provide the business with solid data to underpin system designs and procurement decisions. Who doesn’t want to do that, especially as IT becomes more business-oriented?