IT Alerting Best Practices: Notification Routing for IT Ops
Every operations team has its fair share of monitoring solutions. While you may not have achieved the perfect state of a single pane of glass, you have likely settled on two or three solutions that cover all the hardware and software that support your business. You even invested considerable time and effort to implement these solutions not just with out-of-the-box settings, but with tailored IT alerting thresholds and alarms that suit your environment’s specific needs. Look at you!
It’s easy to overlook the next step of your monitoring deployment: notifications. Usually, one of two things will happen:
- You get too many alerts, and subsequently turn off alerts
- You get too many alerts, and subsequently create an Outlook rule to trash them all
It’s the age-old signal-to-noise problem in IT. How do you fine-tune your notifications, so they alert you to events that deserve your attention while filtering out all of the notifications that are not actionable? Your first thought might be to turn off any performance-related alerts and just receive system or device down notifications. But if that’s all you’re looking to get out of notifications, you should just write a PowerShell script to run a Test-Connection against your server list and Send-MailMessage when a host is down. (That’s mostly sarcasm.)
Instead of throwing the baby out with the bath water, here are some monitoring and alerting best practices for reducing notification overload.
Inventory Your Applications
First things first: no one cares about your servers like you do. You invest countless hours building, installing, patching, backing up, repairing, and generally supporting these virtual beasts. Even if you’re running an automated shop (which you’re not), you still train your attention on the infrastructure. But the business cares about the applications.
Application monitoring will always be more important to the business than infrastructure monitoring.
So, if you don’t have a list of your apps (which should include URLs in this modern SaaS era), get one together. Without a reliable and accurate inventory, you’ll never know whether you’re monitoring everything the business depends on.
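One lightweight way to keep such an inventory is as structured data that you can compare against what your monitoring tool actually checks. The sketch below is illustrative only: the application names, URLs, team names, and the `monitored_urls` set are all hypothetical stand-ins for whatever your own tools export.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class App:
    """One entry in the application inventory."""
    name: str
    url: str
    owner_team: str


# Hypothetical inventory; replace with your own applications.
INVENTORY = [
    App("intranet", "https://intranet.example.com", "web-team"),
    App("payroll", "https://payroll.example.com", "finance-apps"),
    App("helpdesk", "https://helpdesk.example.com", "service-desk"),
]


def unmonitored_apps(inventory, monitored_urls):
    """Return the apps whose URL has no check in the monitoring tool."""
    return [app for app in inventory if app.url not in monitored_urls]


# Hypothetical export of the URLs your monitoring tool currently checks.
monitored_urls = {
    "https://intranet.example.com",
    "https://helpdesk.example.com",
}

gaps = unmonitored_apps(INVENTORY, monitored_urls)
print([app.name for app in gaps])
```

Any app that shows up in `gaps` is something the business relies on that nobody is watching.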
Map Applications to Devices
Now it’s time to correlate infrastructure with applications. In other words, if server org1east-c goes offline, what applications are affected? What if the NAS doesn’t survive a firmware upgrade? When you can draw direct connections between your applications and the infrastructure, you can shift the focus of your monitoring (and eventually notification routing) to the right teams as quickly as possible.
The payoff of this exercise is that you can tune your alerts and notifications to reach the right teams right away.
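A minimal sketch of that mapping, assuming you record each application's dependencies by hand or export them from a CMDB. Apart from org1east-c, every device and application name here is hypothetical. The idea is to invert the "app depends on devices" list so that a device name answers the question "what applications are affected?"

```python
from collections import defaultdict

# Hypothetical mapping: application -> devices it depends on.
APP_DEPENDENCIES = {
    "intranet": ["org1east-c", "nas-01"],
    "payroll": ["org1east-d", "nas-01"],
    "helpdesk": ["org1east-c"],
}


def build_device_index(app_dependencies):
    """Invert app -> devices into device -> sorted list of apps."""
    index = defaultdict(set)
    for app, devices in app_dependencies.items():
        for device in devices:
            index[device].add(app)
    return {device: sorted(apps) for device, apps in index.items()}


def affected_apps(device, device_index):
    """Apps that depend on the device; an empty list if none are mapped."""
    return device_index.get(device, [])


DEVICE_INDEX = build_device_index(APP_DEPENDENCIES)
print(affected_apps("org1east-c", DEVICE_INDEX))  # ['helpdesk', 'intranet']
print(affected_apps("nas-01", DEVICE_INDEX))      # ['intranet', 'payroll']
```

A device that returns an empty list is worth a second look: either nothing depends on it, or the mapping is incomplete.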
Create Your Notifications
A monitoring solution that detects issues but fails to notify the responsible IT staff is useless. But because you’ve got a list of your apps, and you have mapped connections between your server infrastructure and applications, you’ll be able to set up your notification routing efficiently.
For example, if one of your load balanced web servers goes offline, you can have an alert sent to the server team to investigate the server. But don’t stop there. Also have an alert sent to the team that supports the website or app that relies on that web server. They may not need to take any corrective action, but they’ll certainly appreciate a heads-up that there may be infrastructure trouble brewing. And you’ll also avoid fielding calls from the web team asking, “What’s going on?”
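The routing in that example can be sketched as a small lookup. This is an illustration under stated assumptions, not any monitoring product's API: the device, application, and team names are hypothetical, and the function only decides who should be told. Actually sending the messages is left to whatever mail or chat integration you already use.

```python
# Hypothetical ownership data; in practice this comes from your
# application inventory and your application-to-device mapping.
DEVICE_OWNERS = {"web-03": "server-team"}
DEVICE_APPS = {"web-03": ["storefront", "blog"]}
APP_OWNERS = {"storefront": "web-team", "blog": "web-team"}


def route_device_down(device):
    """Return (team, kind, message) tuples for a device-down event.

    The device's owning team gets an actionable alert; each team that
    supports an affected application gets one informational heads-up.
    """
    notifications = []
    notified = set()

    owner = DEVICE_OWNERS.get(device)
    if owner:
        notifications.append(
            (owner, "action", f"{device} is down; please investigate.")
        )
        notified.add(owner)

    for app in DEVICE_APPS.get(device, []):
        team = APP_OWNERS.get(app)
        # Skip unmapped apps and teams that have already been told.
        if team and team not in notified:
            notifications.append(
                (team, "info", f"Heads-up: {device}, which supports {app}, is down.")
            )
            notified.add(team)

    return notifications


for team, kind, message in route_device_down("web-03"):
    print(f"[{kind}] {team}: {message}")
```

Separating the actionable alert from the informational one lets each team filter on the kind of message rather than muting everything.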
Routing notifications isn’t the most exciting part of deploying a monitoring solution, and it’s likely the most difficult. Not because of the technology, but because of the deep dive required to really understand the connection between your applications and your infrastructure.