Why You Still Need Network Alerts
September 1, 2015
Networks
Retracing an intractable alert hairball today, I’m reminded of how critical, and often critically broken, alerting actually is in many environments. You can be a department of one with only a handful of systems sending occasional advisory messages, or a large IT team with hundreds of alerts of every variety, and you have the same problems. Most network alerts are noise, but a tiny handful are really serious. I’ve been able to improve or at least accept certain idiosyncratic IT macro-conditions over the years, but no matter where I go, broken alerting causes more indigestion than it should.
Once upon a time, I worked at an airline. I won’t say which airline, but it was an American airline. In this role, one of my systems gathered and dispatched real-time flight status information to the reservations website. It was used by visitors to check when they needed to be at the airport to pick up grandma. It also powered the system that sent flight change information to passengers’ mobile devices. Some people might argue that this is the sort of system you’d want to ensure is always working, one that should rapidly send alerts and escalates to, well, everybody. It had too many moving parts, tricky dependencies, well-intentioned crocks, third-party mainframe interfaces, and potentially upset grandmas to risk anything less.
Grandmas have a pretty effective alert system of their own. They simply call their tardy children, sounding simultaneously accommodating and disappointed, as only a grandma can. But every system in bigairline.com’s application chain used a different alerting method. Worse, the systems were split between three different business units that didn’t play well together. The TPF flight ops system wrote errors to a log I wasn’t allowed to access, and it sent IBM® MQSeries error messages to a central NOC event hub... that didn’t have security policies to allow redirection to me. I built the process that diffed flight updates against an MS SQL database to figure out what changed. I knew what my system was up to, but I had no tool to warn downstream bigairline.com of upstream trouble. Of course, I didn’t manage the firewalls, SQL cluster, network links, or other infrastructure components, all of which sent alerts somewhere, but not to me. Yes, grandma, computers are hard. Thanks for the hug.
In some IT organizations you have the additional frustration of data trolls. You know the type. They have amazing, encyclopedic understanding of their systems, but tease you with reports or log snippets that they won’t make available. They hide behind policies and departmental boundaries, or interpret IT governance regarding access as they see fit. Fortunately, data trolls are rare. Most IT pros will try to help, forwarding alerts for whatever you need. The problem is they don’t have the time or expert knowledge to configure exactly which alerts you need, so they send pretty much everything. All the time. Without end. That pushes the problem of determining what’s important downstream, straight into your email inbox.
The flight data system alerting had the same challenge as many complex systems: the easiest way to cross boundaries was email. Any individual system alerts in the way that’s most reasonable for it (syslog, log files, events, IVR calls, voodoo), but the only universal format is email. It’s convenient to configure, reasonably reliable*, and has the capacity to carry as much detail as the system needs to send. The problem, of course, is it’s dumb. You can try to create Nobel Prize-level Exchange rules to sort and dispatch them, but in the end, critical messages end up sitting in a mailbox.
*Yes, the network can go down, so you still need your trusty v.32 Smartmodem on a serial port.
To solve my email issue, I ended up building a monstrosity. I crafted a Perl cron edifice that imported emails from Novell® “Group(un)Wise” via POP, parsed their contents, split digests to a file, and then read config tables in a shared Excel® spreadsheet to figure out who was on call with the pager. I learned two important lessons from that. First, don’t use shared Excel spreadsheets for anything important, like making sure grandma isn’t freezing on the curb outside O’Hare. Second, Perl is Anger, the fifth circle of hell. (I’m sorry, but someone has to say it.)
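For the curious, here’s a rough sketch of the shape of that pipeline, redone in Python rather than the original Perl, and from memory at that. Every hostname, credential, file name, and column below is a made-up stand-in: “mail.bigairline.example” for the GroupWise POP server, “oncall.csv” for that ill-advised shared Excel sheet, and an email-to-pager gateway I’m assuming for the paging step. Treat it as an illustration, not the actual TPAP.

    # Rough sketch of what the TPAP did: poll an alert mailbox over POP,
    # look up who has the pager today, and forward anything critical.
    # All names, hosts, and credentials here are hypothetical stand-ins.
    import csv
    import poplib
    import smtplib
    from datetime import date
    from email import message_from_bytes
    from email.message import EmailMessage

    POP_HOST = "mail.bigairline.example"   # stand-in for the GroupWise POP server
    ONCALL_FILE = "oncall.csv"             # stand-in for the shared Excel on-call table
    PAGER_DOMAIN = "pager.example.com"     # hypothetical email-to-pager gateway

    def who_has_the_pager():
        """Read the on-call table and return today's pager address, if any."""
        today = date.today().isoformat()
        with open(ONCALL_FILE, newline="") as f:
            for row in csv.DictReader(f):  # assumed columns: date, admin
                if row["date"] == today:
                    return f'{row["admin"]}@{PAGER_DOMAIN}'
        return None

    def fetch_alerts():
        """Pull new messages from the alert mailbox as (subject, body) pairs."""
        box = poplib.POP3(POP_HOST)
        box.user("flightdata-alerts")
        box.pass_("not-in-a-spreadsheet-please")
        alerts = []
        count = len(box.list()[1])
        for i in range(1, count + 1):
            raw = b"\n".join(box.retr(i)[1])
            msg = message_from_bytes(raw)
            body = msg.get_payload(decode=True) or b""
            alerts.append((msg.get("Subject", ""), body.decode(errors="replace")))
            box.dele(i)                    # don't re-page the same alert
        box.quit()
        return alerts

    def page(to_addr, subject, body):
        """Forward a trimmed copy of the alert to the on-call pager address."""
        note = EmailMessage()
        note["From"] = "tpap@bigairline.example"
        note["To"] = to_addr
        note["Subject"] = subject[:60]     # pagers truncate anyway
        note.set_content(body[:200])
        with smtplib.SMTP("localhost") as relay:
            relay.send_message(note)

    if __name__ == "__main__":
        on_call = who_has_the_pager()
        for subject, body in fetch_alerts():
            if on_call and "CRITICAL" in subject.upper():
                page(on_call, subject, body)

The real thing was uglier, of course, and the whole point of the story is that the routing logic lived in a cron job and a spreadsheet instead of in the monitoring systems where it belonged.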
The hack was workable, and we always knew when the flight data died. Even better, bigairline.com knew as soon as we did. I made sure that if the failure was upstream, bigairline.com operations knew in the initial alert that it wasn’t us. But after years of trying everything I could think of, the TPAP (Temporary Perl Alert Processor) never went away, and for all I know it’s still running. I was sure that in my next job things would be different and that surely the airline was unique. Surely no other IT environment could be as messed up when it came to ensuring accurate and actionable alert flow between critical operations systems, across business boundaries, and out to the right admins for resolution. But I was wrong. And don’t call me Shirley.
Some IT teams are better than others. A few are even getting really close to an easy-to-manage operation where critical messages aren’t missed. Everyone knows who has the pager, and the pager doesn’t go off at 3 a.m. because toner is low. (Though an SMS when the CEO’s toner is low might be handy.) Great alerting happens when smart admins take time to redirect the fire hoses, invite gurus to lend their resolution expertise, and capture smart alert-handling processes in one place. It is possible; I’ve seen it and basked in its warm rays. I was even able to spend a little more time with grandma on weekends, although one time I was late when the airline failed to send an SMS that her flight was early.
Anyway, I’m adding a help desk escalate filter to an application monitor with a considerable history of flapping, and that’s what I was thinking about. It’s an off-label use, but that’s the point of being a geek—teaching your monitoring systems new tricks. That, and then humbly bragging about your supererogatory cleverness in your favorite admin community. I already have AD® integrated to steer escalates by OU, and compared to managing the TPAP, it’s a snap.
No problem, grandma. Of course I knew your flight was early.