
Too Many Wrong Alerts Can Hurt Your Business

As the famed psychologist Abraham Maslow once said, “If all you have is a hammer, everything looks like a nail.” Because IT management systems make it simple and easy to create alerts for all kinds of events and conditions, the temptation may be to create alerts for all of them. This can be problematic for a variety of reasons, though two stand out in particular:
  • If you create a plethora of alerts to which IT staff must respond, they’ll end up spending the bulk of their time on them. This may make IT operations less productive and more expensive instead of boosting IT productivity and promoting savings.
  • If IT staff must deal with mountains of alerts, how will they know which ones are important and which ones can be handled later? Again, lack of priority or urgency may make IT less effective and efficient and more costly.

Choosing and Using Alerts Wisely

Rather than pursuing a “more is better” strategy when creating alerts, IT organizations should focus instead on a “small is beautiful” system and application alerting strategy. Several other considerations matter when creating management alerts as well. These include the following:
  • Accurate classification and prioritization. Anything threatening the organization’s ability to conduct business, collect money, or keep customers happy needs immediate attention. Other conditions or events—such as resource usage thresholds or slow response times—must be addressed, but probably not right away. IT staff needs guidance on what’s important and the order in which they should tackle pending alerts. From a cost management perspective, putting out expensive fires and limiting losses (both actual and potential) must drive alert scheduling and assignment.
  • Remediation guidance and help. Where automation can cover and resolve alerts, it should do so. In such cases, IT staff need not get involved unless automated resolution fails. Where automation can’t handle an alert entirely on its own, if it can provide guidance based on prior history, known remediation or workaround techniques, or useful resolution strategies, it should provide this information to give IT staff a jump-start and make sure they’re heading in the right direction. In most studies of IT effectiveness and cost control, automation appears as the number one cause for improved efficiency, responsiveness, and—consequently—cost savings.
  • Automated verification. Wherever possible, the management system should automate alert checks as part of the alerting process. Prior to issuing alerts of any kind, especially those requiring human involvement or intervention, it should eliminate as many false positives as possible from the pending alert queue. In general, anything capable of reducing the number of alerts without compromising performance, integrity, or security or adversely affecting key performance indicators (KPIs) is a good thing. Here, again, it’s a matter of separating signal from noise so IT teams can stop wasting time (and money) on low-priority or unnecessary responses.
  • Well-informed baselines. Using a well-informed baseline to drive alerts is key. What may provoke an alert in one situation may be entirely normal in another situation. For example, frequent and repeated use of administrative privileges to make and move copies of applications and data sets is normal during a system migration. Under any other circumstances, it should raise red flags. In such a case, it might be wise to temporarily suspend these security alerts while migration is underway. Likewise, resource consumption and usage thresholds may need to be reset during peak or end-of-cycle seasons to account for higher levels of activity and use. By taking account of current and expected conditions, IT teams can focus their efforts where they’ll do the most good. Ultimately, this delivers better services at a lower overall cost.
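
The last two points in the list above (weeding out false positives before an alert is issued, and letting current conditions decide what counts as abnormal) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names SuppressionWindow, should_alert, and confirmed, the rule name "admin-bulk-copy," and the three-check confirmation rule are all hypothetical and do not come from any particular monitoring product.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class SuppressionWindow:
    """A planned period (hypothetical structure) during which one rule is muted."""
    rule: str
    start: datetime
    end: datetime

    def covers(self, rule: str, when: datetime) -> bool:
        # True only for the named rule, and only inside the planned period.
        return rule == self.rule and self.start <= when <= self.end


def should_alert(rule: str, when: datetime, windows: list[SuppressionWindow]) -> bool:
    """Return False when a planned activity makes this rule's trigger expected."""
    return not any(w.covers(rule, when) for w in windows)


def confirmed(recent_checks: list[bool], required: int = 3) -> bool:
    """Treat a condition as real only if the last `required` checks all tripped.

    Assumption: requiring consecutive breaches is one simple way to filter
    short-lived spikes out of the pending alert queue.
    """
    tail = recent_checks[-required:]
    return len(tail) == required and all(tail)


# Usage: mute the bulk-copy security rule while a migration is underway.
migration = SuppressionWindow(
    rule="admin-bulk-copy",
    start=datetime(2024, 3, 1),
    end=datetime(2024, 3, 3),
)
during = should_alert("admin-bulk-copy", datetime(2024, 3, 2, 14, 0), [migration])
after = should_alert("admin-bulk-copy", datetime(2024, 3, 5), [migration])
print(during, after)
```

Because the window is tied to a single rule and a fixed end time, every other alert keeps firing during the migration, and the muted rule comes back on its own once the window closes.
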
Above all, organizations want to avoid generating incidents for their own sake. This can’t help but lead to “alert fatigue,” causing IT staff to close tickets just for the sake of clearing the backlog. It’s more important to make sure tickets get generated, and actions are taken in response, when something meaningful has happened. Anything else leads to wasted time and effort, resulting in higher overall costs of service (or “ticket-closing costs”). A couple of examples should show that closing incidents may not always be desirable or represent a good investment of time and effort. Hopefully this raises the question: why issue so many incidents in the first place?

Example 1: Exceeding a Meaningless Threshold

In keeping with standard policy, an alert comes up showing one server disk is 90% full. An IT staffer is assigned, logs in, checks the disk space, and manages to recover a mere 2MB after 30 minutes of investigation and cleanup. She then determines this disk always runs at 90% capacity, and it’s normal and acceptable to stay this way. She closes the incident and moves on to her next assignment. Alas, this simply means the same incident gets generated and elicits the same response next month, and the month after, and… lather, rinse, and repeat forever. The right response here, after clearing the incident, is to file a change request to inform management staff the threshold for this disk is set too low. An investigation to determine a new and higher threshold value that will actually prompt a meaningful response—such as 95% or 99%—might make a useful change. On the other hand, a request for a bigger disk might also be warranted. In general, anything IT can do to reduce the number of spurious alerts will help staffers focus on more important alerts, thereby improving efficiency and saving on overall IT costs.
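
One way to approach the investigation into a new threshold is to derive it from the disk's own usage history rather than from a blanket policy value. The sketch below is a minimal illustration, not a recommended formula: the function name suggest_threshold, the three-standard-deviation margin, and the sample readings are all assumptions made for the example.

```python
from statistics import mean, pstdev


def suggest_threshold(history_pct: list[float], default: float = 90.0,
                      margin_sigmas: float = 3.0, ceiling: float = 99.0) -> float:
    """Suggest a usage threshold that sits above this disk's normal range.

    history_pct holds past usage readings in percent. With no history,
    fall back to the default policy value.
    """
    if not history_pct:
        return default
    # Normal level plus a margin for ordinary fluctuation.
    baseline = mean(history_pct) + margin_sigmas * pstdev(history_pct)
    # Never go below the default policy value or above the hard ceiling.
    return round(min(ceiling, max(default, baseline)), 1)


# Usage: a disk that always hovers around 90% (made-up readings).
readings = [89.5, 90.0, 90.5, 90.0]
print(suggest_threshold(readings))
```

A disk that normally sits well below the policy value keeps the default threshold, so only disks whose normal state already trips the alert get a higher one. The ceiling keeps the suggested value from ever reaching 100%, where an alert would arrive too late to be useful.
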

Example 2: Lost in Alert Overload

Imagine a situation where an organization has created dozens of alerts for its IT staff to handle. These alerts are poorly prioritized and lack sufficient context for IT staffers to separate hair-on-fire situations from smoldering embers. Without a clear sense of which alerts are most dire, IT staffers end up wasting too much time fixing things that don’t adversely affect the business. Worse, these same IT staffers sometimes overlook or don’t get around to resolving tickets costing the organization time, money, and customers because of their inability to place, track, or deliver orders. A proper IT operations management (ITOM) environment includes context with alerts so IT staff can understand the conditions, events, or issues reported within the bigger picture of what’s expected within the organization’s systems. This helps to make sense of how the alert diverges from the norm and what it could portend. Likewise, it’s essential to impose a classification scheme on alerts to help with their priority (e.g., critical, severe, routine, informational) so prioritization and focus naturally fall where they’re most needed. Figure 1 shows a common classification scheme used for fire danger alerts.

Figure 1: The level of intensity and potential danger goes up as levels of observed fire danger increase. It’s a good model for IT alerts, too.
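
The classification scheme described in this example (critical, severe, routine, informational) maps naturally onto an ordered type that a pending-alert queue can sort by. The following is a rough sketch under assumptions: the Alert fields, the blocks_revenue flag, and the sample alerts are hypothetical and chosen only to mirror the order-processing scenario above.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Lower value means more urgent, so a plain sort puts urgent alerts first."""
    CRITICAL = 0
    SEVERE = 1
    ROUTINE = 2
    INFORMATIONAL = 3


@dataclass
class Alert:
    name: str
    severity: Severity
    blocks_revenue: bool = False  # hypothetical flag: orders cannot be placed or delivered
    context: str = ""             # what is normal here, and how this alert diverges


def triage(alerts: list[Alert]) -> list[Alert]:
    """Order alerts so revenue-blocking ones come first, then by severity."""
    return sorted(alerts, key=lambda a: (not a.blocks_revenue, a.severity))


# Usage with made-up alerts.
queue = triage([
    Alert("cpu-high", Severity.ROUTINE, context="batch job window, expected"),
    Alert("order-api-down", Severity.CRITICAL, blocks_revenue=True),
    Alert("cert-expiring", Severity.SEVERE, context="expires in 14 days"),
    Alert("login-banner-changed", Severity.INFORMATIONAL),
])
for alert in queue:
    print(alert.severity.name, alert.name)
```

Sorting on business impact before severity is a design choice for this sketch; an organization could just as reasonably fold business impact into the severity level itself.
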

This is an area where data mining, anomaly detection, and metric correlation are all helpful in keeping pending alerts in order of priority. These technologies use artificial intelligence/machine learning to provide added insight, benefit from prior history and current observations, and help organizations put their efforts into closing tickets capable of providing the most positive outcomes. Because some alerts truly threaten business losses or reputational damage, they must take priority over those posing lower risks. This makes constant, ongoing alert classification and prioritization a vital aspect in any good ITOM environment. Ultimately, this also lowers the overall cost of providing IT services and support, where saved time pays a double benefit in enabling staff to work on things capable of improving productivity, profitability, and efficiency.
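
To make the idea of using prior history concrete, here is a deliberately simple statistical stand-in for anomaly detection: a z-score test against past readings. It is not the machine learning approach an ITOM product would use, and the function name is_anomalous, the limit of three standard deviations, and the sample latencies are assumptions made for illustration.

```python
from statistics import mean, pstdev


def is_anomalous(history: list[float], current: float, z_limit: float = 3.0) -> bool:
    """Flag a reading that sits far outside what prior history suggests is normal."""
    if len(history) < 2:
        return False  # too little history to judge either way
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        # History is perfectly flat, so any change at all is a departure.
        return current != mu
    return abs(current - mu) / sigma > z_limit


# Usage with made-up response times in milliseconds.
past_latency_ms = [100.0, 102.0, 98.0, 101.0, 99.0]
print(is_anomalous(past_latency_ms, 101.0))
print(is_anomalous(past_latency_ms, 150.0))
```

A check like this can feed the prioritization step: readings that fall within the normal range need not generate a ticket at all, while clear outliers move up the queue.
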

Ensuring the Health of Your IT Infrastructure

Choosing which alerts to signal in an ITOM environment and how to handle them is essential for ensuring the health of your IT infrastructure and the productivity and efficiency of your operations. The same is true for how and when pending alerts get classified and prioritized. This produces inevitable cost savings and frees up IT resources for more proactive and innovative activities, helping IT teams avoid reacting to events as they occur. When you start researching monitoring and management tools, take a look at SolarWinds® Hybrid Cloud Observability. It can help you reduce alert fatigue, ensure the most urgent issues get the most immediate attention, and lower the overall cost of IT delivery.
Ed Tittel
Ed Tittel is a 30-plus year veteran of the IT industry who writes regularly about cloud computing, networking, security, and Windows topics. Perhaps best known…