Too Many Alerts Can Hurt Your Business

May 16, 2021

As the famed psychologist Abraham Maslow once said, “If all you have is a hammer, everything looks like a nail.”

IT monitoring and observability systems make it easy to create alerts for all kinds of events and conditions. It can be tempting to create alerts for all of them, but this can quickly lead to problems:

If you create a plethora of alerts for IT staff, they’ll spend the bulk of their time responding to them.

If IT staff must deal with mountains of alerts, how will they know which ones are important and which can be handled later?

An intelligent approach to alerts is crucial. Without it, IT monitoring and observability systems may inadvertently hinder the productivity of IT operations rather than enhancing it.

Choosing and Using Alerts Wisely

Organizations often apply a “more is better” mindset when creating alerts. IT teams should instead focus on a “small is beautiful” approach that reduces the number of notifications by grouping correlated alerts and automatically sifts through them to bring attention only to those alerts that indicate a real problem. Several considerations guide this strategy.

Accurate classification and prioritization: IT staff need guidance on what's important and the order in which they should tackle pending alerts. Anything threatening the organization’s ability to conduct business, collect money, or keep customers happy needs immediate attention. Other conditions or events—such as resource usage thresholds or slow response times—must be addressed, but probably not right away. From a cost management perspective, putting out expensive fires and limiting losses (both actual and potential) must guide the scheduling of alerts.
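As a rough illustration of this kind of prioritization, the Python sketch below sorts pending alerts by severity and then by estimated business impact. The severity levels, the Alert fields, and the dollar figures are all hypothetical and only meant to show the idea, not how any particular monitoring product works.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical severity levels; a lower value means more urgent."""
    CRITICAL = 0       # threatens revenue, payments, or customer-facing service
    SEVERE = 1
    ROUTINE = 2        # e.g. resource usage thresholds, slow response times
    INFORMATIONAL = 3


@dataclass
class Alert:
    name: str
    severity: Severity
    estimated_loss_per_hour: float  # assumed business-impact estimate


def prioritize(alerts):
    """Order alerts by severity, then by largest estimated loss first."""
    return sorted(alerts, key=lambda a: (a.severity, -a.estimated_loss_per_hour))


# Made-up pending queue for illustration.
pending = [
    Alert("disk 90% full", Severity.ROUTINE, 0.0),
    Alert("checkout API down", Severity.CRITICAL, 50_000.0),
    Alert("slow report page", Severity.ROUTINE, 200.0),
    Alert("payment gateway errors", Severity.CRITICAL, 80_000.0),
]

for alert in prioritize(pending):
    print(alert.severity.name, alert.name)
```

With this ordering, the two alerts that threaten the ability to collect money land at the top of the queue, and the disk threshold alert waits at the bottom.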

Remediation guidance and help: Automation should cover and resolve alerts where it can. In such cases, IT staff need not get involved unless automated resolution fails. Where automation can’t handle an alert entirely on its own, it should provide guidance based on prior history, known remediation or workaround techniques, or useful resolution strategies. This gives IT staff a jumpstart and makes sure they're heading in the right direction. Automation can also help by logically grouping related alerts. For example, not having enough memory in a server could cause an application to slow down, ultimately impacting network traffic and end-user experience. This may result in multiple alerts from multiple systems. Clustering all the alerts together and providing a chronological list of events in affected entities can go a long way in accelerating issue identification and resolution. In most studies of IT effectiveness and cost control, automation appears as the number one cause for improved efficiency, responsiveness, and — consequently — cost savings.
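The memory example above can be sketched in a few lines of Python. This is a minimal sketch, assuming a hand-written dependency map (DEPENDS_ON) and made-up entity names; a real system would more likely discover these relationships on its own. Alerts are grouped under the entity at the root of their dependency chain and listed in time order.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Alert:
    timestamp: int   # seconds since some reference point
    entity: str      # server, application, network segment, ...
    message: str


# Hypothetical dependency map: each entity points at what it depends on.
DEPENDS_ON = {
    "order-app": "server-01",
    "edge-network": "order-app",
}


def root_entity(entity):
    """Follow dependencies upward until reaching an entity with no parent."""
    seen = set()
    while entity in DEPENDS_ON and entity not in seen:
        seen.add(entity)  # guard against accidental cycles in the map
        entity = DEPENDS_ON[entity]
    return entity


def cluster(alerts):
    """Group alerts by shared root entity, each group in chronological order."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[root_entity(alert.entity)].append(alert)
    return {root: sorted(items, key=lambda a: a.timestamp)
            for root, items in groups.items()}


alerts = [
    Alert(120, "edge-network", "traffic backlog"),
    Alert(0, "server-01", "memory low"),
    Alert(60, "order-app", "slow responses"),
    Alert(30, "server-02", "fan speed warning"),
]

clusters = cluster(alerts)
for root, items in clusters.items():
    print(root, [a.message for a in items])
```

Here the three alerts that trace back to server-01 show up as one cluster with the memory alert first, which points staff at the likely cause rather than at three separate symptoms.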

Wherever possible, the management system should automate alert checks as part of the alerting process: Before issuing alerts, especially those requiring human intervention, it should eliminate as many false positives as possible from the pending alert queue. In general, anything capable of reducing the number of alerts without compromising performance, integrity, or security or adversely affecting key performance indicators (KPIs) is a good thing. It’s a matter of separating signal from noise so IT teams can stop wasting time (and money) on low-priority or unnecessary responses.
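One simple way to filter out false positives is to require that a condition persist across several consecutive checks before an alert fires. The sketch below assumes a hypothetical should_alert helper and an arbitrary default of three consecutive breaches; it is one possible technique, not a description of any specific tool.

```python
def should_alert(samples, threshold, required_consecutive=3):
    """Return True only if the most recent samples all exceed the threshold."""
    if len(samples) < required_consecutive:
        return False  # not enough data yet to confirm the condition
    recent = samples[-required_consecutive:]
    return all(value > threshold for value in recent)


# A brief CPU spike that clears on its own does not raise an alert...
transient_spike = [50, 95, 60, 55]
# ...but a sustained breach does.
sustained_load = [70, 92, 95, 97]

print(should_alert(transient_spike, threshold=90))
print(should_alert(sustained_load, threshold=90))
```

The trade-off is a short delay before a real problem is reported, so the number of required checks should be smaller for conditions that need immediate attention.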

Using a well-informed baseline to drive alerts is key: What may provoke an alert in one situation may be perfectly normal in another. For example, frequent and repeated use of administrative privileges to make and move copies of applications and data sets is normal during a system migration. Under any other circumstances, it should set red flags ablaze. In such a case, it might be wise to temporarily suspend these security alerts while migration is underway. Likewise, resource consumption and usage thresholds may need to be reset during peak or end-of-cycle seasons to account for higher levels of activity and use. By taking account of current and expected conditions, IT teams can focus their efforts where they’ll do the most good. Ultimately, this delivers better services at a lower overall cost.
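A temporary suspension like the migration case can be modeled as a time-boxed suppression window. The following is a minimal sketch with hypothetical names (SuppressionWindow, the "admin-bulk-copy" alert type) and made-up dates; the point is that the suspension expires on its own, so the alert returns once the migration is over.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class SuppressionWindow:
    alert_type: str
    start: datetime
    end: datetime
    reason: str

    def covers(self, alert_type, when):
        """True if this window applies to the alert type at the given time."""
        return alert_type == self.alert_type and self.start <= when < self.end


def is_suppressed(alert_type, when, windows):
    """Check whether any active window suppresses this alert."""
    return any(w.covers(alert_type, when) for w in windows)


# Hypothetical migration weekend during which bulk copies by admins are expected.
windows = [
    SuppressionWindow(
        alert_type="admin-bulk-copy",
        start=datetime(2021, 5, 15, 0, 0),
        end=datetime(2021, 5, 17, 0, 0),
        reason="planned system migration",
    )
]

during = datetime(2021, 5, 16, 10, 30)
after = datetime(2021, 5, 18, 9, 0)

print(is_suppressed("admin-bulk-copy", during, windows))  # inside the window
print(is_suppressed("admin-bulk-copy", after, windows))   # window has expired
```

Recording a reason with each window also leaves a trail explaining why a security alert was silent during that period.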

Above all, organizations want to avoid generating incidents for their own sake. This can’t help but lead to “alert fatigue,” causing IT staff to close tickets just to clear the backlog. It’s more important to make sure tickets get generated—and actions are taken in response—when something meaningful has happened. Anything else wastes time and effort, increasing overall service costs (or “ticket-closing costs”). A couple of examples should demonstrate how closing incidents may not always be desirable or represent a good investment of time and effort. Hopefully this raises the question: why issue so many incidents in the first place?

Example 1: Exceeding a Meaningless Threshold

In keeping with standard policy, an alert comes up showing one server disk is 90% full. An IT staffer is assigned, logs in, checks the disk space, and manages to recover a mere 2MB after 30 minutes of investigation and cleanup. She then determines that it's normal and acceptable for this disk to run at 90% capacity. She closes the incident and moves on to her next assignment. Unfortunately, this means the same incident gets generated and elicits the same response next month, and the month after, and… lather, rinse, and repeat forever.

A smart, anomaly-based alerting system would alleviate this situation by recognizing the 90% capacity as "normal" and ceasing to generate alerts at this baseline. If such a system is not deployed, the correct response here, after clearing the incident, would be to file a change request to inform management that the threshold for this disk is set too low. Establishing a higher threshold value that will prompt a meaningful response, such as 95% or 99%, might be useful. On the other hand, a request for a bigger disk might also be warranted. In general, anything IT can do to reduce the number of spurious alerts will help staffers focus on more important alerts, thereby improving efficiency and saving on overall IT costs.
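To make the idea of a learned baseline concrete, here is a minimal sketch that compares the current disk usage with its own recent history instead of a fixed 90% threshold. The function name, the three-sigma rule, and the two-point minimum band are assumptions chosen for illustration; production anomaly detection is typically more sophisticated.

```python
from statistics import mean, pstdev


def is_unusually_high(history, current, sigmas=3.0, min_band=2.0):
    """Flag a reading only if it rises well above the disk's own baseline.

    min_band (in percentage points) keeps a very steady history from
    turning tiny fluctuations into alerts.
    """
    baseline = mean(history)
    band = max(sigmas * pstdev(history), min_band)
    return current - baseline > band


# Made-up usage history (percent full) for a disk that normally sits near 90%.
history = [89.5, 90.0, 90.2, 89.8, 90.1, 90.4]

print(is_unusually_high(history, 90.3))  # within the normal range for this disk
print(is_unusually_high(history, 97.0))  # a real departure from the baseline
```

For this disk, a reading of about 90% no longer generates an incident, while a jump to 97% still does.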

Example 2: Lost in Alert Overload

Imagine a situation where an organization has created dozens of alerts for its IT staff to handle. These alerts are poorly prioritized and lack sufficient context for IT staffers to separate hair-on-fire situations from smoldering embers. A monitoring and observability system that can correlate related alerts from different aspects of the environment and provide context-based information can reduce the number of alerts and focus on what's important. Without a clear sense of which alerts are most dire, IT staffers waste too much time fixing things that don’t adversely affect the business in a significant way. Worse, these same IT staffers sometimes overlook or don’t get around to resolving tickets costing the organization time, money, and customers.

A proper IT operations management (ITOM) environment includes context with alerts so IT staff can understand the conditions, events, or issues reported within the bigger picture of what’s expected within the organization’s systems. This helps to make sense of how the alert diverges from the norm and what it could portend. Likewise, it’s essential to impose a classification scheme on alerts to help with their priority (e.g., critical, severe, routine, informational) so prioritization and focus naturally fall where they’re most needed. Figure 1 shows a common classification scheme used for fire danger alerts.

Figure 1: The level of intensity and potential danger goes up as levels of observed fire danger increase. It’s a good model for IT alerts, too.

This is an area where data mining, anomaly detection, and metric correlation are all helpful in keeping pending alerts in order of priority. These technologies use artificial intelligence (AI) and machine learning (ML) to provide added insight, benefit from prior history and current observations, and help organizations put their efforts into closing tickets capable of providing the most favorable outcomes.

Because some alerts truly threaten business losses or reputational damage, they must take priority over those posing lower risks. This makes constant, ongoing alert classification and prioritization vital to any good ITOM environment. Ultimately, this also lowers the overall cost of providing IT services and support, where saved time pays a double benefit in enabling staff to work on things capable of improving productivity, profitability, and efficiency.

Ensuring the Health of Your IT Infrastructure

Choosing which alerts to signal in an ITOM environment and how to handle them is essential for ensuring the health of your IT infrastructure and the productivity and efficiency of your operations. The same applies to how and when pending alerts get classified and prioritized. Done well, this saves costs and frees up IT resources for more proactive and innovative activities, so IT teams are no longer limited to reacting to events as they occur.

Researching monitoring and management tools? Take a look at SolarWinds Observability Self-Hosted (formerly known as Hybrid Cloud Observability). It can help you reduce alert fatigue, ensure the most urgent issues get the most immediate attention, and lower the overall cost of IT delivery.

Tags:

alerts

cost

it operations management

monitoring

Jason Tolu

Jason Tolu is a Principal Product Marketing Manager at SolarWinds and focuses on creating messaging, developing marketing strategy, and enabling sales for the SolarWinds full-stack…
