Observability Again? Oh, Yes.

I’m a bit late to the game in writing about observability, but I come with a great excuse: since March, I’ve travelled the world (well, at least four out of the seven continents) to discuss this observability thing with our Partners. Later, as we were able to disclose more details, we discussed it with customers, too. A lot’s happened in the past four months.

Did We Define Observability?

Let me tell you, observability can mean a lot of things, as each vendor in the field has their own interpretation. My first step was to open a translator to check what it means in German, and the result was “Beobachtbarkeit.” Yes, it’s a word. No, I don’t understand it. And I’m German, by the way. You’ll find a lot of useful information in the articles written by my esteemed colleagues Tom and Chrystal, and here’s my take.

By tradition, SolarWinds is a market leader in IT management, and I’d consider us the global standard in network monitoring. In IT management in general, we tend to collect a lot of information. In fact, “a lot” is a mild understatement. We collect thousands of metrics from all kinds of devices, applications, connections, desktops, user interactions, traces, queries, environmental sensors, and even ships. Yes, there are loads of ships out there sailing around with SolarWinds® stuff on board.

But what do we do with all this information? Though we’ve added loads of features over the years to make it easier to make sense of it, we’ve left it to the user to create meaningful dashboards, reports, and alerts. With observability, we’d like to change this.

Let’s Talk Alerts

Each IT management system comes with loads of alerts out of the box, and their sole purpose is to provide an idea of what data to collect and to warn teams when certain situations occur. What most users don’t really understand is that these out-of-the-box alerts are merely templates, or building blocks, designed to help teams create their own versions. Instead, many teams keep them active as they are. The result is often a management system spamming teams with dozens of unnecessary alerts in a single minute. Like, every minute. Unfortunately, it’s part of human nature to ignore repetitive information if it doesn’t concern us, and when a “real” alert arrives, we often ignore it too. A fire in the data center? Ah, surely a false alarm; that happens only in France.

Let me give you an example of how to make it better. Let’s say there’s a virtualization host, and its CPU utilization is around 90%. It sounds scary at first, but if everything else is working fine and this condition has been steady for several months, there’s nothing scary about it. In fact, it’s a smart utilization of resources and is most likely extremely energy efficient. No need for an alert.

But what if the utilization increases to 92% or even 95%? Is this scary? Well, wouldn’t it be great if a system could automatically check for the reason why it increased? If the check finds someone added two or three more VMs to the host, an increase in CPU utilization is expected behavior. Not scary. Not an alert. Just a notification to the IT service management (ITSM) system for automated change control.

But what if there’s a database running in circles and throwing deadlock after deadlock (the evil spinning loop kind)? Scary? Oh, yes. So what should the system do? Send an alert to the responsible team (the DBAs, not the whole IT team) with statistics about the victim and the survivor of the deadlock and all necessary supporting information, so the resolver group doesn’t have to waste time investigating. They see the situation, some background stats, and suggested solutions, and they know what actions are required.

The situation above fits into our interpretation of what an observability solution should do. We aim to reduce alert fatigue with the help of artificial intelligence (AI), and I just gave you a real-world example of AIOps.
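
To make the decision path above a little more concrete, here’s a minimal Python sketch. It is not how any SolarWinds product works internally; `HostSnapshot`, `triage`, the thresholds, and the team labels are all made up for illustration, and a real system would learn its baseline rather than average a handful of samples.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class HostSnapshot:
    """Hypothetical view of one virtualization host (illustrative fields only)."""
    cpu_history: list            # past CPU utilization samples, in percent
    cpu_now: float               # current CPU utilization, in percent
    vms_added: int = 0           # VMs added since the history was recorded
    deadlocks_per_min: float = 0.0


def triage(snapshot, tolerance=1.5, deadlock_limit=1.0):
    """Return (action, recipient) for a snapshot.

    tolerance and deadlock_limit are assumed values, not vendor defaults.
    """
    baseline = mean(snapshot.cpu_history)

    # Steady state: high but unchanged utilization is not worth an alert.
    if snapshot.cpu_now - baseline <= tolerance:
        return ("none", None)

    # Utilization rose and the database is deadlocking: alert the DBAs only.
    if snapshot.deadlocks_per_min >= deadlock_limit:
        return ("alert", "dba-team")

    # Utilization rose because VMs were added: expected, so just log the change.
    if snapshot.vms_added > 0:
        return ("notification", "itsm")

    # Assumption beyond the article: an unexplained rise goes to the host owners.
    return ("alert", "virtualization-team")


history = [90, 90, 91, 89]
print(triage(HostSnapshot(history, 90)))                        # steady at ~90%
print(triage(HostSnapshot(history, 95, vms_added=3)))           # new VMs
print(triage(HostSnapshot(history, 95, deadlocks_per_min=12)))  # deadlock storm
```

The point of the sketch is the ordering of the checks: compare against what’s normal for this host first, then look for a cause, and only then decide who (if anyone) needs to hear about it.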

Next: What Does It Mean for the Business?

I already mentioned our customers use our tools to collect tons of data points. One of the many challenges is to make sense of them. Let’s say we’re monitoring 2,000 CPU cores. 1,900 run distributed business applications, and the remaining 100 run the Minecraft server for the IT team. So basically all of them are in production. Alongside them, there are a couple of redundant links between offices; maybe there are a few SD-WAN routers in place already. There’s no question: they’re important to the business. Twenty-something database instances hold customer data, inventory, internal information for HR, maybe even my PTO requests. All good and valuable information for the business.

But what is the business, exactly? And what counts as a key performance indicator (KPI) and not just a random metric? If my business is a hotel, the primary KPI would be how many rooms are booked per day. If my business is selling oranges, I’m interested in how many boxes left the factory during the last hour.

The challenge for an IT management system is obvious: it doesn’t know what the business is, so it can’t know your very important KPI. But with a little machine learning, we could probably figure out what values a user looks at, how often they look at them, and in what order. This could help when suggesting a specific dashboard or, even better, an automated report.
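
As a rough illustration of the “what does the user look at, and how often” idea, here’s a tiny Python sketch. It uses plain counting as a stand-in for machine learning, and the `view_log` format, the metric names, and the `suggest_dashboard` function are hypothetical, not part of any real product.

```python
from collections import Counter


def suggest_dashboard(view_log, top_n=3):
    """Return the top_n most frequently viewed metrics.

    view_log is an assumed format: a list of (user, metric_name) tuples,
    one entry per time a user opened that metric.
    """
    counts = Counter(metric for _user, metric in view_log)
    return [metric for metric, _count in counts.most_common(top_n)]


# Made-up viewing history for a hotel business.
view_log = [
    ("anna", "rooms_booked"),
    ("anna", "cpu_load"),
    ("ben", "rooms_booked"),
    ("ben", "rooms_booked"),
    ("ben", "wan_latency"),
    ("anna", "cpu_load"),
]

print(suggest_dashboard(view_log, top_n=2))
```

Here the metric people open most often floats to the top, which is a reasonable first guess at the KPI. Taking viewing order into account, as described above, would need the timestamps as well.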

A Little Science Fiction

The next step for such a system would be to autonomously correlate various metrics (like our CPUs from above) and interpret the results so the user, all the way up to the C-level, can see their KPI. If there’s an anomaly, the IT teams will see the visual impact of missing CPU cycles on the key performance indicator. Yes, this situation still sounds like science fiction, but for how much longer? Let me tell you: it also fits into our vision of observability, and it’s something businesses out there urgently need, given the increased complexity of IT and the ongoing fight against service-level agreements (SLAs).

Observability is the next logical step for us and is built on our experience of more than two decades of IT management. Tom explained it nicely in his article. Right now, there aren’t many visible changes between the new SolarWinds Hybrid Cloud Observability solution and the well-established Orion® Platform, but Chrystal explained a lot of what’s happened in the background and how these changes will lay out the path for the future.

Let me finish with the words of our CEO: “At SolarWinds, our purpose is to enrich the lives of the people we serve.” And this is exactly what’s going to happen.
Sascha Giese
Sascha Giese holds various technical certifications, including being a Cisco Certified Network Associate (CCNA), Cisco Certified Design Associate (CCDA), Microsoft Certified Solutions Associate (MCSA), VMware…