What’s New in Network Monitoring
Evolution means change, and nothing changes as fast as technology. This means evolution is also part of IT, obviously.
In classic IT networks, the pace of evolution is slower than in, for example, applications or data storage. Still, on top of ever-faster connections such as 400G in the core, quantum leaps such as network virtualization/SDN are very much a topic of the present.
In addition to evolution, complexity also arises when multi-cloud environments virtually pull the rug out from under the traditional network.
To keep up with these requirements, network management and monitoring must also develop continuously.
“A Brief History of Network Monitoring” Isn’t a Particularly Exciting Topic
There’s definitely an evolution. In retrospect, we’ve moved from “Hello, are you still here?” ICMP packets to KPI gathering through standards like SNMP and WMI, threw in asynchronous messages like Syslog, and tried to somehow put the whole thing into perspective.
Logic came into play (routing loop discovered: were configurations changed?), automation was added (restore the previous configuration), and support followed for more dynamic, more timely information and actions.
One of the current challenges is that more and more areas in IT are converging. This is a challenge for the user, the IT pro, but also for the tools.
Even though in many traditional “old-school” companies, especially in the public sector, the maintenance of Java 6 applications is a reality and requires life-sustaining measures, we’re already a little further along in other areas.
With “new business,” it’s irrelevant where an application is provided from and where or how it’s accessed. These companies are DevOps-driven and, to exaggerate a little, developers just throw code onto some platform and have little or no knowledge of topics such as L3 routing, load balancing, or even security.
This is where trained network admins could help, but the question is how, if they don’t know what data is sent from what source to what destination, how users access the data, or what it will look like in an hour.
In security, when using clouds, one speaks of “shared responsibility.” This should also apply to the area of network administration. You can’t expect the network administrator to manage the entire cloud infrastructure, but an understanding of the quirks of the cloud provider on topics such as site-to-site connections is a requirement.
Besides, it helps if the same applications that are proven and trusted in the traditional on-prem network can be used for controlling and monitoring cloud connectivity. On the one hand, there’s no need to learn yet another tool; on the other hand, troubleshooting improves because relationships are displayed automatically.
Modern Problems Require Modern Solutions
This proverb couldn’t be truer!
Modern solutions increasingly use APIs to communicate with devices or environments, to stay with the cloud example. Both Azure and AWS offer different ways to communicate, but ideally, the network administrator doesn’t need to delve deeper into this—the monitoring tools do it all by themselves.
The same tools also correlate information from different layers.
At first glance, the network professional might not care if a switch moves packets for an email server. But if there’s a problem with the application somewhere, what do you hear first? “The network is slow.”
In most cases, this statement is false, and we know that, but we have to prove it.
There are several ways a monitoring solution can create application awareness. Traditionally, this has been done with a flow export protocol such as NetFlow or the IPFIX standard derived from it, but this comes with limitations. More elegant is deep packet inspection, where applications are identified by the contents and patterns of the data packets themselves.
One step further, application information is taken directly from the servers and dependencies are detected automatically. For example, an agent, either on-prem or in the cloud, detects that a local web server process is listening on a port and gathers information about the external IP addresses sending or requesting data. The same kind of agent sits on the external machine and discovers a database feeding the website.
Putting these records together in a diagram isn’t rocket science, but it allows you to see at a glance whether a problem is caused by the web server, the database, or the connection between them.
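To make the idea a little more concrete, here is a minimal Python sketch of how such agent records could be joined into dependency edges. The report format, field names, and addresses are all hypothetical and invented for illustration; real agents will report something different.

```python
# Hypothetical agent reports: each agent lists the ports its local processes
# listen on and the outbound connections those processes make.
reports = [
    {
        "host": "10.0.0.10",
        "listening": [{"process": "nginx", "port": 443}],
        "outbound": [{"process": "nginx", "dst_ip": "10.0.1.20", "dst_port": 5432}],
    },
    {
        "host": "10.0.1.20",
        "listening": [{"process": "postgres", "port": 5432}],
        "outbound": [],
    },
]


def build_dependency_edges(reports):
    """Return (client, server) pairs by matching connections to listeners."""
    # Index every listening socket by (host, port) so that a connection
    # destination can be resolved to the process serving it.
    listeners = {}
    for report in reports:
        for sock in report["listening"]:
            listeners[(report["host"], sock["port"])] = sock["process"]

    edges = []
    for report in reports:
        for conn in report["outbound"]:
            target = listeners.get((conn["dst_ip"], conn["dst_port"]))
            if target is None:
                continue  # no agent on the destination, so leave it off the map
            client = f'{conn["process"]}@{report["host"]}'
            server = f'{target}@{conn["dst_ip"]}'
            edges.append((client, server))
    return edges


edges = build_dependency_edges(reports)
for client, server in edges:
    print(f"{client} depends on {server}")
```

Each edge is one arrow in the diagram: the web server process on one machine depends on the database process on the other.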
Regardless of the root cause of a problem, and after answering the important question of whose problem it is, the troubleshooting process starts.
This is where various IT teams come together again, because troubleshooting is a generic process.
Tickets are created, documented, and updated. After providing “first aid” (it’s running again) comes root cause analysis and, most importantly, ensuring the situation does not happen again.
If future delivery cannot be guaranteed one hundred percent because too many moving parts are out of your control, it’s crucial to enter the steps for solving the problem into a knowledge database to minimize the mean time to resolve (MTTR).
Again, a modern solution that communicates with one or more IT service management (ITSM) systems, or maybe even fully integrates with them, helps a lot.
As soon as a situation occurs, the system reaches out to the responsible team and automatically creates a ticket containing all the required information. For complex problems, an ITSM platform is a perfect place to involve different IT teams without losing the overview.
The same tool could simplify reporting and is a great place for a knowledge database.
What Will the Future Bring?
Right now, too many KPIs are collected too frequently and kept for too long.
This is due to the underlying complexity: the data points are needed to show correlation and to make accurate statements without guessing.
We’re at the stage where higher-level telemetry data is being collected to say, for example, “This problem has occurred four times in the last six months, and the average time from detection to ticket closure is three hours.”
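A statement like that can be derived from plain ticket records. The following Python sketch assumes a hypothetical ticket format (a problem label plus detection and closure timestamps) and made-up sample data; it is not tied to any particular ITSM product.

```python
from datetime import datetime, timedelta


def incident_summary(tickets, problem, now, window_days=182):
    """Count closed tickets for one problem in the window and average their duration."""
    cutoff = now - timedelta(days=window_days)
    matching = [
        t for t in tickets
        if t["problem"] == problem
        and t["closed"] is not None
        and t["detected"] >= cutoff
    ]
    if not matching:
        return 0, None
    total = sum((t["closed"] - t["detected"] for t in matching), timedelta())
    return len(matching), total / len(matching)


def ticket(problem, detected, hours):
    """Helper for the sample data: a ticket closed `hours` after detection."""
    return {"problem": problem, "detected": detected,
            "closed": detected + timedelta(hours=hours)}


# Made-up sample data, for illustration only.
tickets = [
    ticket("routing-loop", datetime(2023, 11, 5, 9, 0), 6),   # outside the window
    ticket("routing-loop", datetime(2024, 2, 1, 8, 0), 2),
    ticket("routing-loop", datetime(2024, 3, 15, 14, 0), 3),
    ticket("routing-loop", datetime(2024, 5, 2, 10, 0), 4),
    ticket("routing-loop", datetime(2024, 6, 20, 16, 0), 3),
    ticket("disk-full", datetime(2024, 4, 1, 7, 0), 1),       # different problem
]

count, average = incident_summary(tickets, "routing-loop", now=datetime(2024, 7, 1))
print(f"{count} occurrences, average time to closure: {average}")
```

The six-month window is approximated as 182 days here; a real report would more likely use calendar months.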
In the future, the data collected can be used to calculate the probability of whether and when the next fault will occur. Marketing departments insert the keywords machine learning and artificial intelligence here.
Likewise, we no longer need dashboards with hundreds of diagrams. As long as the data throughput of a device is below an automatically calculated threshold, the actual amount of data is only interesting for reporting.
As long as a switch moves its packets, it’s a good boy, and details don’t matter.
The monitoring solution could therefore simply display a large green sphere and automatically switch to the relevant red sub-element in case of a problem.
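As a rough illustration of that idea, the sketch below derives a per-device threshold from recent history and rolls everything up into a single status. The choice of mean plus three standard deviations is purely an assumption for this example, as are the device names and the numbers; real products use their own baselining methods.

```python
from statistics import mean, stdev


def dynamic_threshold(history, k=3.0):
    """Assumed baseline rule: mean of recent samples plus k standard deviations."""
    return mean(history) + k * stdev(history)


def rollup(devices, k=3.0):
    """Return ("green", []) if all devices are below threshold, else ("red", offenders)."""
    offenders = [
        name for name, data in devices.items()
        if data["current"] > dynamic_threshold(data["history"], k)
    ]
    return ("red", offenders) if offenders else ("green", [])


# Hypothetical throughput samples (for example, percent utilization).
devices = {
    "core-sw1": {"history": [40, 42, 41, 39, 43], "current": 41},
    "edge-sw7": {"history": [10, 12, 11, 9, 13], "current": 55},
}

status, offenders = rollup(devices)
print(status, offenders)
```

Only the offending sub-elements are surfaced; everything else stays hidden behind the single green or red status and remains available for reporting.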
So, you know exactly when to put down your nice cup of coffee!