I remember the simpler days. Back when our infrastructure all lived in one place, usually just in one room, and path monitoring could be as simple as walking in to see if everything looked OK. Today’s environments are very different, with our infrastructure being distributed all over the planet and much of it not even being something you can touch with your hands. So, with ever increasing levels of abstractions introduced by virtualization, cloud infrastructure, and overlays, how do you really know that everything you’re running is performing the way you need it to? In networking this can be a big challenge as we often solve technical challenges by abstracting the physical path from the routing and forwarding logic. Sometimes we do this multiple times, with overlays, existing within overlays, that all run over the same underlay. How do you maintain visibility when your network infrastructure is a bunch of abstractions? It’s a challenge, but I have a few tips that should help if you find yourself in this situation.
Know Your Underlay
While all the fancy and interesting stuff is happening in the overlay, your underlay acts much like the foundation of a house. If it isn’t solid, there's no hope for everything built on top of it to run the way you want it to. Traditionally this has been done with polling and traps, but the networking world is evolving, and newer systems are enabling real-time information gathering (streaming telemetry). Collecting both old and new styles of telemetry information and looking for anomalies will give you a picture of the performance of the individual components that comprise your physical infrastructure. Problems in the underlay effect everything, so this should be the first step you take, and the one you're most likely familiar with, to ensure your operations run smoothly.
Monitor Reality
Polling and traps are good tools, but they don’t tell us everything we really need to know. Discarded frames and interface errors may give us concrete proof of an issue, but they give no context to how that issue is affecting the services running on your network. Additionally, with more and more services moving to IaaS and SaaS, you don’t necessarily have access to operational data on third-party devices. Synthetic transactions are the key here. While it may sound obvious to server administrators, it might be a bit foreign for network practitioners. Monitor the very things your users are trying to do. Are you supporting a web application? Regularly send an HTTP request to the site and measure response time to completion. Measure the amount of data that is returned. Look for web server status codes and anomalies in that transaction. Do the same for database systems, and collaboration systems, and file servers… You get the idea. This is the proverbial canary in a coal mine that lets you know something is up before the users end up standing next to your desk. The reality is that network problems ultimately manifest themselves as system issues to the end users, so you can’t ignore this component of your network.
Numbers Can Lie
One of the unwritten rules of visibility tools is to use the IP address, not the DNS name, in setting up pollers and monitoring. I mean, we’re networkers, right? IP addresses are the real source of truth when it comes to path selection and performance. While there is some level of wisdom to this, it omits part of the bigger picture and can lead you astray. Administrators may regularly use IP addresses to connect to and use the systems we run, but that is rarely true for our users, and DNS often is a contributing cause to outages and performance issues. Speaking again to services that reside far outside our physical premises, the DNS picture can get even more complicated depending on the perspective and path you are using to access those services. Keep that in mind and use synthetic transactions to query your significant name entries, but also set up some pollers that use the DNS system to resolve the address of target hosts to ensure both name resolution and direct IP traffic are seeing similar performance characteristics.
Perspective Matters
It’s always been true, but where you test from is often just as important as what you test. Traditionally our polling systems are centrally located and close to the things they monitor. Proverbially they act as the administrator walking into the room to check on things, except they just live there all the time. This design makes a lot of sense in a hub style design, where many offices may come back to a handful of regional hubs for computing resources. But, yet again, cloud infrastructure is changing this in a big way. Many organizations offload internet traffic at the branch, meaning access to some of your resources may be happening over the internet and some may be happening over your WAN. If this is the case, it makes way more sense to be monitoring from the user’s perspective, rather than from your data centers. There are some neat tools out there to place small and inexpensive sensors all over your network, giving you the opportunity to see the network through many different perspectives and giving you a broader view on network performance.
Final Thoughts
While the tactics and targets may change over time, the same rules that have always applied, still apply. Our visibility systems are only as good as the foresight we have into what could possibly go wrong. With virtualization, cloud, and abstraction playing larger roles in our network, it’s more important than ever to have a clear picture of what it is in our infrastructure that we should be looking for. Abstraction reduces the complexity presented to the end user but, in the end, all it is doing is hiding the complexity that’s always existed. Typically, abstraction increases overall system complexity. And as our networks and system become ever more complex, it takes more thought and insight into “how things really work” to make sure we are looking for the right things, in the right places, to confidently know the state of the systems we are responsible for.