Whether you’re new to monitoring or trying to show someone the ropes, sometimes the hardest part of understanding the technology is figuring out what is actually going on behind all the graphs and buttons. Head Geek Leon Adato and Product Manager Chris O’Brien talk through the details of monitoring and separate the hype from the hypervisor. Leon and Chris talk about several standard monitoring techniques and protocols, how each one works, and real-world use cases.
Welcome to SolarWinds Lab. I know we usually come on strong with the latest and greatest new features or cool monitoring tricks, but today, we are going back to the basics.
I don’t know if I can actually not talk about what our products can do.
Come on! Before you were this great and mighty techno-evangelista that you are today, you were a network engineer. So, you know, the thing I love about lab is that we can really get our geek on. And I know that the audience is going to always come with us, but I always wonder, what about the folks who are kind of new to this? What are they thinking about this whole thing?
Yeah, I mean, we run into that person all the time at conventions, user groups, and when we sit down with new customers. Right, so this episode will help you if you’re just getting started in monitoring or if you’ve been doing it for a while, but you might have missed some of the fundamentals along the way. Yeah, it’s also good for people who are interested in monitoring, even if they won’t be using or managing monitoring themselves. For example, your manager or people on other teams who are requesting monitoring, but they don’t know how it works exactly. So, where do you want to start?
It depends. When you peel away all the shiny graphics and awesome automation, monitoring has always come down to the same set of techniques and protocols. So why don’t we start with that?
Okay, good plan.
Hi, I’m Leon Adato.
And I’m Chris O’Brien. Welcome to SolarWinds Lab. We’re just going to spend the next few minutes or so talking about what monitoring is, the basics of how it works, and what it really means to you in a real-world use case. As always, when you join us live, you can ask us questions in that chat box over there. Yeah. That’s because—now, if you don’t see that chat box, that’s because you aren’t watching us live. And to do that, you want to go to www.lab.solarwinds.com and register for updates on new episodes, as well as tell us what you would like to see in future episodes. I’d also like to mention that everything we’re going to talk about today, you can find in an eBook called “Monitoring 101” that we’ve created just for this.
So scribbling notes not required.
Right, exactly. So you can just sit back and listen, and let the wisdom of monitoring glory just wash over you.
So, I think any conversation about monitoring basics has to start with…
Can’t get much more basic than Ping.
Good old Ping.
Most people who’ve been with a computer for more than—I don’t know—15 minutes probably are used to it. You go to a DOS prompt. Yes, it’s called a DOS prompt. Gosh-darn youngsters! And you type ping and an IP address, or a machine name, and it will go out. And what is it really doing here? I mean, what’s—
It’s sending a packet to the device to see if that device will respond, and it times how long it takes for that response to get back.
Right. So the important thing is not just it’s there, but also how many replies, because sometimes you can get one reply, one drop, one reply, one drop.
So you repeat and get more data.
Right, and then also, how fast that response time is. Now by itself, it’s like, is it there? Okay, great. And lots of people start their monitoring experience by writing their own little ping utility or script or what have you. But what I want to do is I want to show sort of the result of ping. If you take a look here, what you see is this graph here. And I want to specify that we are being tool agnostic. It’s just that the tools that we happen to have in our toolbox are some SolarWinds tools.
Yeah, we wanted to graph ping, and this is how we graph ping.
Right, almost every monitoring tool on the planet probably has some sort of ping capability. In any case, when you collect the ping information and you store it in a database, what you get the ability to do is, you get to track it over time now. So you can see not just that it responded at these points in time but also how quickly or slowly it responded, as well.
Yep, and when that changed, which is a critical piece of information telling you about how that infrastructure is doing, how the end node is doing, plus how the path to get there is doing.
Right. And there’s a couple of interesting use cases. In one situation that I was in a couple years ago, we had a 10 Mb circuit. But the provider had only configured half that much speed, and so what we were seeing was exactly one packet out of every two was dropping. So this term would say, “It’s there. It’s gone. It’s there. It’s gone. It’s there. It’s gone.”
Yeah, so packet loss detection, one of the key use cases for ping.
Right. In addition to the normal stuff, which would be, you know, is it there? When did the outage start? This stopped responding, say, three minutes before we detected the application was down.
Yeah, and speaking of, is it there and when exactly that timing occurred, ping is really light. So it’s one of the ones where you can send it constantly and have great granularity as, this happened at 3:06 pm rather than 3:30 or 4:00.
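A minimal sketch of what a monitoring tool might do with ping results before storing them: count replies and drops, and track response time. The function name and sample values are hypothetical, and a real tool would get the round-trip times from actual ICMP probes rather than a hard-coded list.

```python
from typing import Optional, Sequence


def summarize_pings(rtts_ms: Sequence[Optional[float]]) -> dict:
    """Summarize one polling window; None marks a probe that got no reply."""
    sent = len(rtts_ms)
    replies = [rtt for rtt in rtts_ms if rtt is not None]
    lost = sent - len(replies)
    return {
        "sent": sent,
        "received": len(replies),
        "loss_pct": 100.0 * lost / sent if sent else 0.0,
        "avg_rtt_ms": sum(replies) / len(replies) if replies else None,
        "max_rtt_ms": max(replies) if replies else None,
    }


# Hypothetical window: reply, drop, reply, drop, like the half-provisioned circuit.
window = [12.0, None, 14.0, None, 13.0, None]
summary = summarize_pings(window)
print(summary["loss_pct"])    # 50.0
print(summary["avg_rtt_ms"])  # 13.0
```

Storing one summary like this per polling interval is what makes the over-time graph possible.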
Right, exactly. So I think the next stop we have to hit is SNMP.
So SNMP has a particular structure and my friend, Steve Clausen, actually gave a really good example of that. So I wanted to sort of bring that down. But all props go to him. So SNMP pulls data out of systems, but it has a very specific hierarchical structure. So, on the box is an SNMP agent, which is collecting data and putting it in different buckets. And what you’re normally going to see with SNMP is this number: 1.3.6.1.4.1.5, or something like that. And a lot of people get overwhelmed by that. It’s sort of like IP addresses on steroids. Like, “I’m never going to remember that!” And you don’t have to remember that. But the way that Steve explained it was when you’re looking at an SNMP list or hierarchy, what it’s really doing is telling you how to get to my house. So if you look at the screen here, you can see that what it’s saying is, take the first left, number one, at iso. Okay, then take the third left at org. Take the sixth left at dod. Take the first left at internet. Take, let’s say, the second left at management. It’s going to take a minute. Take the first left at MIB. And so on and so forth. Until you finally get down to the actual data point, which could be CPU, or name, or whatever. So, when you do an SNMP thing, you’re going to be using this number. And that’s what that’s going to be referring to. And this is called a MIB, a management information base.
Yeah, and we can see that number constructed right there on the screen.
We took 1.3.6.1.2.1, and that’s how you got to the information you need to get to.
Whatever it is. Exactly.
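The "directions to my house" idea can be sketched as a lookup through a nested tree. This is only an illustration: the dictionary holds just the one branch described above (iso, org, dod, internet, mgmt, mib-2), and the function name is hypothetical. A real tool would load these names from MIB files.

```python
# Only the branch walked through above; a real MIB tree has many more entries.
OID_TREE = {
    1: ("iso", {
        3: ("org", {
            6: ("dod", {
                1: ("internet", {
                    2: ("mgmt", {
                        1: ("mib-2", {}),
                    }),
                }),
            }),
        }),
    }),
}


def describe_oid(oid: str) -> str:
    """Translate each numeric turn into its name where this small tree knows it."""
    names = []
    level = OID_TREE
    for part in oid.strip(".").split("."):
        number = int(part)
        if number in level:
            name, level = level[number]
            names.append(f"{name}({number})")
        else:
            names.append(str(number))  # unknown to this tree, keep the number
            level = {}
    return ".".join(names)


print(describe_oid("1.3.6.1.2.1"))
# iso(1).org(3).dod(6).internet(1).mgmt(2).mib-2(1)
```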
That’s the structure of it, but there’s two very specific things you can do with SNMP. There’s polling and there’s trap. So, let’s talk a little bit about polling.
Yep, okay. So, there’s a number of different commands for how you get that SNMP data off that infrastructure component you’re trying to get data about. But the one most people care about is SNMPGET.
Right. And it’s pretty simple, right? You do snmpget, that number, and the box should respond. So, for example…
Plus a really crappy password.
A really bad password. A really bad open text password. We won’t get into the SNMP versions or anything like that. So an SNMP command might look something like this, snmpget, version 1—that’s SNMP version one—the community string is public because of course it is.
Very impressive, very secure.
Yes, I know. Then you have things that are a little bit more recognizable: the 10.199.4.3. And then, notice, I didn’t use a number. I used the name because you can actually, if you go back here, I can actually use the shorthand of whatever it is that I’m going, for example, if I’m going to system and I want the object ID and I want, let’s say, cold start or whatever. So instead of saying…
…You’ve got to use the numbers man.
I know, I know.
You’ve got to use the numbers.
So 1.3.6.1.2. whatever, whatever, whatever. And that can be called RAD (RAD!) ISM-900:coldStart. You could do that. So I did, because this is a pretty standard one. The sys description, system description dot zero. And the system, this IP address, responded with the name of the system and what kind of machine it is and so on.
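As a rough sketch, the command described above can be assembled like this. It assumes the Net-SNMP `snmpget` command-line tool is what is being used; the helper name is hypothetical, and the code only builds and prints the command rather than running it, since running it needs a reachable device.

```python
import shlex
from typing import List

# sysDescr.0 is the name; this is the same object written as numbers.
SYS_DESCR_NUMERIC = "1.3.6.1.2.1.1.1.0"


def build_snmpget(host: str, oid: str, community: str = "public",
                  version: str = "1") -> List[str]:
    """Build a Net-SNMP style snmpget command line (hypothetical helper)."""
    return ["snmpget", "-v", version, "-c", community, host, oid]


cmd = build_snmpget("10.199.4.3", "sysDescr.0")
print(shlex.join(cmd))  # snmpget -v 1 -c public 10.199.4.3 sysDescr.0

# To actually query a device, assuming snmpget is installed:
# import subprocess
# result = subprocess.run(cmd, capture_output=True, text=True, timeout=5)
```

Swapping `"sysDescr.0"` for `SYS_DESCR_NUMERIC` asks for the same object by number instead of by name.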
And very, very snappy. Right?
Right, and that’s the other thing. Just like with ping, SNMP is very, very tight. It’s very small. It’s very efficient. And so it’s nice to use. Almost every device, from your network devices to your servers, to your coffeepot, to your Internet of Things thing, or whatever, will support SNMP at some level.
Yep. It’s most well known for networking devices. That’s where it’s traditionally enabled by default for a bunch of network gear. In Windows, you add your own executable to get that functionality. But very ubiquitous.
Right. Now the one problem is that you can imagine on a server, for example, you’re going to collect a lot of different things. It’s got six CPUs and it’s got 10 different memory counters that I want to get. Am I really doing an SNMP get this and then get that and then get this and get that?
That’s why they introduced the GETBULK command. So you can get a whole bunch of these things at the same time. Another option is SNMPWalk. What SNMPWalk is about is really going to the first OID and asking for that OID, and then asking for what the next one is. So the thought process there is: I don’t know everything that is specified on that piece of equipment, what information is available where. So I’m going to ask for the first piece, which is there by standard, and then I’m going to ask the device to give me whatever’s next.
And by walking through that whole thing, we can get all of the data and have that all available to you.
So, on the screen, I have a SNMPWalk utility. Here’s the IP address that I’m looking for. Again, my very secure community string, scan, and it’s going to go through a significant number, about 1,310,000 object IDs. And what you end up with is— there we go. So there’s your walk. It starts at the first number that responds. And you go down through and you get the IDs. So a lot of times a system, a monitoring tool, will do that once to a machine to find out which things it responds on, and then it will simply record that in the database, and only use the correct ones from that point forward.
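Here is a small simulation of that walk-then-get pattern. Nothing here talks to a real device: `FakeAgent` is a hypothetical stand-in for the SNMP agent, and its values are made up. The point is only the loop: ask for whatever comes next until you leave the subtree, then remember what answered.

```python
from bisect import bisect_right
from typing import Dict, Iterator, Optional, Tuple

Oid = Tuple[int, ...]


class FakeAgent:
    """Stand-in for a device's SNMP agent; the values are made up."""

    def __init__(self, data: Dict[Oid, str]) -> None:
        self._data = data
        self._oids = sorted(data)  # tuples of ints sort in OID order

    def get(self, oid: Oid) -> Optional[str]:
        return self._data.get(oid)

    def get_next(self, oid: Oid) -> Optional[Tuple[Oid, str]]:
        index = bisect_right(self._oids, oid)
        if index == len(self._oids):
            return None  # nothing after this OID
        nxt = self._oids[index]
        return nxt, self._data[nxt]


def walk(agent: FakeAgent, root: Oid) -> Iterator[Tuple[Oid, str]]:
    """Keep asking for the next OID until we fall out of the root's subtree."""
    current = root
    while True:
        result = agent.get_next(current)
        if result is None:
            return
        current, value = result
        if current[: len(root)] != root:
            return
        yield current, value


agent = FakeAgent({
    (1, 3, 6, 1, 2, 1, 1, 1, 0): "Hypothetical Router OS 1.0",  # sysDescr.0
    (1, 3, 6, 1, 2, 1, 1, 3, 0): "123456",                      # sysUpTime.0
    (1, 3, 6, 1, 2, 1, 1, 5, 0): "edge-router-01",              # sysName.0
    (1, 3, 6, 1, 2, 1, 2, 1, 0): "4",                           # ifNumber.0
})

# Walk once to learn what the device answers on...
discovered = [oid for oid, _ in walk(agent, (1, 3, 6, 1, 2, 1, 1))]
# ...then use plain gets on just those OIDs from then on.
for oid in discovered:
    print(".".join(map(str, oid)), "=", agent.get(oid))
```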
Yeah, clearly SNMPWalk is a whole lot heavier than a simple SNMPGET. So SNMPWalk to understand it, and then SNMPGET to be efficient.
Very good. So that’s getting. That’s when I want to know CPU, and I ask it every five minutes, and I get that stuff. I store it on the database. But what about traps?
Yeah, so I mean, if you think about how you’re getting something every five minutes, what happened in between? Sometimes you have specific events where you want notification right away. You want triggered notification. You don’t want to just poll that information where you’re waiting for the data that you need. So they introduced SNMP TRAPs. SNMP TRAPs are sort of the reverse. So rather than your management entity polling the SNMP agent, which is on your router or whatever it is, the router itself will send information to the management entity when that information becomes available. So an interface goes down, you know about it right away.
Exactly, yeah. And it’s just in time information. That’s the good part. So you don’t have to keep asking, “Did you just restart or anything like that?” The bad part is it’s not guaranteed delivery.
Right, so unfortunately, if the system goes down fast and it tried to send a message, but it couldn’t just do it, then of course you’re not going to get that. But there’s enough other evidence, things that you can pick up. It starts back up and says, “Oh by the way, I had a cold start. I didn’t expect to be shut down a minute ago.” And so you can tell that things happened after the fact.
Yeah, and as with most other monitoring technologies, really, what you want to do is balance the two together. So SNMP TRAP may send you a notification that an interface is down, but if that’s the uplink to the rest of your network, maybe that SNMP TRAP doesn’t ever go out, doesn’t ever reach you. So then you’re relying on your other tools like ping or SNMPGET to tell you that they have lost connectivity. Okay, so SNMP TRAP is one way to send triggered information, but if you’ve got a lot of information like log files, the next logical step is to use syslog. So syslog, again, is triggered, but lends itself to sending a whole lot more data. A whole lot more in terms of many, many messages or the length of the messages?
Both of those things.
Okay. And the other thing that I like about syslog is that it’s very freeform. You know, we talked a lot about how SNMP has this very hierarchical structure. And, oh, by the way, if you were looking for a piece of data and it’s not in the MIB, you’re out of luck. You can’t just make it up. But with syslog, there are actually clients you can install on your boxes. Now if you have a Linux or UNIX box, it’s built in. If you have a Windows box, you need a client. But you can actually say when such and such happens on the box, you can trigger a syslog message. And it will say anything you want it to do.
Yep. And even easier for network devices, you tend to just go into the network device and say, “Point your syslog over to this server,” whatever your server is, and then you’re getting all the data.
Right. And for network devices, you can’t really customize them, although you can specify— syslog has different levels. There’s the warning level, the debug level, the inform level, so you can group by those. Yep, like threat level red, threat level yellow.
Ahhhh! Everyone. Yeah. So you can say, “I only want those things there.” But sometimes there’s interesting information in the informs. I know that all of our security professionals and other professionals are very, very focused on syslog messages.
Yeah, it’s fantastic with firewalls, because oftentimes someone will come in, or you get a help desk request that says, “I can’t reach such and such application.” And sometimes the simplest way, rather than routing through all of your firewall rules, is to just check the logs. Was that guy’s IP blocked for anything in the past hour or whenever he was requesting it? And syslog makes that really easy. All the data comes from the firewall to your administration thing. It’s a simple search of a text string.
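A sketch of that "simple search of a text string," with the severity levels folded in. The leading `<number>` on a syslog message encodes facility times 8 plus severity, which is how the level is recovered here. The log lines, hostnames, and addresses are invented for illustration, and real firewall messages will be formatted differently.

```python
import re
from typing import Iterable, List, NamedTuple, Optional

# Index in this list is the syslog severity number (0 is the most severe).
SEVERITIES = ["emergency", "alert", "critical", "error",
              "warning", "notice", "informational", "debug"]
PRI_RE = re.compile(r"^<(\d{1,3})>(.*)$")


class SyslogLine(NamedTuple):
    facility: int
    severity: str
    text: str


def parse_line(raw: str) -> Optional[SyslogLine]:
    """Split the <PRI> prefix into facility and severity."""
    match = PRI_RE.match(raw)
    if match is None:
        return None
    facility, severity = divmod(int(match.group(1)), 8)
    return SyslogLine(facility, SEVERITIES[severity], match.group(2))


def search(lines: Iterable[str], needle: str,
           least_severe: str = "informational") -> List[SyslogLine]:
    """Return parsed lines containing needle, at least_severe or worse."""
    threshold = SEVERITIES.index(least_severe)
    hits = []
    for raw in lines:
        parsed = parse_line(raw)
        if parsed is None or needle not in parsed.text:
            continue
        if SEVERITIES.index(parsed.severity) <= threshold:
            hits.append(parsed)
    return hits


# Invented messages, as if collected from a firewall called fw01.
collected = [
    "<134>Oct 11 15:06:02 fw01 permit tcp src 10.1.20.77 dst 10.9.0.30 port 3306",
    "<132>Oct 11 15:06:09 fw01 deny tcp src 10.1.20.55 dst 10.9.0.12 port 443",
    "<135>Oct 11 15:06:10 fw01 debug session table size 1042 src 10.1.20.55",
]

for hit in search(collected, "10.1.20.55"):
    print(hit.severity, "|", hit.text)
```

With the default threshold, the debug line is filtered out and only the warning-level deny for that address is printed.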
Right, and that’s an important point, is that you have many, many devices, whether they’re firewalls or network devices out there. They’re all sending their syslog into a central listening server that’s collecting it together. That lets you dumpster dive through looking for trends or patterns whether you’re using a SIEM, a security information event management system— I think I got that right.
Pretty good, pretty good.
Which collects all the logs together. Or you just have a central location for all your messages, or whatever it is that you’re looking to do. Some messages, especially on the network side, only appear as syslog. I’ve always had a lot of fun with these. One is BGP neighbor down, when you’re setting up a VPN connection. And the other one is—hopefully no one’s dealing very much with spanning tree anymore, but if you are, when there’s elections in spanning tree, that only appears as a syslog message. It doesn’t appear as anything else. But it can be very, very useful.
Yeah, absolutely. Building rules based on what you’re seeing in the logs. And when you do a troubleshooting session, often you’ll check the log, see if anything quirky happened. And sometimes you’ll find, oh, there was a spanning tree election I did not expect to happen, or BGP neighbor-ship, or OSPF, or whatever it is. A lot of esoteric bits of information are on this syslog that are not anywhere else. So as you go through the troubleshooting session, you find some interesting piece of data that clues you into a problem, you can actually use that ahead of time for the future problem by setting up search rules and so forth.
Excellent. So, what I want to do is take a brief side trip into the land of WMI. Now, Windows Management Instrumentation is obviously for Windows machines only. You’re not going to find it running on your routers or your UNIX boxes.
It’s the first word: Windows.
Yeah, it is sort of key in there. And for the most part, what you’re doing is collecting the same stuff that you could get with SNMP. There’s a couple of reasons why you would care about WMI, though. The first one is that WMI is native to Windows. It’s on more or less by default, whereas SNMP, you have to turn on. You have to have some— if you have a large environment, you have to have some sort of effort to turn on SNMP.
Yeah, so generally for network devices, you’re thinking SNMP. Generally, for Windows devices, you’re thinking WMI.
Right. That’s the pro side of it, and you can get some really good information. In fact, you can get more information off of Windows boxes than sometimes SNMP can. However, WMI by default requires about 100 bajillion ports to be open.
Yeah, it’s a little ridiculous. Always irritating to open those ports for your firewalls. You can specify static ports, but that gets into a whole other mess. But the nugget here is, if you’re having trouble getting WMI data back, you need to be checking your firewall.
Right, exactly. But more, and also, again, I’m a fan of SNMP. On Windows boxes, you can go WMI as long as there’s no issue with it. But you should know that both techniques exist for those kinds of servers. So that really covers the basics of the basics. Now I want to start to dovetail into some of the more advanced monitoring topics, but they’re still part of monitoring 101. They’re just not sort of the foundational layer of monitoring.
Yep, so let’s talk about NetFlow first. I love NetFlow. As a network administrator, NetFlow answers the question for me, “So if my interface is being heavily utilized, what’s using that interface?” So let’s take a look at our interface here and take an example of what that looks like. So here, we’re looking at an interface on this device called BOWAN. That’s body odor WAN.
Of course it is.
GigabitEthernet0/1.2022. So on this specific virtual interface, or sub-interface rather, we can see the top left graph is telling us, what are the most talkative endpoints? Which endpoints are using the most traffic? And when are they using that traffic? And the top right graph shows us conversations. A conversation is: I’m talking to you, and I’m talking to you on specific ports. So web server, MySQL server, MS SQL server. What application is involved in our conversation?
Right. Yeah, the conversation piece is the part that I actually like. I know that as a network guy, you want to know—the circuit is pegged. You know that it’s passing this enormous amount of data. Now who’s the culprit? Is it Fred in accounting who’s streaming Netflix? Or is it something else? It’s the authentication traffic that’s going to a really chatty Active Directory server, for example. So you want to know that. But what really excites me about it is the other piece, the conversation pieces, because NetFlow by default, can be broken down as, for example, the top 10 conversations here. And I can see that between this server and that server, that device and that device, even if I’m not monitoring those devices, I can tell what kind of data, what kind of information is being passed and which interfaces.
So you like a global view.
Whereas I’m zoned into a specific interface and fixing that interface. You’re saying, “What is my entire network doing, and what are the devices that are on that network doing, even if I don’t monitor them directly?” Yeah, because a lot of times as a monitoring engineer, I’m asked, “How much web traffic are we seeing on this interface?” Or how much whatever? And again, here we have conversations between two endpoints, but there we can also take the data and look at the different kinds of data. So you can see that the web traffic, port 80, is happening. It’s passing 5.6 gig currently, ingress, and then another 14.5 gig egress. And I can see who is using up all that traffic. Again, is it Fred in accounting who is binge watching Walking Dead, or is it something else that’s going on? The other nice thing about this, from a security standpoint, is that I can look at traffic and say, “Wait, we have no business talking to that endpoint at all.” And you can start to pick that up from NetFlow, where you might not see that with other monitoring types.
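The three views discussed here (top endpoints, top conversations, traffic by port) can be sketched as simple aggregations over flow records. This is a simplified illustration, not the NetFlow export format: the `Flow` record, its field names, and the sample addresses and byte counts are all hypothetical.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Flow(NamedTuple):
    src: str
    dst: str
    port: int    # the application (server-side) port for this flow
    octets: int  # bytes carried by this flow


def top_talkers(flows: Iterable[Flow]) -> Counter:
    """Bytes per endpoint, counting traffic in both directions."""
    totals: Counter = Counter()
    for flow in flows:
        totals[flow.src] += flow.octets
        totals[flow.dst] += flow.octets
    return totals


def top_conversations(flows: Iterable[Flow]) -> Counter:
    """Bytes per pair of endpoints and port, regardless of direction."""
    totals: Counter = Counter()
    for flow in flows:
        pair = tuple(sorted((flow.src, flow.dst)))
        totals[(pair, flow.port)] += flow.octets
    return totals


def bytes_by_port(flows: Iterable[Flow]) -> Counter:
    totals: Counter = Counter()
    for flow in flows:
        totals[flow.port] += flow.octets
    return totals


flows = [
    Flow("10.1.20.55", "10.9.0.12", 443, 9_000),
    Flow("10.9.0.12", "10.1.20.55", 443, 120_000),
    Flow("10.1.20.77", "10.9.0.30", 3306, 40_000),
    Flow("10.1.20.55", "10.9.0.30", 80, 5_000),
]

print(top_talkers(flows).most_common(1))        # [('10.1.20.55', 134000)]
print(top_conversations(flows).most_common(1))  # [((('10.1.20.55', '10.9.0.12'), 443), 129000)]
print(bytes_by_port(flows)[80])                 # 5000
```

Note that none of the endpoints needs to be monitored directly; everything comes from the flow records alone.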
So, where does this NetFlow data come from? Good question. It is not a closed protocol. It is a protocol developed initially by Cisco, although lots of network devices now support it. And it comes from one particular exporter. So you have a router that sits somewhere on your network, typically close to the center, and it is watching the traffic coming through. And every so often, it will throw out a bundle of conversation data to a machine that’s listening, like our NetFlow server. So every so often, the router will pass this big chunk of data, and the server will pick it up. It will parse it out and then it will be able to display this stuff. You can do it with multiple devices across your environment, but you’re not monitoring every single device. You’re only monitoring a few things in that soft chewy center of your network. Yeah. The WAN edge at each of your locations, sort of the center of the WAN, is a common place to deploy NetFlow exporting. The other thing to note is, you’re not sending all of the data. Right? Because of course, your link is going to be heavily utilized if you double all of your data. So what you’re really doing is, the router looks at all of the packets and it tracks a couple of key pieces of information, and then exports only that. And there are some other summary functions that happen in there, as well.
It’s a very small portion of the total data.
Exactly. It’s still a lot of data. We have to be honest about that.
There’s a lot of data. You have to have the right kind of hardware to monitor and collect this. But once you have it, the insight you can get is invaluable as a network engineer.
You need to have Flow monitoring in your network regardless of where you get it from. You need to have Flow monitoring.
All right. So it’s time to talk about IPSLA. IPSLA, I think, is the most underutilized monitoring technique that we have in this list.
You’ve been waiting this entire episode just to talk about IPSLA. I have been waiting to talk about IPSLA. I love IPSLA. And very few people use it.
So let’s get a couple basics out of the way and then we’ll talk about why I love it.
The basics are: It stands for IP Internet Protocol Service Level Agreement. So, it’s focused on service level agreement, but that’s not really the only thing you use it for. In fact, I would say that’s kind of vestigial. The next thing we should mention about it is, it’s synthetic and it’s Cisco proprietary. So, it’s sourced from your Cisco devices. Cisco devices are what you need to be running IPSLA. Juniper and others have similar sort of peers, but it’s not quite called IPSLA.
Right.
And it’s synthetic. What synthetic means is, rather than like WMI or SNMP, sending a question to the network device and having that network device tell you its answer—instead, the synthetic monitoring sends synthetic traffic that looks sort of like your user traffic, and measures the results on that traffic. Which is fantastic. It tells you what actually happens to traffic that you send across the network, rather than what network devices think is going on.
Right, and it tells you at 2:00 in the morning, when there may not be any actual users on the network, so you can find out at 2:30 in the morning that there’s a problem, and you can get a call or a trouble ticket in place, rather than waiting till 9:00, when everyone hits the office, and all of a sudden, that’s when you find out that whatever it is, is broken.
Yep, so that’s a big difference between synthetic versus regular polled monitoring. The next thing I like is that you can use IPSLA to sort of mimic traffic. So rather than just sending that ping, for instance, to get latency you can send TCP traffic on port 80 or TCP traffic on port 443 to get an idea of what web browsing feels like. And you can see the behavior you have in place on your network for your web browsing traffic, if you’ve got QoS or what have you that treats different traffic differently. That’s super important.
Right. Now, we’re dancing around, I think, the big issue, which is that IPSLA is predominantly used for monitoring voice traffic. Yeah. That’s true.
It’s mostly used—I mean, when you hear about IPSLA, it’s mostly in the context of call quality and whether your— I’m not saying that’s the only use case. And that’s an important point especially in monitoring 101. You’re going to hear that IPSLA, it’s for phones. No, it’s for a lot of things. But the predominant usage of it is with voice traffic.
Yeah, my take on that is that, you run IPSLA when you want to make sure that your network is running right. People realize they need to do that when phones are dropping. But they should be doing that all the time. They really should. And the other thing that’s great about IPSLA is how granular it gets. So whereas ping, you’re usually sending a couple of pings every minute or half minute, even at aggressive thresholds like that, it’s just a couple. So if you’re trying to detect something like 10% packet loss, that’s super impacting to your user base, you would have to send a whole lot of packets. Generally, that’s not how you use ping in a monitoring tool. That’s what IPSLA fixes for you. You can detect, “From 5:00 pm to 5:30 pm, I had 10% packet loss,” because IPSLA can provide that level of granularity. Let’s take a look at it.
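The arithmetic behind "you would have to send a whole lot of packets" can be sketched as follows. It assumes each probe is dropped independently with the same probability, which real networks do not guarantee, and the function names and probe counts are hypothetical.

```python
def chance_of_seeing_loss(loss_rate: float, probes: int) -> float:
    """Probability that at least one of the probes is dropped."""
    return 1.0 - (1.0 - loss_rate) ** probes


def loss_resolution_pct(probes: int) -> float:
    """Smallest loss step a window of this many probes can report."""
    return 100.0 / probes


# A couple of pings per poll can only ever report 0%, 50%, or 100% loss,
# and will usually see no drop at all on a link losing 10%.
print(loss_resolution_pct(2))                        # 50.0
print(round(chance_of_seeing_loss(0.10, 2), 2))      # 0.19

# A burst of a few hundred synthetic packets resolves loss far more finely.
print(loss_resolution_pct(500))                      # 0.2
print(chance_of_seeing_loss(0.10, 500) > 0.999)      # True
```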
I was going to say, he’s really passionate about this. We’ve talked a lot about it. So let’s see.
We got to take a look.
So, here’s an example of a map that you can build using IPSLA.
Yeah, and this really speaks to IPSLA’s sort of WAN-centric perspective. A lot of people use it over their long-distance connections, their connections that connect different offices they have in different geographic regions. And then you want to visualize that. So this just happens to be how our tool does that. So let’s jump into a specific IPSLA operation and see what sort of data we get. So there’s a couple of things going on. I’ll talk about the basics here, and then I’ll hand it off to you about the voice-specific stuff.
So, in general, we’re sending a whole lot of packets from one router—the router we configured to send the packets— to another device, whether it’s a router or not. It sort of depends on the operation type. We may get into that.
In this case, it’s going from Ottawa to Austin.
Yes, Ottawa to Austin.
In that direction.
Yep, real traffic going in that direction. And then we can get things like the packet loss and the latency at a very high degree of accuracy and at a high interval rate. There’s a bunch of other things we can get that are 100% unique to IPSLA that feed into this MOS thing. What is that?
So, MOS stands for Mean Opinion Score. When people say MOS score, it’s a MOS score, score, score. Anyway, that’s not important. What it is is, they’ve got a whole bunch of people, real human beings, on phones and they set up calls between those people. And then they inject jitter, packet loss, and latency, and then they had them rate on a scale of one to five how good they said that phone call was. And after doing that with hundreds and even thousands of people, they created this scale of one to five call quality, which is what MOS is. So what IPSLA does... so, let me take a step back. To measure call quality, you have, really, two choices from a monitoring standpoint. Either you set up a laptop in two different places and have those laptops make fake phone calls to each other all the time. And then you have to have lots of laptops and lots of phone calls. And they make the phone call. They test the jitter and packet loss. They ship that data back to the mothership, you know, the monitoring tool. That’s one way to do it. But it’s a little bit bulky. Or, you have the routers themselves do this fake phone call. Now you’re not getting all the way to the end. Right? You’re not getting from the router to the actual desk. But it’s more or less close enough.
It’s the important bits. Yep.
Right. So those routers are automatically making phone calls. And it also means that, from whatever point in your network to the other point, that you’re going to be supporting phone calls, those routers are doing it anyway. So this is, in this case, on this Ottawa to Austin connection, we can see that from morning, whatever it is, until 6 pm the call quality was around two. It got better. It went up to a three. By the way, three is considered basic, average, “good enough.” Anything lower than three is considered bad. Five is considered stupendous. So, from 6:00 at night until 6:00 in the morning, so, basically when no one was on the phone, the call quality was three. It was good. And then as soon as people jumped on the call, then it went right back down to two, or even one.
Yeah, it’s a really clever way to pull in several different metrics and make something that is impactful to humans. Like, what is the opinion that someone would have if they were on the phone at that point in time? Would they say this is okay? This is good. This is great. But that pulls in jitter, pulls in packet loss, pulls in latency. It’s fantastic.
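One way to picture several metrics collapsing into a single score is the R-factor to MOS conversion from the ITU-T G.107 E-model, sketched below. This is not necessarily how IPSLA or any particular tool computes its MOS. The conversion formula is the standard one, but the starting value of 93.2 and the two impairment numbers are assumptions chosen for illustration; deriving real impairments from measured latency, jitter, and loss is a separate step not shown here.

```python
def mos_from_r(r: float) -> float:
    """Convert an E-model R-factor (0 to 100) to an estimated MOS."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    mos = 1.0 + 0.035 * r + 7e-6 * r * (r - 60.0) * (100.0 - r)
    return max(1.0, mos)  # the polynomial dips slightly below 1 for tiny R


def estimate_mos(delay_impairment: float, loss_impairment: float,
                 base_r: float = 93.2) -> float:
    """Subtract impairments from an assumed best-case R, then convert.

    The impairment values are hypothetical inputs: a worse network means
    larger numbers, which pull R (and therefore MOS) down.
    """
    return mos_from_r(base_r - delay_impairment - loss_impairment)


print(round(mos_from_r(93.2), 2))          # 4.41, a clean network
print(round(estimate_mos(20.0, 23.2), 2))  # 2.57 or so, a struggling one
```

The design point is that people on the phone do not experience latency, jitter, and loss separately, so everything is folded into one number on the one-to-five scale.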
And those are all the things that would affect any sort of sensitive program—which isn’t just phone, by the way. It’s also your video. It’s also—there’s a few other applications that are extremely time sensitive in terms of when things—
Right. So that’s one thing. However, in talking about use cases, I want to point out, as you mentioned earlier, that this is good for other things besides just phones. So don’t listen to anybody who comes to you and says IPSLA is just for monitoring phones. One example of that is you want to know whether your DHCP server is responding appropriately. This is a pretty common thing. And lots of people want to monitor this. It’s an application-type monitor. There’s a problem though. Your DHCP server will only respond to requests inside that network segment. So if you’re at the home office and you have a DHCP server in a warehouse, you can find out if it’s running. You can find out if the service is running. You can even query and say, “Do you have IP addresses available?” But is it really handing out addresses? That you can’t do, because your monitoring server is over here and the DHCP server is there. Wait a minute. DHCP is one of the SLA operations. You can actually have the router at the warehouse ask the DHCP server at the warehouse, “Hey, do you have an address I could have?” It’s not going to get one. You’re not going to use it up. It says, “I’m looking for an address.” The DHCP server says, “I got one for you.” And then the router doesn’t say anything. It doesn’t actually complete the transaction. So that’s one use case. And other IPSLA operations include checking DNS, checking FTP or TFTP servers. You can do UDP, TCP, ICMP, ICMP path.
There’s a whole bunch of them. There’s really a lot of functionality there. Right. So again, this is an advanced topic. But it’s something that even in a monitoring 101 context, you need to know that it exists, and that it is a technique that is well worth looking into.
All right, onto the next topic, which is Config Backup. Another one you’ve been waiting to talk about.
A subject near and dear to my heart. So, there’s a number of things this technology can do for you, but the first one is, you know, it’s 2 am. I get a message from monitoring that my data center is down. And I go and look at monitoring, maybe drive out to the data center and find out a device is broken. Right? So I call up Cisco in a frantic tone, and I’m like, “Hey, get me a device as soon as possible. I’ve got the four-hour SLA.” So they ship it out via courier. I’m sitting there at the data center like okay, it’s coming. It’s coming. We’re going to be okay. And then the device gets there and I unpack it, and rack it up, plug in the cables, starting to think about configuration. I plug in my console port, and I’m like, oh, I kind of have to configure this, don’t I?
To do all the things that it was doing before, all of the QoS settings.
Now let me see if I can remember the entire config without error off the top of my… no. You can’t remember that. There’s no one good enough to remember the entire large config. Configs are a live thing. As time goes on, you do all of these incremental changes to get them where you want to be. You can’t just remember that. You can’t configure it on the fly. That’s a living document that you’ve built up, and you need to protect it. Config Backup does that for you. So, the basics of how Config Backup works. Let’s start jumping into the demo here.
So, we’re looking at good old BOWAN which I have to say…
…is a branch office. It’s not—no, it’s not that. No.
All right. And I go down to configs.
So in this case, we’re backing up the configuration every 30 minutes. You can do it every hour. It sort of depends on how often you make changes. My device went down in this scenario, and I go to the configs tab. And look at that. I got my config. That’s like gold in an outage.
Right, and it’s here even if the device is now—it has exploded and is now a pile of bubbling molten metal. It doesn’t matter. You’ve got this information here.
So, click into that guy. I’ve got to see it. I’ve got to make sure— by the way, when you’re setting up config backup, you want to click and look at the results, make sure you have it there.
There it is.
There’s my config.
And you can literally... I mean, worst case scenario, you could literally highlight, copy, and then in a terminal session on your new device that came in with four-hour service, you could just paste that in and be good to go. You don’t have to, but the fact is, you could literally just do that.
Yeah, it’s fantastic. I get such a sense of “everything’s going to be all right” when I have that Config Backup. But that’s not the only thing Config Backup is good for.
No, it’s not. So, there are a few different things. First of all, just like you can back up the config, you can push the config out. So in that disaster scenario, you get the phone call at 2:00 in the morning: “Hey, this thing is completely belly up. What am I going to do?” You say, “No problem. Go grab the spare off the shelf,” because we are thoughtful network professionals who have spares waiting for things like this. Take the spare, rack it, and get the cables all plugged in. Then give it the right IP address, because that’s how I’m going to get to it. As long as it’s the right kind of box with the right IP address, that’s really all I need, so you don’t even have to copy and paste. I’m going to go back one screen. On this config screen, back where we started, down at the bottom, I can upload a config. I can pick which version of the running config I want and click upload. I’m not going to right now. And it will push the config right back out to the device.
But there’s another thing it can do, and that’s compare.
But that’s not all. So you’re backing up your configs every half hour, every hour, every day, whatever it is. What you can do is automatically compare the one you just backed up to the one you had before, or to a baseline that says it should look like this all the time. And I’ve got a picture of what that looks like, with highlighting. All right, so yellow means it doesn’t really matter, because it’s just a time stamp. But oh, wait a minute. Somebody added a QoS policy in here that wasn’t there before. So I can compare these two configs, and then I can alert on that. I can send a message saying, “Hey, this changed.” If you didn’t have a change control and you weren’t expecting it, someone’s messing around with your stuff.
Yep. And it basically keeps you constantly aware of what is changing in your network’s configuration, which is a super-important question to be able to answer. And I think as network guys, we are very fortunate in how vendors have chosen to do this. Unlike Linux and UNIX servers, Windows servers, and what have you, network devices have a config file that really represents 95% of the configuration state of that device. So if anything changes in your network in terms of configuration, it’s in this file somewhere. So keeping track of those and being able to diff them is super powerful.
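The compare step can be sketched in a few lines of Python using the standard difflib module. This is a minimal illustration, not how any particular product does it. The VOLATILE pattern is an assumption about which lines change on every save (the "yellow" time stamp lines) and would need tuning for real devices.

```python
import difflib
import re

# Assumption: lines matching this pattern change on every save and carry
# no real configuration change, so they are ignored in the comparison.
VOLATILE = re.compile(
    r"^(! Last configuration change|! NVRAM config last updated|ntp clock-period)"
)


def meaningful_diff(old_config, new_config):
    """Return the added ('+ ') and removed ('- ') lines between two configs."""
    def keep(text):
        return [line for line in text.splitlines() if not VOLATILE.match(line)]

    delta = difflib.ndiff(keep(old_config), keep(new_config))
    return [line for line in delta if line.startswith(("+ ", "- "))]


if __name__ == "__main__":
    previous = (
        "! Last configuration change at 10:00:00 UTC\n"
        "hostname BOWAN\n"
        "interface GigabitEthernet0/1\n"
    )
    current = (
        "! Last configuration change at 11:30:00 UTC\n"
        "hostname BOWAN\n"
        "policy-map VOICE\n"
        "interface GigabitEthernet0/1\n"
    )
    for line in meaningful_diff(previous, current):
        print(line)
```

An empty result means nothing meaningful changed. Anything else is a candidate for an alert, especially when there was no change control to explain it.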
Yeah, very powerful.
Another thing I want to mention, in case you’d like to think about yours for a while: we talked about the polling interval. You can do every 30 minutes, every hour, or every day for people whose configuration doesn’t change very much. Another thing that is common in lots of tools is that you can set up triggered config backup. So using the syslog functionality or the SNMP trap functionality that we talked about earlier, if your device tells you, “I’ve had a configuration change,” that someone made one right then, that can trigger the monitoring solution to go and pull a copy of that configuration, so that it’s always very up to date and your time stamps for when the config changed are pretty accurate. Another great thing.
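A triggered backup like the one just described might look roughly like this in Python. The sketch watches for the Cisco IOS syslog message that reports a configuration change and calls a backup function when it sees one. The BackupTrigger class, its parameters, and the backup callback are all hypothetical, and the minimum gap between backups is an added assumption so a burst of messages does not cause a burst of backups. A handler for an SNMP trap would follow the same pattern.

```python
import re

# Cisco IOS logs this message when someone leaves configuration mode, e.g.
# "%SYS-5-CONFIG_I: Configured from console by admin on vty0 (10.1.1.5)"
CONFIG_CHANGE = re.compile(r"%SYS-5-CONFIG_I: Configured from (\S+) by (\S+)")


class BackupTrigger:
    """Call backup(device, changed_by=...) when a device reports a change."""

    def __init__(self, backup, min_gap_seconds=60.0):
        self.backup = backup              # hypothetical callable supplied by the caller
        self.min_gap = min_gap_seconds    # assumption: suppress repeat backups
        self.last_run = {}                # device name -> time of last backup

    def handle(self, device, message, now):
        """Process one syslog message; return True if a backup was started."""
        match = CONFIG_CHANGE.search(message)
        if not match:
            return False
        last = self.last_run.get(device)
        if last is not None and now - last < self.min_gap:
            return False
        self.last_run[device] = now
        self.backup(device, changed_by=match.group(2))
        return True
```

In use, the caller would pass in whatever function actually pulls the configuration, and feed each incoming message to handle along with the device name and the current time, for example from time.time().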
Right, and that gets us into— I always joke that network configuration management is like a gateway drug to software-defined networking because once you have the—you know—you get a trap. It’s changed. Okay, back up the config. Okay, analyze it. Well now, you can do automated other things. For example, if the change wasn’t approved or wasn’t expected, you can actually take that device and now quarantine it using a configuration change. You can also check to make sure that there are policies, which we’re not going to talk about today, but there are policies that are in place so you can push, not the whole config, but a piece of the config in so it fits the policy rules in your organization.
We’re trying so hard to keep it on the basics. There’s just so much good stuff you can do once you get the basics of config backup going. Just get that going.
Right. And that’s true of all the monitoring that we talked about today is that once you have the basics of it, then the sky’s the limit, and you can do almost anything you want to do. There’s obviously a lot more to monitoring than just that. Sure, there’s log files, perfmon counters, and even more esoteric solutions like vendor-specific APIs.
Right, not to mention bare-knuckle scripting solutions. But this is definitely, I think, a good start.
Yeah, and one more reminder that you can find everything we talked about with even more detail in the Monitoring 101 eBook. Right, and speaking of reminders, I want to remind everyone to visit www.lab.solarwinds.com to get updates on new episodes, as well as leave comments on what you’d like to see us cover in the future. For SolarWinds Lab, I’m Leon Adato.
And I’m Chris O’Brien. Thanks for watching.