Whether you’re new to monitoring or trying to show the ropes to someone who is, sometimes the hardest part of understanding the technology is figuring out what’s actually going on behind all the graphs and buttons. Head Geeks Leon Adato and Destiny Bertucci sit down to talk through the nitty gritty of monitoring, and separate the hype from the hypervisor. Leon and Destiny continue the discussion from last year’s “Monitoring 101” episode and focus on server-centric monitoring techniques and protocols, how each one works, and what the real-world use-case is for them.
I’ll get right to the point. After the Monitoring 101 episode last December, there were a few comments.
A few? I’m still hunting down the answers for the stuff you guys were asking.
Okay, okay. More than a few, and that’s great, because it tells us that you’re really interested in something that we think is important, too. Learning about monitoring not just as something you do on the side of your real job, but monitoring as your job.
Right, and after doing this for over a decade, I can vouch that monitoring is its own skillset. The same as being an expert in storage, networking, security, or virtualization. So here at SolarWinds Lab, we want to help people dig into the area of monitoring by giving you both the advanced stuff…
We mean like the Orion SDK, connecting a ticket system using REST, running alert actions as scripts.
Exactly, but we also want to talk about the fundamentals. Now the last Monitoring 101 episode was really network-centric. So this time, I want to dig into the server side of things. So without any of the waffling you guys usually do around here, let’s get things started.
Ouch, but you do have a point. Okay, I’m Leon Adato.
And I’m Destiny Bertucci, and welcome to SolarWinds Lab.
Now Destiny mentioned the questions that we got from the Monitoring 101 show. And those came in during our live chat session, which you can see over here. If you don’t, you should head over to lab.solarwinds.com and sign up for reminders, so you can join us and ask questions in real time. You can also register for upcoming episodes and leave comments about what you’d like to see us cover next. Oh, and everything we’re going to talk about today, you can find in a SolarWinds Monitoring 101 eBook, which is available on THWACK.com as well as Amazon.
Remember that waffling thing I said you guys do around here?
You’re doing it right now.
Oh. [SOUND] So the first thing I wanted to get into is something that we actually covered in the Monitoring 101 episode. Just very lightly, which was WMI. Now we just talked about what WMI was, but because that was so network-specific, we didn’t really get into it. What I’d like to do, the same as we did with Ping, is I want to show what WMI is. Now WMI stands for Windows Management Instrumentation and obviously, this is a Windows-specific thing. You can’t do it on Linux, and you can’t do it on a router, or what have you. But what it does, is it gives you access to a Windows box and a lot of the different controls. But…
So it’s scriptable.
It is absolutely scriptable. And that’s where we’re going with this. Is that it’s something you can do with the command line, and you actually should. I’m going to give you a use case for why you might want to do that. But, just like Ping, you should know how to do it at the command line. You should understand what’s going on. Because it’s not voodoo. It’s not some magical special sauce stuff. So here I am, on a box. And you can do this on any Windows machine; really, you don’t have to have anything special. WMIC, K-E-Y M-O-U-S-E. From here, if I want to do a regular WMI command, like I want to get the disk information, it’s logic, logicaldisk. There I can see my disk information.
So pretty much, you’re asking it something and it’s giving you an answer.
Exactly, it’s giving me…
Just like we do for SNMP and other things that we’re doing. It’s just simply scriptable information that—I have a question and I need an answer from my Windows box.
Precisely, and you can be more specific. It’s like a lot of command line things. Here, I just said logicaldisk and it gave me a lot of information. You can see it goes all the way out to the right and then comes way back. So here, I want to be more specific. I’m going to say logicaldisk. Oops, spelling counts. And I only want where the drive type equals three, which is a hard drive. And I want just a few fields: name, size, free space. And again, you can see where this would be really scriptable, right? You could run a command every five minutes or whatever and just pull back the free space, or parse it out, or whatever. Now, this is local.
I could do it against a remote box, also. I just need to add /node and the node I want is ‘dev OS my machine 02.’ There we go. I need to give it some authentication. Very secure; I’m using the administrator account. And what command am I running against that remote box? Logicaldisk. Enter. It’s going to ask me for the password. So there, you can see that I ran the same command, logicaldisk, against the remote box.
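The two queries above, local and remote, can be scripted. The following is a minimal sketch in Python, assuming the `wmic` utility is available (newer Windows releases have deprecated it, so it may not be installed everywhere). The helper names, the host name, and the sample output are illustrative assumptions; the sketch only builds the command line and parses CSV-formatted output, it does not run anything.

```python
import csv
from typing import Dict, List, Optional


def build_wmic_command(node: Optional[str] = None,
                       user: Optional[str] = None) -> List[str]:
    """Build the argument list for the logicaldisk query described above."""
    cmd = ["wmic"]
    if node:
        cmd.append(f"/node:{node}")   # run against a remote machine
    if user:
        cmd.append(f"/user:{user}")   # wmic then prompts for the password
    cmd += ["logicaldisk", "where", "drivetype=3",
            "get", "name,size,freespace", "/format:csv"]
    return cmd


def parse_wmic_csv(text: str) -> List[Dict[str, str]]:
    """Parse CSV-formatted wmic output, skipping blank lines."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return list(csv.DictReader(lines))


# Illustrative output only; the host name and numbers are made up.
SAMPLE = """
Node,FreeSpace,Name,Size
MYHOST,53687091200,C:,107374182400
MYHOST,10737418240,D:,21474836480
"""

disks = parse_wmic_csv(SAMPLE)
for d in disks:
    pct_free = 100 * int(d["FreeSpace"]) / int(d["Size"])
    print(f'{d["Name"]} {pct_free:.1f}% free')
```

On a Windows machine, the list returned by `build_wmic_command()` could be handed to `subprocess.run` and the captured output passed to `parse_wmic_csv`.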
So you’re being your own little monitoring setup yourself.
Right, exactly, and I could keep typing this really, really fast, over and over and over again, and just keep on watching it. Nobody would want to do that.
Right, now, the time when you would want to do this, though, is if you’re having a problem with your WMI monitoring tool, SAM or whatever it is that you’re using, and you’re not quite sure what’s going on with it. You can actually go to the command line on the box and try it against that remote machine, and see what errors you’re getting and do troubleshooting. The same way you would troubleshoot ping or SNMP, if you could do it from the command line also. So this gives you insight into what’s going on. However, as you said, you probably wouldn’t want to do this all the time. So, what does this look like in the real world? Same stuff right there.
So here’s exactly the same stuff. So you’re already seeing that your disk space; you’ve got trending, you’ve got everything you want on your logical, all on one page. And it’s doing this all in the background for multiple devices. So instead of asking all of these questions and going on and on and on like Leon does.
This goes out and actually does all the information gathering for you, and trims it up so that you can have all your charts, reports, and learning, anything that you need.
Right, and it’s doing it every five minutes, every one minute, every 15 minutes. Whatever cycle you want. That’s the point of having a monitoring tool, is that it’s just doing a lot of the stuff you could have done yourself, but you’re too darn lazy to do. But also, now we did the disk space, but this is good for network bandwidth. And bandwidth information, errors, packet loss, CPU, RAM. Really any of those metrics. Hardware, inventory, any of those things that you can ask the machine for, you can pull off.
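The polling cycle described above can be sketched as a small loop. This is an illustration in Python under stated assumptions, not how any particular monitoring product schedules its polls: the `poll` function, its parameters, and the stand-in sampler are all hypothetical, and the sleep function is injectable so the demo does not actually wait.

```python
import time
from typing import Callable, List, Tuple


def poll(sample: Callable[[], float],
         interval_seconds: float,
         cycles: int,
         sleep: Callable[[float], None] = time.sleep,
         clock: Callable[[], float] = time.time) -> List[Tuple[float, float]]:
    """Call sample() once per cycle and keep (timestamp, value) pairs."""
    history = []
    for i in range(cycles):
        history.append((clock(), sample()))
        if i < cycles - 1:
            sleep(interval_seconds)   # wait for the next polling cycle
    return history


# Stand-in sampler: a real one would run a WMI query for disk usage.
readings = iter([80.0, 80.5, 79.8])
history = poll(lambda: next(readings), interval_seconds=300, cycles=3,
               sleep=lambda seconds: None)   # no real waiting in this demo
print([value for _, value in history])
```

Keeping the timestamp with each value is what makes the charts and trending discussed here possible later on.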
And so something that’s also really great about that is that we’re going to sit here and we’re going to gather all of this information for you and give it to you in an actual visual aid. When you’re looking at this, and you go out and see these and they constantly stay at a certain line.
And you’re going, man, everything is always at 80%. How is this valuable to me? Well, that’s your normal. So that’s what we need to use as a basis. This is my normal box. It’s normally at this location; this disk is at this amount constantly. But what do you do when it changes? That’s when you start setting precedents for, when do I need to be alerted? When are these things just outside of normal? You don’t know that if you don’t have a normal baseline that you have actually within your corporation or company that’s using it.
Exactly, and that’s where a tool like this is better than sitting at the command line, typing the command every time.
Exactly, and you can go back in history, so I mean you can figure out things that happened in the past. You can let things build upon. If there’s things that you’re not wanting, not aware of, don’t monitor them. You don’t have to do those. But they’re already there in the background for a reason. So for historical, for trending, for future forecasting, things of that nature, it’s best to keep them going so that your monitoring solution is whole.
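The idea of a normal baseline built from history can be sketched in a few lines. The example below is one simple way to do it, assuming a mean-and-standard-deviation baseline; the function names, the tolerance of three deviations, and the sample readings are assumptions for illustration and are not taken from any product.

```python
from statistics import mean, stdev
from typing import List, Tuple


def baseline(history: List[float]) -> Tuple[float, float]:
    """Return (mean, standard deviation) of past readings."""
    return mean(history), stdev(history)


def is_outside_normal(value: float, history: List[float],
                      tolerance: float = 3.0) -> bool:
    """True when value is more than `tolerance` deviations from the mean."""
    avg, spread = baseline(history)
    return abs(value - avg) > tolerance * spread


# Made-up disk usage percentages for a box that always sits near 80%.
usage = [80.1, 79.8, 80.3, 80.0, 79.9, 80.2]
print(is_outside_normal(80.4, usage))   # small wobble around normal
print(is_outside_normal(91.0, usage))   # well away from normal
```

Without the stored history there is nothing to compute the baseline from, which is the argument for leaving the collection running.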
Right. [SOUND] So what’s next?
So next up are PerfMon counters. These are numerical values that you can pull off of your servers.
All right, and it doesn’t sound particularly complicated.
No, it’s not, kind of like the WMI where you ask something and you get an answer. Well, now you’re going to ask something that’s going to give you numbers. So what we’re going to do is we’re going to add a couple in here. I want to point you out; this performance monitor utility is something that we’ve seen all the way back since Windows NT. So if you’ve been using servers or managing servers for a while, you should be comfortable with the idea of them, but obviously, what we’re doing is taking this into a monitoring context.
Correct, yeah. And for right now, we’re going to actually do the processor queues.
Okay. And so we’re going to add the processor queue length here.
One of my favorites.
It’s everybody’s. I mean, you can use this in several locations. So where you can use the processor queues is, if you’re monitoring anything that’s getting backed up, something that’s going through there, anything you need to keep track of, that is where we would use these. Okay, and then I’m also going to add the network adapter.
We’ve got to get networking here somehow.
You have to.
You have to. Okay, and so here’s the network adapter and we can choose which one we want to add.
Yeah, and some PerfMon counters are just raw, they are themselves. But some of them have subcomponents. So here you can see that you can get the overall bits per second, or you can get the ones for a particular adapter or sub-adapter, whatever you want.
And we’re going to pick a particular one.
Okay. Which one? That one?
Okay, and so then, that presents you a graph. So, this is automatically gathering the information and putting it out here. Kind of like a performance monitor tool. [LAUGH]
So, if you think about the Network Performance Monitor that we have, that’s what it’s doing in the background. It’s gathering all these counters and numerical values, and it’s presenting them so we can chart them and do things.
So let’s kind of show you what that looks like. Okay. So here we have the processor queue length that we were talking about earlier. And you can see that we can actually do it by a graph, you can have the information about it, which one it’s on, where it’s at, system categories. And it shows you everything that it’s monitoring.
And this is how all PerfMon counters generally are going to work. You’re going to collect them and use them either individually to indicate that there’s a problem in one particular area. Or you’re going to use them as an aggregate, combined totals. For example, I’ve done this a lot, but I like to use processor queue length along with the CPU utilization because, if you have jobs that aren’t being processed and you have high CPU utilization, it could indicate that there’s a real bottleneck somewhere in the system. So that’s where you need to be using the CPU counter along with the processor queue length counter together to give you a piece of information.
And then you can also use it because if you have processor queues and jobs filling up, and you’re noticing that services are still up, but things aren’t coming through, there’s your problem. I mean, something is messed up and you’re getting backed up, so they’re not actually sending out, and it’s just building, and building, and building. We don’t want to build it, and build it, and build it until nothing’s working and you get all this information lost. So it’s good to keep a tab on things that are going on with servers. And things that are using things like MSMQ, things of that nature, so that you know and have a handle of how that server is actually using them.
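Using two counters together, as described above, amounts to a combined condition. Here is a minimal sketch in Python; the limits (90% CPU, a queue length of 10, three samples in a row) are placeholder assumptions that would need tuning for a real server, and the `Sample` type is hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Sample:
    cpu_percent: float    # e.g. \Processor(_Total)\% Processor Time
    queue_length: float   # e.g. \System\Processor Queue Length


def likely_bottleneck(samples: Iterable[Sample],
                      cpu_limit: float = 90.0,
                      queue_limit: float = 10.0,
                      min_consecutive: int = 3) -> bool:
    """True if both counters stay over their limits for N samples in a row."""
    streak = 0
    for s in samples:
        if s.cpu_percent > cpu_limit and s.queue_length > queue_limit:
            streak += 1
            if streak >= min_consecutive:
                return True
        else:
            streak = 0   # one healthy sample resets the count
    return False


busy = [Sample(95, 14), Sample(97, 18), Sample(96, 22)]
spiky = [Sample(95, 14), Sample(40, 1), Sample(96, 22)]
print(likely_bottleneck(busy), likely_bottleneck(spiky))
```

Requiring several consecutive samples is a design choice to avoid flagging a single busy moment; either counter alone would fire far more often.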
Great. Now, this is one that’s already been done. What does it look like to add?
Okay, well, let me show you. So you go into the SAM settings and you can go to component monitor wizard. And then naturally, we have it by default, where we already have performance counters, because some would think that these are kind of used a lot.
It’s probably the one thing that you go to more often than any of the others, yeah.
So you can hit next here. Okay, so we’ll put the server IP address in here and then we’ll figure out which credential we want to use to access the…
Now this actually happens to be a node that’s already added. It’s already added as a WMI node, so all you have to do is say, “Inherit it.” All ready. It’s one of the benefits of using WMI monitoring for your WMI nodes.
And then you also have the option of 32- or 64-bit counters.
You always want 64.
You do. Faster.
Mm-hm. And they’re more robust.
Faster, better, stronger, whatever you’re saying, all that.
Got you to do it. Okay, so now it’s going to be loading your performance objects, so that we can choose which ones that we want to do.
Yeah, you still don’t have to necessarily know them. You can still hunt for them if you’re not exactly sure. The categories—PerfMon counters are broken down by category and then by the specific counter, and then, like we said, the sub element, if that’s what we want. So here, you’re looking through the categories first.
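The category, counter, and sub-element breakdown described above is visible in the way Windows writes counter paths. The sketch below splits a local counter path into those three parts. It is a simplified illustration: the `CounterPath` type and the regular expression are assumptions, and paths that include a machine name are not handled.

```python
import re
from typing import NamedTuple, Optional


class CounterPath(NamedTuple):
    category: str             # the performance object, e.g. "System"
    instance: Optional[str]   # the sub-element, if the counter has one
    counter: str              # the specific counter name


_PATH = re.compile(
    r"^\\(?P<category>[^\\(]+)(?:\((?P<instance>.*)\))?\\(?P<counter>.+)$"
)


def parse_counter_path(path: str) -> CounterPath:
    """Split a path such as \\Category(Instance)\\Counter into its parts."""
    m = _PATH.match(path)
    if m is None:
        raise ValueError(f"not a counter path: {path!r}")
    return CounterPath(m["category"], m["instance"], m["counter"])


for p in (r"\System\Processor Queue Length",
          r"\Processor(_Total)\% Processor Time",
          r"\Network Interface(*)\Bytes Total/sec"):
    print(parse_counter_path(p))
```

A counter with no sub-element, such as the processor queue length, comes back with `instance` set to `None`.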
And this is just like the little applet that we had up earlier, the GUI, right? So this is showing you pretty much everything that you see there, but in a little bit more user-friendly way and a way that you can actually send it here, and go out to mass quantities of servers and devices. So here is the processor queue length here that we can just check and hit next.
And from here, we’d be building an actual SAM template with this and any other counters that we wanted to do. Or make any adjustments in terms of polling frequency or the thresholds you were talking about earlier, the baselines. So this is where you can actually put in baselines, down below, where you can say that normal is going to be 70% and alert is over that, or what have you.
And the nice thing is that you can actually click the checkbox for ‘use thresholds calculated from the baseline data.’ So that way, it’ll actually know, hey, this is the normal that we have. So now, we’re going to calculate a threshold that matches this for you.
Right, and this also emphasizes why a monitoring solution has some benefits over and above just using the built-in PerfMon counters or built-in WMI queries. I mean, we all know, if we’ve been doing this for a while, that I can remote and do PerfMon counting and have it running on my desktop, and keep an eye on things, or whatever. But being able to do things like this, where you can say, I don’t want an alert when it’s 80 or when it’s twelve; I want an alert when it’s 5% over whatever normal is for this particular box, is a huge benefit.
Sometimes you just have devices that run hot. Sometimes you just have devices that, from day one when you get them out of the box, sit at 98% CPU load. No matter what’s going on, it’s 98% CPU load. Well, that’s normal. I mean unfortunately, that is the new normal for that device. So get used to it. But it would be great if you could get an alert outside of that normalcy so that you can actually know when you should be concerned with this device. So then we can actually go through here and, when we hit next, we decide if we want to do the thresholds or not. It will save your changes and, like you said, it’s creating a template. We can add the template name here, and then we can assign it to more than one node, and confirm and send it out.
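The difference between a fixed threshold and a threshold calculated per box can be shown with a short example. This is a sketch under assumptions: the 5% margin comes from the discussion above, while the node names, readings, and the use of a plain average as the baseline are made up for illustration and are not how any product computes its thresholds.

```python
from statistics import mean
from typing import Dict, List


def threshold_from_baseline(history: List[float],
                            percent_over: float = 5.0) -> float:
    """Alert level set a fixed percentage above this node's own normal."""
    return mean(history) * (1 + percent_over / 100)


# Hypothetical nodes: one normally sits near 80, the other near 12.
history_by_node: Dict[str, List[float]] = {
    "box-a": [80, 79, 81, 80],
    "box-b": [12, 11, 13, 12],
}
latest = {"box-a": 82.0, "box-b": 14.0}

alerts = [node for node, hist in history_by_node.items()
          if latest[node] > threshold_from_baseline(hist)]
print(alerts)
```

A single static threshold of 80 would alert on `box-a`, which is behaving normally, and stay silent on `box-b`, which is the one that moved. Note that for a metric capped at 100%, a percentage margin above a very high baseline may be unreachable, so a real implementation would need a different rule for those boxes.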
Perfect. So, the next thing—and this isn’t going to take a ton of time—is Windows Event Log. Now, again, if you’ve been using Windows for any length of time, you probably are familiar with it. We’ve got regular old event viewer.
However, it’s always been kind of hard to manage those and monitor them.
Well, monitoring has had a difficult time getting into it, but with the magic of WMI, we’re able to get in there. Again, here’s event viewer. We’re pretty familiar with the application, security, and system event logs. The idea here is that you can pick up these messages, parse through them, and only alert on the ones that match the particular message, number, category, and so on and so forth.
So if I get more than three in, I can say, well, maybe I should actually alert on that?
Okay, so we’ll get into that. That’s where things get, I would say useful, because just alerting on one event log message isn’t necessarily what people want to do. So here, I want to point out that— there we go— we can monitor the event log messages as an object. And what we have right here is just standing by itself. The event log messages from that same machine. You can see that I have just this one, because that’s the one I wanted to alert on. I’ve got a little thing to capture it. Not so hard. If I look at the node details page down a little bit lower under events— again, if you’re collecting events, they’re there. But the thing I want to show is actually in the template, in the monitor, because it’s just not enough to say, I want this particular event log, I want them in a particular way. So here we have just a test template set up. I have one component event log monitor and the items that I want to point out: here, what credential are you using? Okay, fine, we’re using WMI to pull it, collect it. Which of the log monitors, what definition? Anything that generates a match, and so on. However, if you go from catch anything to custom, then you get into some really interesting options. For example, which ID do you want? I only want specific IDs and you can do a comma separated: 4005, 2171, whichever ones you want to catch. You can also parse for specific keywords. I’m looking for a login authentication error. But I only want it when the username is admin. I don’t care if Fred or Mary logs in badly, you don’t care about that.
So pretty much what you’re doing is, instead of having an alert for 5,000 4005 errors in your event log, you’re just wanting specific ones. Which means that I’m not getting 5,000 email alerts or 5,000 alerts, or my whole page just has an event log that’s filled with a basic event.
Exactly, and the other thing that you can do—part of the problem with event log monitoring in the past is, how do you know it’s all better? Or, what happens if it’s not all better? I’ve gotten a few, it’s been quiet, I get a few more. Well, this option down here, the number of past poll intervals to search for events will search back across one and a half, two, three, four polling cycles and look for those, so that if I’ve missed it, it’s there. But also, when you get past that, it resolves itself again so that if you get 500 messages in a row, you get one alert, not a bunch.
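The filtering and look-back behavior described above can be sketched as follows. This is an illustration only, not the monitor’s actual implementation: the `Event` type, the function names, and the default of three past intervals are assumptions. The event IDs and the admin keyword are the ones mentioned in the discussion.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass
class Event:
    event_id: int
    message: str
    age_seconds: float   # how long ago the event was logged


def matching_events(events: Iterable[Event], ids: Set[int], keyword: str,
                    poll_interval: float, past_intervals: int) -> List[Event]:
    """Keep events with a wanted ID and keyword, logged inside the window."""
    window = poll_interval * past_intervals
    return [e for e in events
            if e.event_id in ids
            and keyword.lower() in e.message.lower()
            and e.age_seconds <= window]


def monitor_status(events: Iterable[Event], ids: Set[int], keyword: str,
                   poll_interval: float = 300,
                   past_intervals: int = 3) -> str:
    """One status for the whole window, however many events matched."""
    hits = matching_events(events, ids, keyword, poll_interval, past_intervals)
    return "down" if hits else "up"


wanted = {4005, 2171}
burst = [Event(4005, "Logon failure for user admin", 30.0) for _ in range(500)]
stale = [Event(4005, "Logon failure for user admin", 3600.0)]
other = [Event(4005, "Logon failure for user fred", 30.0)]
print(monitor_status(burst, wanted, "admin"),
      monitor_status(stale, wanted, "admin"),
      monitor_status(other, wanted, "admin"))
```

Because the result is a single status rather than one notification per message, 500 matching events produce one alert, and the status returns to normal once the matches age out of the window.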
Which is huge, especially since when you have a huge enterprise monitoring software system, you’re going to have a lot of alerts. So when you can actually pinpoint them down, instead of just up down, which a lot of people still leave it to. But you can pinpoint it down to, “This is serious, we need to do something about it.” Or, “This is not serious, so we’re just going to let it go.” When you do this, it actually helps you save time. You’re also not creating all that white noise that everybody talks about. And so, an alert becomes an alert again. Because currently, all I hear is people talking about yeah, an alert went off. Well, because they’re so normal, it’s not an alert when it’s normal.
Right, right, it’s just noise.
So we need to stop that.
Yeah, exactly. Now, talking about Windows Event Log leads me to talk about just log file monitoring in general. And I always have a reaction when I’m with a customer and they say log file monitoring. My first thought in the back of my head is that word; you keep using that word. I do not think it means what you think it means.
There’s so many different things that that can mean, it’s just like there’s a plethora behind that.
Right, right, so, I want to run this down. Okay, so when they talk about log file monitoring, it could be Windows Event Log, which we just covered, right?
It could be Syslog.
Which we covered in the monitoring 101 session, right?
Could be log aggregation.
Log file aggregation is its own little beastie. And just briefly, what we’re talking about is where you’re collecting log files. Text log files, Windows log files, syslog, trap, into a central location so that you can then dumpster dive through them, looking for trends. It’s not that you’re looking for one login authentication error on one machine. You’re looking for the same account hitting multiple boxes, but you’re only able to see that if you aggregate everything together. They could mean that. Or they could mean simple, I got a text file, I’m looking for the word error inside of it, right? That’s the one that I want to look at now. But when you hear the word log file monitoring, you need to make sure that the person who’s talking to you is specific. Because otherwise, you’re going to run down a rabbit hole that you just don’t need to run down.
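The aggregation case above, the same account hitting multiple boxes, only becomes visible once records from every machine sit in one place. A minimal sketch, assuming the logs have already been collected and reduced to (host, account, outcome) tuples; the record format, the `login_failed` label, and the host names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, str, str]   # (host, account, outcome)


def accounts_failing_on_many_hosts(records: Iterable[Record],
                                   min_hosts: int = 3) -> Dict[str, List[str]]:
    """Find accounts with failed logins on at least `min_hosts` machines."""
    hosts_by_account = defaultdict(set)
    for host, account, outcome in records:
        if outcome == "login_failed":
            hosts_by_account[account].add(host)
    return {account: sorted(hosts)
            for account, hosts in hosts_by_account.items()
            if len(hosts) >= min_hosts}


records = [
    ("web01", "svc_backup", "login_failed"),
    ("web02", "svc_backup", "login_failed"),
    ("db01", "svc_backup", "login_failed"),
    ("web01", "mary", "login_failed"),
    ("web01", "mary", "login_ok"),
]
print(accounts_failing_on_many_hosts(records))
```

Looking at any one host’s log on its own, each of these failures would look like an isolated event.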
Now, here’s the interesting thing. SolarWinds, as a tool, doesn’t have a built-in separate text log file monitoring technique.
But as customizable as we are, and with the THWACK community, there’s several options that you can use. You can either create one, or you can download one from THWACK.
Precisely, so that’s what I want to do now. I just want to show everyone quickly that you can go into the template area in SAM. There’s a tab specifically that you can search THWACK for. And I’m just going to look for log file and search. And look at that. As with almost every category, aLTeReGo has one. He has written quite a number of things for different areas. So this log file parser, log file monitor for using PowerShell, using Perl, using Visual, and so on and so forth.
But then another reason why he put parser there is because he understands, too, that log monitoring can take on several different forms. So when he says parser he’s like, we’re parsing through the log file here.
So he’s also helping you there.
Exactly. So I’ve got it over here. I’ve already downloaded it, and I just want to take a quick look. Just to show the customizability of it, and also the kinds of options that you might want to have. For example, I might want to look through a log and find the total number of strings found. Here, I would indicate what the name of the text file is that I’m looking through. In this, it’s PowerTest. But you put your text file and your drive letter there. You’d also put in the string that you’re looking for. The word total says to give me the total number of strings. And then there’s a logic element there to say whether to error or not, or whether it’s found or not found and so on and so forth. So this is all scripted. This is something that, again, aLTeReGo just wrote the code for. But the goal is log file monitoring. The goal is to take this text file, look through it for the word error or whatever. And then give me a message when that number of errors is greater than five, for example.
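The general technique, counting matches in a text file and comparing the total to a limit, can be sketched in Python. This is not the THWACK template discussed above, which is written in other scripting languages; the function names, the `ignore` parameter, and the sample lines are assumptions for illustration. The sketch counts matching lines rather than individual occurrences.

```python
from typing import Iterable, Sequence


def count_matches(lines: Iterable[str], needle: str,
                  ignore: Sequence[str] = ()) -> int:
    """Count lines containing `needle`, skipping lines with ignored phrases."""
    needle = needle.lower()
    total = 0
    for line in lines:
        text = line.lower()
        if needle in text and not any(p.lower() in text for p in ignore):
            total += 1
    return total


def status(total: int, limit: int = 5) -> str:
    """Report 'down' only when the count is greater than the limit."""
    return "down" if total > limit else "up"


# Made-up log lines; a real run would iterate over an open file instead.
log_lines = ["Job started", "ERROR: disk full"] * 6 + ["No error found"]
raw = count_matches(log_lines, "error")
filtered = count_matches(log_lines, "error", ignore=["no error found"])
print(raw, filtered, status(filtered))
```

A plain substring search also matches harmless lines that happen to contain the word, which is why the sketch accepts a list of phrases to ignore.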
And that goes back to the whole filtering down on what’s important for alerting. When you’re able to get this customizable and actually get into well, if this starts ‘erroring’ here, I know this is going to happen later. So if I preemptively look for an error code or something that’s going on within a text file. Even if it’s a log file, like he says, from a service. Even if it was with SolarWinds service, anything that way. If you’re looking for pinpointed information and you know what the outcome is, you can prevent it from having a disastrous outcome.
I mean, there’s simple ways that you can actually drill into these and get the information you need—only. You don’t need to know that there’s a log file. You don’t need to know that there’s only just a couple of words in there, or warnings, or just a lot of text, and bring it back. You need to know there’s an error and the specific reason.
Right, and one cautionary tale: I keep on saying, if you find the word error. I just want to remind everybody that when you’re doing any kind of monitoring, make sure you know exactly what your message is, because I have been there. The customer has asked, “Give me an alert whenever it says error,” not realizing that there’s another message that says “no error found.” And they got an alert every time that line came up. So you want to make sure you know what you’re looking for and that the tool is flexible enough that you’re able to get what you need. [SOUND] What’s amazing to me is that all of this, what we covered today, plus the last episode, is still just scratching the surface of what you’re able to do.
Yeah, it’s still not even everything that’s in the Monitoring 101 e-book.
Right, well, but I wrote that to be really comprehensive.
No, I just meant, you talk a lot.
Right, well, let’s hope everyone out there likes to talk a lot too, and is asking questions and making more suggestions over in the chat window. Which reminds me, I’d like to remind you that you need to…
I’m pretty sure that they got it the first time. Come to lab.solarwinds.com, sign up, and hang out with geeks like us. So with no further waffling, I’m Destiny Bertucci.
And I’m Leon Adato. Thanks for watching SolarWinds Lab.