Sure, Windows is the way of the world in the data center these days, but there are plenty of shops that have a solid number of Linux and UNIX systems as well. There are even data centers where Microsoft fears to go. Does SolarWinds have anything to offer for servers running a Tux-based OS? Of course! Join Head Geeks Patrick Hubbard, Leon Adato, and Kong Yang, as they dig into the wealth of monitoring features built with Linux in mind.
First, they will look at features like real-time process monitors and alerts that grab the top-10 processes, which work as well on Linux as they do on the Windows systems that made the feature popular. From there, it’s a deep dive into the custom poller and ways it can enhance SolarWinds’ view of UNIX and Linux systems. Finally, Leon revisits his first SolarWinds Lab episode to revise the “Ultimate CPU Alert.”
All right, you guys ready to get this Linux show finally started?
That’s a Raspberry Pi, isn’t it?
It is. I took it off the coffee maker this morning. My wife was kind of upset, but it was only ten bucks more than your shirt.
Okay, all right.
You’ve been waiting for this, haven’t you? Yeah, and more than that, I think you guys have been waiting for this based on your feedback. If you come by our homepage, which of course is lab.solarwinds.com, where you can sign up for reminders about upcoming episodes, you’ll also notice lots and lots of comments and suggestions for upcoming episodes, and one of the things they have been asking about literally for years is a whole episode dedicated to Linux.
Mm-hmm, so I guess I’ll get things started. I’m Ubuntu Man! Uh, I mean, I’m Leon Adato.
I’m Kong Yang.
I’m Patrick Hubbard, and welcome again to SolarWinds Lab. I’m always really surprised when people say, “Why doesn’t SolarWinds support Linux?” And there’s a really huge misconception that we’re entirely Windows based, when the reality is that more than 60% of our products, in fact, run on *nix. For instance, the SAM hardware health monitor supports Linux, as well as NCM and Asset Manager, and inventory, and I mean, that thing will even go do AIX.
I know. Virtualization Manager runs on CentOS, with a PostgreSQL database and Tomcat web servers. Storage Manager, Web Help Desk, and Alert Central all run on Linux OS.
That’s right, and I would of course be remiss if I didn’t point out that the SWIS SDK has a bunch of Linux libraries pre-built. And you guys know that REST + CRUD means no PowerShell.
Right, and we also have the metric asterisk-load of Linux and Unix applications for SAM that are capable of monitoring virtually any major application running on those servers. So the list includes: Cassandra, NFS File Share Availability, Apache ActiveMQ, Nginx, RabbitMQ, OpenVPN, Samba, MySQL, JBoss Application Server, ClamAV, Bind, GlassFish, MongoDB, GroupWise, Postfix, PostgreSQL, Squid, Tomcat. [Gasps for breath]
And if you sign up for the radio club, you get that for a dollar.
None of that runs on Windows. And they’re all built into SAM. And I think you missed a few on the screen.
Well, right, but we have a lot to cover in this episode, so I’m just trying to pace myself.
Yeah, well, let’s do try to do that, because you went like 22 minutes on the NPM alerting episode. And let’s make sure that we do that, so do be concise.
Okay, how about being non-verbose?
I think you’re in good hands with these two, so I’m going to finish tweaking my next show on monitoring essentials, and, I’m out of here.
Leon, do you mind if I take Patrick’s rant on Linux a little further?
Absolutely, be my guest.
I want to point out just two areas to show how Linux monitoring is equal to what we do for Windows. Process Explorer and the Top Ten processes alert.
All right Leon, so let’s jump onto the system that we’re monitoring. It’s a DHCP, and if you go into management, Real-Time Process Explorer. Click it. Here is the Real-Time Process Explorer, and it looks a lot like Task Manager. So, folks in Windows realm, this is like your Task Manager, but presented to you in the Orion UI.
Right. And you can create new monitors right from here, and it’s available for Windows nodes, as long as you’re doing WMI monitoring of some kind.
Now, let’s hop on to a Linux system and show the parallel. So here we have a SUSE box, and you brought up a great point: the fact that the Process Explorer is not in the management piece, that’s probably where a lot of confusion comes into play.
And so, in order to see that, you really have to scroll down and go into the applications tab, click on your application there, and voila. Within applications detail, there is your real-time Process Explorer.
Right, so, what you’re saying is that as long as you’re monitoring an application, whether it’s through SNMP or whatever, that’s when you get the real-time Process Explorer that is going to have the same functionality as the Windows piece.
Exactly. What SolarWinds and the Orion Platform have done is essentially take Task Manager from the Windows realm and normalize it, and take top from the Linux realm and normalize it, and you get this view.
And once again, you can take this and you can create new monitors off of it. You can see how it’s running, the CPU, and so on.
Exactly, and I love that the feature normalizes the creation of monitors and all the actions that are available to any admin. It’s just wonderful. Another feature that normalizes monitoring between the Windows realm and the Linux realm for the IT admin is the Top Ten processes.
You mean getting the Top Ten processes in an alert, like an email, or something like that, where it tells you that this happened and here are the top processes during that time.
Exactly, and it covers all the layers—the compute, the memory, the disk, and the network layers, so all those processes are shown up and rolled up, so that you can quickly see and ascertain the health of your environment.
Right, so you can do that in Linux, also.
Yes, it should absolutely work in the Linux environment the way you would expect it to for Windows. Within that email, and we should see it… [Digital chime]
A chart, excellent! We love our charts.
You should be able to see your processes, and their associated high CPU utilization, or high memory utilization, whatever you key off of for those processes.
Got it. So the one they’re looking at now is the one for Windows, and what would it look like for Linux—do we have one of those?
We do. Boom.
Boom! [Laughs] There it is, okay!
And as the folks can see, it’s similar to what the Windows piece had shown us for the top ten processes. So really, we’ve normalized that experience for these top ten processes, whether you’re Windows or Linux.
Excellent, and the one thing I do want to point out about getting the top ten processes is that it actually is completely monitoring independent. With the Process Explorer, you needed to have a connection to monitoring some kind of processes on the box first. Again, for Windows, it was WMI, one way or the other. For Linux, you had to be monitoring an application. Here, you just need basic monitoring through Server & Application Monitor, SAM, and you have this capability right there. Okay, so I think you’ve driven the point home. SolarWinds tools are built to support Linux as much as they are for Windows.
Or even Apple, for that matter.
Oh, you mean a thinly veiled Linux distro?
I’ll just stand over here while people throw things at the screen and Google your home address.
All right, fine, fine, I will take on only one monolithic software giant at a time.
Where do you want to go from here, Leon?
Well, a couple of episodes ago, Rob Hawk and I touched on a couple of tricks you can do with the new custom poller feature, so I want to dig into just a little bit more of those.
Leon, you have the conn.
Both of these are going to be custom poller changes. So we’re going to go to settings, where all the great administrative tools are, manage pollers, and from here, we’re going to create a new poller. But it’s not just a new poller to get some kind of new value or whatever. We’re actually going to replace the value. This is the “Linux ate my RAM” value. And for those people who aren’t sure exactly what I’m talking about, there’s an argument between some sets of sysadmins and other sets of sysadmins about what the proper RAM value is. When you look at top or free, you can get different numbers, and SolarWinds grabs a fairly complicated stream of calculations that includes free memory, user memory, and a whole bunch of other stuff. Some sysadmins want a much simpler calculation, which is just free memory divided by total memory to give me the percentage. So we’re going to do the simpler version of this. We’re going to create a new poller, and most people get about this far, and then things start to break down. The first thing that we want to say is that this is a CPU & Memory alert. We’re going to call it ‘linuxatemyram.’ And then in order to do this, you have to have a test node, so I’m going to pick this Linux node. This is special, and we’re going to know why it’s special in a little bit here, with the next demo. I could put some tags on; I could put a description. I’m not going to worry about that here. The next thing I want to do is set some data sources. So the first thing I want to do is set my used memory data source. This is where all the magic is going to happen. Now again, people get about this far, and then they get a little cranky about things, because they think that, oh, I don’t have an OID, and I don’t know which one it is, maybe I want one that isn’t listed. So, in this case, don’t get panicky. You will have to know some object IDs, but you just look them up. You can get them from Wireshark.
If you want to see what a system exposes, you can do an SNMP walk, so you can find them. In this case, I happen to know the OID that I’m looking for, and it is .1.3.6.1.4.1.2021.4.5
Yeah, right, I have that one memorized, too.
Yeah, like everyone knows that one. No results, because it doesn’t like the leading period. I always forget that. There we go. I want memTotalReal. So I’m going to add that one in, submit. There I have memTotalReal; I don’t actually want to use that one. I want to add another value. Select an additional OID: 1.3.6.1.4.1.2021.4.6 this time, which is memAvailReal. Submit. So now I have two OIDs, which again, some people don’t realize they can keep on getting them, and I want one more, which is going to be a calculated value. Because I want a percentage, expressed as a decimal, the calculated value is going to be one minus, in parentheses, memAvailReal divided by memTotalReal. Those two are the values that I got; submit. This is the value I want to use. What it allows me to do (yes, the data source is reasonable) is re-map the used memory. Now every chart, every graph, every little speedometer thing, is going to use this calculated value for the actual memory values instead of the built-in SolarWinds one. So that’s something that’s been around since version eleven, or even before that, but a lot of people weren’t real clear about it. And for Linux SysAdmins, it means that if you have a very particular calculation, you can get it in SolarWinds as the real data. You don’t need a separate box or anything like that. So that’s the first use of this.
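The calculated value described here is simple arithmetic, and a short Python sketch may make it easier to follow. This only illustrates the formula, one minus memAvailReal divided by memTotalReal; it is not how the poller itself evaluates it. The function name `used_memory_fraction` and the sample numbers are made up for the example.

```python
def used_memory_fraction(mem_avail_real: float, mem_total_real: float) -> float:
    """Return used memory as a decimal fraction: 1 - (memAvailReal / memTotalReal)."""
    if mem_total_real <= 0:
        raise ValueError("memTotalReal must be positive")
    return 1 - (mem_avail_real / mem_total_real)


# Hypothetical sample values: 2,048,000 kB available out of 8,192,000 kB total.
fraction = used_memory_fraction(2_048_000, 8_192_000)
print(f"{fraction:.0%} used")  # prints "75% used"
```

Both OIDs report the same unit, so the units cancel and the result is a plain ratio between 0 and 1.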
Excellent, that’s very powerful, to be able to customize your calculations for your specific need.
Right. Now I’m going to back out of here. The next thing I want to do is a little bit trickier. The challenge here is that my Linux boxes all show up as net-snmp, and I want them to show up as Ubuntu, or SUSE, or whatever. So this takes a little bit more setup. So the first thing I need to do is actually get into one of my Linux boxes and just show you what I did. I created a script that will respond with the operating system name in some way. It could be as simple as hard coding “I’m SUSE!” or whatever. I did something a little bit more sophisticated, if you want to take a look. I have this script called osname.sh. It uses the lsb_release command and does some awk to take some of the pieces out of it, and when I execute it, /usr/bin/osname.sh, it responds with SUSE Enterprise Server 11. But as long as my script works consistently across the environment, and it’s always named osname.sh, I can use this on my other boxes: my CentOS boxes, or my AIX boxes or whatever, as long as that script responds similarly in each case. Once I had the script working, which is no mean feat, then I had to update my SNMP agent to do that. Here’s what I mean by that. If you take a look at /etc/snmp/snmpd.conf, you can see that over here I have an extend line: extend, the word osname, /usr/bin/osname.sh. Okay, so, if I do an SNMP get on the osname value, it will respond by running the osname script, which means it’s going to respond by saying SUSE Enterprise Server 11, or whatever it’s going to say. Then, and I’m not going to do this one here, I have to do an SNMP walk on the box to find out what the object ID is. Once I have it, then I can go back here, and you can see, I’ve already set it up as the Linux OS name. Now, I want to edit this.
And this is something, again, with the manage pollers that people get stuck on, which is, “How come I can’t edit it, I can only duplicate?” Well, that’s because we have a node assigned, and it’s using it, so you can’t edit it midway, and start giving it different values. What you need to do (I’m going to open another tab here, because I love my tabs) is unmanage the node; whether you do it for one or a bunch, you have to unmanage it. Go back here, refresh the screen, now I can edit to see what it is that I did. All right. What I have here is I have to pick my test node. That one. So again, just like we saw with the “Linux ate my RAM,” you have the name, you have the title, you have all that stuff. Next, and here there’s a couple of things I need to do. The first one is the machine type, which is really what I wanted. Notice that it is pulling the OS name, which is really this OID: 1.3.6.1.4.1, which everyone should know is the normal start, then 8072, and then (I found that out with my SNMP walk) 1.3.2.3.1.1. Okay, you’re going to have to dig through your environment. You’re going to have to find out what that number is. It may take a little bit of cajoling, but you can get it, and it responds with SUSE Enterprise Server 11. So that object ID now responds with the OS name. Just cancel to go back. The other thing you want to change is the vendor. Because once you change the machine type, it doesn’t know what vendor it is. So in this case, I actually just hard coded the vendor formula to say Linux. Now, I could say anything else. Remember, my goal is to use this OID everywhere: to use it on my Ubuntu boxes, on my SUSE boxes, on my Mint boxes, all the rest of them. So I don’t want to say anything too specific, but in this case, I’m going to say Linux, all right. That’s really all you have to change, those two elements. Yeah, I went into the SysObjectID and I kind of played around with it, but that’s not necessary.
It’s those two, and when you change that, what you see (I’m just going to go into Settings, Manage Nodes) is this. If I look at Vendor, notice that I have one Linux node. That’s coming from the vendor, and it even picks the right icon. I did not do that. It automatically found that, which I think is wonderful. But, if I go to machine type, which really was the key thing, you’ll see that I have a SUSE Enterprise Server 11 machine type. So again, just to summarize, what I’m able to do is put a script on all of my Linux boxes that responds with the OS, whether it’s hard coded or not. I update my SNMP configuration so that it executes that same script, which means you can use the same SNMP configuration across the board, assuming that your SNMP configurations are standardized. Then, at that point, I can create a custom object to pull that value. And now, instead of having to create custom new groups, or having to hard code it into a custom value when I’m on-boarding new servers, or whatever it is, it will automatically pick that up. And you can do that for a variety of the system information about the box, all by utilizing what’s built into Linux, which is the ability to run a script when SNMP requests a certain object ID.
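For anyone following along, here is a rough Python stand-in for the two pieces of that demo. It is not the osname.sh script from the episode, which used lsb_release and awk; it is a sketch that assumes `lsb_release -d` prints a line of the form `Description:` followed by the OS name. The second helper shows how net-snmp names the OID for the first output line of an `extend` entry (nsExtendOutput1Line in NET-SNMP-EXTEND-MIB): the base OID, then the length of the extend name, then the ASCII code of each character. The function names and the sample string are hypothetical.

```python
import subprocess

# Base OID of nsExtendOutput1Line in NET-SNMP-EXTEND-MIB.
NS_EXTEND_OUTPUT_1_LINE = "1.3.6.1.4.1.8072.1.3.2.3.1.1"


def parse_lsb_description(lsb_output: str) -> str:
    """Pull the OS name out of `lsb_release -d` style output."""
    for line in lsb_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Description":
            return value.strip()
    raise ValueError("no Description line found")


def os_name() -> str:
    """Run lsb_release and return its description (needs lsb_release installed)."""
    result = subprocess.run(
        ["lsb_release", "-d"], capture_output=True, text=True, check=True
    )
    return parse_lsb_description(result.stdout)


def extend_oid(extend_name: str) -> str:
    """Build the OID for the first output line of an snmpd `extend` entry.

    The table index is the name's length followed by its ASCII character codes.
    """
    codes = [str(ord(ch)) for ch in extend_name]
    return ".".join([NS_EXTEND_OUTPUT_1_LINE, str(len(extend_name)), *codes])


# Hypothetical sample of what lsb_release -d might print on a SUSE box.
sample = "Description:\tSUSE Linux Enterprise Server 11 (x86_64)"
print(parse_lsb_description(sample))
print(extend_oid("osname"))
```

If the agent behaves as assumed, the second helper gives the same object ID an SNMP walk would turn up for an extend entry named osname, which is the value that goes into the machine type poller.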
Excellent, very powerful, to take all that custom info, you know, create that, present the data, and visualize it into one UI.
It’s really amazing how remapping custom OIDs, which was really just an enhancement of the existing universal device poller feature, can have an impact on the data center.
Yeah, it does. There’s a lot of unsung or under-sung features like that lying around the toolset. Now, I know you’ve been really a stickler about keeping things tight as far as show times, but I think I have time to squeeze in one more demo.
Sudo, make it so. Number one.
So, for some people that have been around for a while, back in episode 18, which was the first episode I appeared in, I talked about the ultimate CPU alert. And there, what I mentioned was that high CPU, by itself, is meaningless to almost every sysadmin worth their salt. They don’t care about high CPU by itself, because if the box shows high CPU but is running steady in terms of processing jobs, that’s called “I correctly sized that sucker,” you know that from the virtualization world. It’s only when you have more things waiting to be processed by the CPU than you have CPUs, and CPU is high, that we’re in a critical situation where we’re just not able to keep up with the list, right?
Okay, so, with the Windows version of that, there’s a PerfMon counter that’s called the processor queue. You can see what’s in the processor queue, you do a little bit of SQL magic in SolarWinds to find out how many CPUs the box has, you compare those two numbers, and you get your high CPU. You compare all three of those, and you’re good. Same thing for Linux, except in this case, with Linux, the thing that you need is the load average. Load average is the number of jobs waiting to be processed by the CPU. So the first stop on this train is actually back on the server, back on the polling engine, in good old universal device poller, like you mentioned, and we have two things that we need. The first one is the load average 15-minute value. I typically take the 15. You can also take the one or the five, and that’s this object ID here. I know we’ll all memorize it: 1.3.6.1.4.1.2021.10.1.5.3. The .3 is the 15-minute average, .1 is the one-minute average, .2 is the five-minute average. And for those people who are scribbling furiously, this is all on THWACK. You can actually look up the Ultimate CPU Alert. It will be linked in the show notes from this episode once it posts on YouTube and elsewhere. But anyway, this is the OID. Once you have this OID on and assigned to a number of boxes, which you can see I have assigned down here, then you also do a transform. Now, I want to play with this transform a little bit. There’s not much to it. All it’s doing is taking the 15-minute load average and dividing by 100, so we get a decimal value rather than the un-decimaled version of it. That’s what that’s doing. So we get 1.45 as opposed to 145. So that’s it. Once I have this on, and assigned, I know how many jobs are waiting to be processed in the queue. And I’m good to go from there. The last bit, honestly, is to go to my alerts, and you can see I already have it set up here, the Ultimate CPU Alert, and it takes a little bit of SQL queries.
Now, if you’ve done the Windows version, you have almost everything done for you already. It’s almost ready to go. Here is—it’s just a little bit of SQL, it’s not…
Just a little bit.
It’s not that bad. What’s going on here is that I’m joining the custom poller status, because I need to know what the status or the rate of the poller is. And I’m joining my nodes table in here. This block, this inner join, to C1 is getting the number of CPUs. I’m taking the CPUMultiLoad table, and querying it, and finding out just how many CPUs are listed, getting a count, four, five, ten, whatever it is. So then, I have where the poller name is the average of 15 minutes, that’s what we did in the universal device poller. CPU load is greater than, okay, I said two because I was testing. You’d want that to be 95 or something real, and then also, and again, I was testing this, custom poller status rate, that load average rate, is greater than or equal to the number of CPUs, the CPU count. Once I have this, again, I know that I’m looking for when the number of jobs waiting for CPU is larger than the number of CPUs, and the CPU load itself is greater than, let’s say, 95, and at that point, I’ve got my alert.
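The trigger logic is easier to see outside of SQL. The Python sketch below is not the alert query from the episode; it only restates the two steps described: divide the integer load value by 100, then alert when CPU load is above a threshold and the load average is at least the CPU count. The function names are hypothetical, and the 95 default is the "something real" threshold suggested in the demo.

```python
def load_average_from_la_load_int(la_load_int: int) -> float:
    """Apply the transform from the demo: divide the integer load value by 100."""
    return la_load_int / 100


def should_alert(cpu_load_percent: float, load_average: float,
                 cpu_count: int, cpu_threshold: float = 95.0) -> bool:
    """True when CPU is high AND at least as many jobs are queued as there are CPUs."""
    return cpu_load_percent > cpu_threshold and load_average >= cpu_count


# The transform turns 145 into 1.45, as in the demo.
print(load_average_from_la_load_int(145))

# Hypothetical 4-CPU box at 97% CPU: a queue of 4.2 alerts, a queue of 1.45 does not.
print(should_alert(cpu_load_percent=97, load_average=4.2, cpu_count=4))
print(should_alert(cpu_load_percent=97, load_average=1.45, cpu_count=4))
```

Requiring both conditions is the point of the alert: a busy box that is keeping up with its queue stays quiet, and so does a long queue on a box with CPU to spare.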
Now, Leon, that’s a good balance, right? A good balance of high value, right, that exceeds your capacity for CPU pieces, but it’s over a long enough period of time that it should concern you as an admin, versus just a spike, which may go down, and so forth.
Right, and here you can see the email that I would get. Again, it was a test, so the values are a little bit weird. The CPU load is 2%, the number of CPUs on the box in question is 16, the load average is .01. So, again, because I had reversed the values, it is less than the number of CPUs, and is under 2%, but once you have your greater-thans/less-thans there, you can see that this is really what you want to trigger on. And you’re able to get that.
So, did you guys cover everything, all things Linux?
Now, Patrick, what’s our target for episode length?
We’re trying to go 25 minutes, but we’ve kind of been bursting over that.
Right, and even as fast as I can talk, there’s no way we can cover everything about Linux in that time.
So I guess what you’re saying is that they can look forward to more Linux-y goodness in the future.
Every chance I get.
Okay, so I should ask this, then: do you think you’ve covered everything you can for now?
Well, I think it’s definitely been a good start.
So Kong, why don’t you take us back to the root directory.
You bet. I’m Kong Yang.
I’m Leon Adato.
And I’m Patrick Hubbard, and thanks again for watching SolarWinds Lab, shutdown 00:20.