In this episode, Head Geeks Thomas LaRock and Leon Adato discuss the evolution of life as a system administrator, how simplistic alerts are not as helpful as you might hope, and how PerfStack will give you the insight you need to find the root cause of performance issues.
Looking at the kids’ baby pictures?
Wait, wait—you’ve got a photo album for your monitoring setups?
Well, not really monitoring setups, but pictures of me when I was a young and naive monitoring engineer. You know, back in the days when we all thought that monitoring a single metric was going to be good enough.
I remember those days. So young, so carefree.
Lasted what, about two hours?
Oh, no, 17 minutes, 43 seconds for me.
And, you know, on Lab, we really haven’t spent a lot of time talking about really robust monitoring.
How can you even say that? We’ve gone through NetPath and AppStack and AWS Integration and PerfStack.
True, true, true. Well, I mean, we don’t really talk about getting past those first simple baby steps into meaningful monitors and actionable alerts.
Okay, so you mean like not just going for “the CPU is over 80%” kind of thing.
Right, exactly. So let’s get that started. I’m Thomas LaRock.
And I’m Leon Adato, and welcome to the SolarWinds Lab that we’re calling Monitoring 201: Beyond the…
Beyond the Basics.
Beyond the Basics, yeah.
Last year in our Monitoring 101 series, we took you through the introductory material that every monitoring engineer…
And manager, and the people who enjoy the benefits of monitoring.
Fair enough. And what they all need to know. But today, we’re going to talk about going from those first steps and moving to monitoring alerts that are actionable and that are going to make a difference in your enterprise.
Now, like our Monitoring 101 series, we’ve got an eBook to go along with this information, which we’re going to link in the show notes.
Meanwhile, if you have questions, feel free to ask us in the chat window you see over here. Now, if you don’t see a chat window, don’t panic. That just means you aren’t watching us live. And of course, you want to watch us live.
Of course, it’s always better live.
And so head over to lab.solarwinds.com and sign up for reminders on the upcoming episodes.
So I think it’s important, before we dive right into the technology, to really lay out the problem. I mean, SolarWinds provides tools that let you monitor pretty much any metric that you want. So what’s the issue about monitoring or alerting on a single metric? Why is that an issue?
Because if you’re only looking at, and focused on, that one single metric, you don’t get any of the context that might be around that metric. So, for example, the CPU is greater than 80%.
Is that a problem or not? Well, if it’s a virtual machine, you might actually want a high percentage of CPU usage. And why is it 80%? Is it one SQL query that’s gone bad, or is it maybe four or five apps are all running at once?
There’s so much information underneath just that one metric.
And I think that as people are new monitoring engineers, they tend to look for that one silver bullet. You know, you’re digging through the Microsoft documentation and TechNet and you’re looking at KPIs and the published key indicators and you’re saying, I’m going to find the one thing, and I’m going to monitor that one thing, and that’s going to drive you to know when something is wrong. And the reality is that, for people who are more experienced, it just doesn’t work like that. So using your CPU as an example, I actually pulled up something just to illustrate that. What you’re looking at here, I pulled in, and I did a report possibly the worst way, because I’m not stacking the CPU information all together. But what I did was, I took a router here, and that has CPU that’s pretty low. Okay, fine. I’ve got a server here. That’s got CPU that’s, hmm, middling. I’ve got another box here, a Linux box running SUSE that, you know, is okay, fine. This box is doing nothing. This box, oh, wow, it’s up at a hundred— but there is no reason to suspect that this is bad.
No, not yet.
You know, we don’t know exactly why. And we’re going to talk later about how you know what’s wrong and why you know what’s wrong, but I just wanted to point out that when you start to look at these charts and think, well, would I wake someone up at this point if it was two o’clock in the morning? Would I tell them, oh my gosh, you have to go get this? Well, it depends, because if I look a little further on, it went really high, and then it goes low again. And then if I look further on yet again—okay, well that’s high, but again, that may be perfectly normal. You don’t know. You don’t have, like you said, you don’t have that context.
So that’s what I meant. We were talking before about some database examples that are similar. I didn’t know if you wanted to show…
Oh, I do. I’ll show those right now. So yeah, some database metrics to look at. Here’s one of the most common ones that you’ll come across when you’re first getting involved in database administration. What I have here is a picture of DPA. We’re looking at just Monday, March 20, so just happens to be yesterday, midnight-to-midnight view. And what I have down here is I have Buffer Cache Hit Ratio. What does that tell you so far? What was that?
Well, the buffer cache is being really used very well.
And of course, Buffer Cache Hit Ratio just says, are the data pages in-memory, and am I hitting those? Hits versus misses ratio, right? And so it’s very simple, and it says, you know what? I’m finding all the pages I need in-memory. So, if you’re alerting on this as a metric of, say, memory pressure, well, you wouldn’t get any alert for yesterday. Now, another common one though is page life expectancy. So you can see page life expectancy here does what it naturally does. It goes up and it goes down. So page life expectancy is simply how long do I want my pages to stay— how long would you want your pages to stay in-memory?
As long as I need them?
How about forever?
You’d want them forever. So you can see the level of seconds. It gets up to about 30,000, and then there’s a significant drop-off. So all the pages were flushed out of the memory for some reason. Now, is that a problem? I’m not really sure. You can see, over the day, it kind of happened twice. But wait, if all my pages have been flushed out and the page life expectancy is near zero, how could it be that the buffer cache hit ratio is still at 100%? So you can already see. If you’re just focusing on one and you don’t have the context of what else is happening, then you might be missing a lot of information, and that’s when we talk about a meaningful alert or an actionable alert.
What is it I need to do? So down here, what I have is now, this is what we talked about, the advanced accidental administrators, right? Now you learn that actually, what you want is a combination of these metrics. So what I have here is what I call the buffer pool I/O rate, and that is simply a metric that says, how much—how many—what’s my throughput? Am I churning through the pages in-memory? At what level is that? I have it measured in megabytes here. And this is based upon how much RAM is installed in the server, and it’s just how many total pages are going through. And basically, it’s churn. Because I want to know, am I getting a lot of throughput as these other metrics are happening? So you can look and see what happened here. You can see that it had a huge spike, and I made a default that says 28 megs, and I made a default alert of 20, just for no other reason. But you can see that is most likely what caused the page life expectancy to drop, because I had a query that came in and said I need to get more pages, and I’m pushing a lot of stuff through the memory. That’s probably all perfectly normal. This is the way a database is supposed to work. Now you can see over here, the page life expectancy had a bit of a dip, and there wasn’t as much throughput. Now I might want to investigate that one. Now I have more meaning about what’s really happening. Maybe somebody flushed the cache manually, I don’t know, but I’m probably going to want to go look at the activity, because now that seems weird. But again, at the top, Buffer Cache Hit Ratio tells me everything’s fine. Page life expectancy just says it took a dip, but you have no idea why. But now I can see I didn’t really churn through my memory. So let me go look at what might have really happened there. Now there’s some action for me to take. So you need that combination of metrics. If I look at any one of these by themselves, not enough information.
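As a rough illustration of reading these three metrics together, the sketch below labels a page life expectancy drop as "expected" when buffer pool throughput is high at the same time, and as "investigate" when it is not. The class and function names, the 50% drop cutoff, and the units are assumptions made for illustration; the 20 MB churn threshold simply echoes the arbitrary default mentioned above. It is not a description of how DPA calculates anything.

```python
from dataclasses import dataclass


@dataclass
class MemorySample:
    buffer_cache_hit_ratio: float  # percent, 0-100
    page_life_expectancy: float    # seconds
    buffer_pool_io_mb: float       # buffer pool throughput (unit assumed: MB per interval)


def classify(prev: MemorySample, cur: MemorySample,
             ple_drop_fraction: float = 0.5,     # assumed cutoff for "a significant drop"
             churn_threshold_mb: float = 20.0) -> str:
    """Label the current sample using the metrics together, not one at a time."""
    ple_dropped = cur.page_life_expectancy < prev.page_life_expectancy * ple_drop_fraction
    if not ple_dropped:
        return "ok"
    if cur.buffer_pool_io_mb >= churn_threshold_mb:
        # Pages were pushed out by heavy throughput: normal database behavior.
        return "expected"
    # Pages left memory without much throughput: worth a look at the activity.
    return "investigate"
```

Note that the hit ratio is carried in the sample but never drives the decision, which mirrors the point above: it can sit at 100% through both kinds of drop.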
Really have to look at a group.
And I think as the beginner of the accidental DBA, you know, somebody who’s just getting into it, and they got DPA or they have a database-monitoring tool, they’re going to say, again, I’m looking for that one thing. Buffer cache hit ratio. I’m going to alert on that one. And somebody else says, oh no, but I also want to know on page life, and you’re going to be triggering on things that aren’t necessarily, as we said before, actionable.
So what experienced monitoring engineers realize is that the value is in having, and/or, either/or, the correct metric, not just the simple one. You don’t just go for CPU because it’s obvious and it’s shiny and it’s big out there; you look for the thing that’s really going to be the causal factor and, more importantly, for the combination of factors. It is rarely actionable when a single metric has hit a threshold or gone over or whatever it is. It’s usually a combination of things that tells you, oh, now I need to get out of bed at two o’clock in the morning or I need to work, right?
So, you know, the example with CPU that I like to talk about is that people trigger when CPU is high, and that doesn’t matter. If the CPU’s at 85 or 90%, I call that correctly sized. But if you have jobs waiting in the processor queue and that number is greater than the number of CPUs you have, and your CPU is high for a length of time, that’s when you know that the machine isn’t able to keep up.
Yeah, you may have mentioned this once or twice before.
I have mentioned it since my very first appearance on Lab. It’s an oldie, but it’s a goodie.
So but the expert monitoring engineers, they know to take it a step further, because they understand that they’re in this unique position where they need to— it’s not just that single metric, but they have to collect more. They have to be data collectors. They’re basically becoming data scientists.
You’ve been hanging out with Patrick, because that’s his line.
A little bit, but you know.
It’s a good one.
But it’s true that when you hit that advanced stage, the advanced accidental stage even, it’s that you understand the power in the grouping of correlated metrics.
Right, absolutely. So we will get into PerfStack.
PerfStack. [Thomas laughs]
Yes. We will get into PerfStack near the end of the show, but we’re going to build up to it.
But it’s not just PerfStack. It’s the idea behind the tool that matters. The example I give is when you’ve gone virtual, when your database servers became virtualized: you have to understand that there’s more between the end-user and the data, and that you have to be able to measure all those layers in between.
Right, exactly. So we’ll dig into that, but I want to, again, I want to build up to it. What I want to do is I want to go over that CPU, just to show a quick build, and then…
Because you can’t let that go.
I can’t let it go, but the fact is that, as new monitoring engineers discover SolarWinds tools, it’s a discovery that people make repeatedly. It’s not just something that we talked about one time, and now everybody’s got it. There’s always new people discovering the value of monitoring, and monitoring glory, and all that.
All right, let’s take a look.
So, as we said, CPU is high. I don’t need to show you that. We know we can monitor CPU. But the thing I want to show you here is this one other component, which is the processor queue length, which is, as the name implies, the number of jobs waiting to be processed. And you can see I’ve just got some statistic data here that it’s up, it’s down, it’s whatever it is, whatever the number is. But this is the number of jobs waiting to be processed. Now, on top of that, I need to know the number of processors I have on the machine. When I originally talked about this on THWACK, I had this whole query, and it was all built into the alert. And this is why I say things have changed, because with web-based alerting and with some discoveries, especially in our larger environments (looking at you, Jay Bigley on THWACK), we’ve discovered that perhaps querying the number of CPUs every time this alert runs, which is every minute, may not be particularly useful. Perhaps you know about running long-running queries constantly and the effect they may have on, say, the SolarWinds database.
A little bit.
A little bit. So we came up with another way of doing it, which is to run a query on the backend. Now, I did it as a stored procedure, and I know you have many feels about stored procedures.
All the feels.
You have all the feels about it, but just as an example: I’ve got that, and I just want to show you what it looks like. So here’s a query, and we’re going to have this linked in the show notes also. This goes through, and you can run it every hour, every day. You probably don’t need to run it more than once a day, because the CPUs on a machine…
CPUs don’t really change that frequently.
Again, we’re talking about the number of CPUs. On-prem is not Elastic Compute. So this query goes through, and there’s a custom property called CPU count, and this will simply update the CPU count to be the number of CPUs. So, just understand that we’re now talking about two things. We’re talking about a PerfMon counter that collects the number of jobs in the processor queue, the Windows processor queue length, and we have a back-end query, run as a stored procedure or however you want to do it, which calculates the number of CPUs on the box and places that number in a custom property. And once you have that, then we can go to the…
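The idea of refreshing a slow-changing value on a schedule, instead of recomputing it on every alert evaluation, can be sketched as a small cache. This is not the stored procedure from the show notes; `CachedCpuCount` is a hypothetical name, and `os.cpu_count()` merely stands in for whatever inventory query would populate the custom property.

```python
import os
import time
from typing import Callable, Optional


class CachedCpuCount:
    """Hold a CPU count and refresh it at most once per max_age_seconds."""

    def __init__(self,
                 fetch: Callable[[], int] = lambda: os.cpu_count() or 1,
                 max_age_seconds: float = 24 * 60 * 60,   # once a day, as suggested above
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._max_age = max_age_seconds
        self._clock = clock
        self._value: Optional[int] = None
        self._fetched_at = 0.0

    def get(self) -> int:
        now = self._clock()
        # Only run the (potentially expensive) lookup when the stored value is stale.
        if self._value is None or now - self._fetched_at >= self._max_age:
            self._value = self._fetch()
            self._fetched_at = now
        return self._value
```

An alert that runs every minute would call `get()` each time, but the underlying lookup would run only once a day.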
Then the magic happens.
Then the magic happens. So here we are, in the alert. And let’s see what we got. Well, the first thing is, I’m setting the scope of my alert. I love doing the scope because it really narrows it down. It also allows me to know exactly which machines are going to trigger for this alert, so I just talk about my Windows boxes, because processor queue length is a Windows PerfMon, and then I’m saying when the CPU is greater than 90 for more than 16 minutes, so approximately three polling cycles. And then also, and this is, again, the wonderful part about the web-based alert manager is that before, this had to be a really complex SQL query. Now I can do it straight through the GUI. GUI is good. Again, so now I’m looking at a component alert. I’m combining it with a node alert. Again, just to show you, this was a node-based section. This is a component-based section. Again, narrowing it down just to Windows machines, and I’m looking for where the component name is Windows processor queue length and the statistic of that component is greater than CPU count, that custom property, CPU count. So that is my trigger. And I really don’t need to go any further than this. So again, using the web-based alert manager, we’re able to see where the CPU is higher than whatever the threshold is that I want, and also the component, the PerfMon component. Windows processor queue length is greater than the value of the CPUs, which we’re collecting in that backend process.
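Outside the alert manager, the trigger logic just described can be sketched as a plain function: CPU above the threshold for several consecutive polls, and processor queue length greater than the stored CPU count. The function name and parameters are hypothetical; the defaults (90%, three polls) follow the values mentioned above.

```python
from typing import Sequence


def cpu_alert(cpu_percent_history: Sequence[float],
              queue_length: float,
              cpu_count: int,
              cpu_threshold: float = 90.0,
              sustained_polls: int = 3) -> bool:
    """Fire only when CPU stays high AND more jobs are queued than there are CPUs."""
    if len(cpu_percent_history) < sustained_polls:
        return False  # not enough history yet to call it sustained
    recent = cpu_percent_history[-sustained_polls:]
    cpu_sustained = all(value > cpu_threshold for value in recent)
    return cpu_sustained and queue_length > cpu_count
```

With four CPUs, `cpu_alert([95, 96, 97], queue_length=6, cpu_count=4)` fires, while the same CPU readings with a queue length of 3 do not: the machine is busy but keeping up.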
So that’s the best you got?
Well, okay. So, it is my favorite one, but it’s my favorite because it exemplifies the idea that you can’t just take that simple alert. But it carries over into a lot of other situations. Another simple one is memory. I mean, you know, CPU and memory sort of go hand in hand. You know, when memory is high… Here, I got an example. I have a box here, and we can see the memory is really high. It’s at 89%. And the junior monitoring engineer would immediately throw an alert. Ah, it’s really high. But we know that there are applications that just automatically grab a lot of memory. Like SQL Server will automatically grab…
Whoa, whoa, whoa. You need to stop saying that. Like, yesterday.
But, okay, okay. But you get my point.
No, SQL Server does not just go and grab the memory like that. It does what you tell it to do. Maybe your point is you want to spread alternate SQL facts.
Well, okay, but my, what I’ve noticed is that if you don’t do any configuration on SQL Server, that’s what it’s going to grab.
Yes, if you are installing and leaving all of the defaults, it is possible that SQL might grab more memory than you desire.
That is—okay, fair enough. My point is that high memory by itself, regardless of whether it’s self-inflicted or not, isn’t by itself telling you anything’s wrong. Because the fact is that even if your SQL Server is configured to grab 80%, which is what it does by default. [Leon laughs] That’s okay, that’s normal. There’s nothing wrong with that. So I have another look at this. I have another view of this.
Which is that when memory is high, that tells you something’s going on. But again, that could be just correctly sized. What you really want to know is when the page file utilization—okay, on the box, when that is high, as well as the pages per second is high. So I’ve set this up. I just want to show you what I’m talking about.
So here, you know, memory utilization is high, fine. I’ve also set up a component to look at the page file utilization. That tells me that we’re just using 30% of the page file. So it’s really not having to page out. Now, this is almost the opposite of what you were talking about with the SQL Server. Here, paging out means that I’m running out of memory, and I need to page it, and that’s not so good. So if this was high, say 90%, that might be a problem, but even by itself, again, it’s efficient. It’s using up all the memory. It’s now using page file. That could be just because it’s busy. The third component, the third element that you want to look at, is the number of pages per second. If I am throwing pages out and reusing them, now I know there’s a lot of churn. And the formula that I’ve typically gone with is: when the memory utilization is over, let’s say, 90, and the page file utilization is over 90 also, and the number of pages per second is over 25…
Then I know on a Windows machine that something’s going on. Time to get out of bed. Time to look at what’s happening. But the point is, you know, all jokes about SQL Server aside, although they’re funny… it’s funny to watch the little vein on his forehead. The point is that as you’re looking at the problems you want to act on, you need to figure out what those causal factors are, what filters into it. And the best people to tell you about that, if you’re not sure yourself, are the people who are responding to the tickets now. This is how I found out about this. I talked to our overnight NOC, and I said, how’s this alert working for you? And they said, I hate it. Well, really, why? Well, it doesn’t tell me anything. What I really want to know is… And that’s where the magic happens. That’s where the golden conversation is: what I really would like to know, what really matters to me. And that’s when you can get into that kind of thing. So, those are just two examples. There are many more examples in the Monitoring 201 eBook, which, again, we’re linking in the show notes, and we have network examples, and we have virtualization examples. I just gave you two just to sort of, you know, whet your appetite a little bit.
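The three-part memory rule from a moment ago (memory over 90, page file over 90, pages per second over 25) can be sketched the same way. The function name is hypothetical, and returning the list of exceeded conditions is an added assumption, meant to give the person answering the ticket something more useful than a bare "memory is high".

```python
from typing import List, Tuple


def evaluate_memory(memory_pct: float,
                    page_file_pct: float,
                    pages_per_sec: float,
                    memory_limit: float = 90.0,
                    page_file_limit: float = 90.0,
                    pages_per_sec_limit: float = 25.0) -> Tuple[bool, List[str]]:
    """Return (fire, exceeded): fire is True only when all three limits are exceeded."""
    checks = [
        ("memory utilization", memory_pct, memory_limit),
        ("page file utilization", page_file_pct, page_file_limit),
        ("pages per second", pages_per_sec, pages_per_sec_limit),
    ]
    exceeded = [f"{name} {value} > {limit}"
                for name, value, limit in checks if value > limit]
    return len(exceeded) == len(checks), exceeded
```

For the box shown earlier, with memory at 89% and page file at 30%, this stays quiet regardless of the paging rate.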
So I’ll see your favorite CPU metric, and I can show you one of my own.
All right. So, real quick I’ll just show you one of my favorite CPU metrics, which really isn’t a metric that you can just find. It’s not in PerfMon. You have to do a little math. You have to take a couple of PerfMon counters and do some division.
I know, don’t panic. But it’s called Signal Waits Percent. So what you can see here is for a server, and I picked this one. It’s got some utilization. Not a lot. I picked this two-hour timeframe, and you can see that the box, the guest itself is at 11, 12%. The instance for SQL is using nine or 10 of that, so a little bit of overhead for the guest itself. As you can see, there’s this metric called VM CPU Ready Time, which is interesting, because that simply says the guest is asking the host for an available scheduler. Because you know, that’s how the resource utilization works between hosts and guests. I sometimes just call it sort of a round robin. That’s not really how it is. It’s more of, hey, the guest says I need something, and if the host can give it, it will, or the guest may have to wait, because all of the processors are busy right now. That’s how virtualization works. So, VM CPU Ready Time is a nice indicator to let you know if your guest is waiting for an available scheduler. Well, guess what? SQL Server has this thing called a Signal Wait, where a request, what you’d call a query, is simply waiting for a signal from a processor that says, I’m ready to take your requests now. So, you can go into your wait stats, and you can calculate, very quickly, you know, the amount of requests that are simply in a runnable state and compare that to the overall number of requests that are out there and say, you know, if you have a large number of requests that are creeping up, simply waiting for a scheduler, just like you talked about. What are the number of SQL jobs, and how many processors do I have, and what’s the available amount? Maybe I want to alert myself if I get to one and a half times, or twice as many. In this case, we do it as a percentage, and I say, well, you know, what are the total waits that are runnable and waiting? If they get above 25%, maybe you want to let me know. On this box, of course, now that’s not the issue.
So when I look at this, if I was to happen to see that there was CPU pressure, I would know. Well, actually, is there a bunch of queries that are trying to churn through the processor, and maybe you only assigned one or two vCPU, and maybe that’s not enough. And I can see that I just have a lot of little requests. I have a highly transactional database that’s just trying to spin through a whole bunch of, a flood of requests at once, and there’s not enough processors to handle the load. So for me, that’s another little example of where you need more context than just saying, oh, the CPU is high.
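The "little math" behind Signal Waits Percent can be sketched as follows. One common way to derive it is to divide cumulative signal wait time by cumulative total wait time, for example from the `signal_wait_time_ms` and `wait_time_ms` columns of SQL Server’s `sys.dm_os_wait_stats` view; whether the tool on screen computes it exactly this way is an assumption. Because those counters are cumulative, the sketch works on the difference between two snapshots. The function name and tuple layout are hypothetical.

```python
from typing import Tuple

# Each snapshot is (cumulative signal wait ms, cumulative total wait ms).
Snapshot = Tuple[float, float]


def signal_waits_percent(earlier: Snapshot, later: Snapshot) -> float:
    """Percent of wait time in the interval spent waiting for a scheduler."""
    signal_delta = later[0] - earlier[0]
    total_delta = later[1] - earlier[1]
    if total_delta <= 0:
        return 0.0  # no waits recorded in the interval
    return 100.0 * signal_delta / total_delta


def cpu_pressure(earlier: Snapshot, later: Snapshot,
                 threshold_pct: float = 25.0) -> bool:
    # 25% is the rough threshold mentioned in the discussion above.
    return signal_waits_percent(earlier, later) > threshold_pct
```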
So I want to build through the thought process leading up to PerfStack, because I think that SolarWinds likes to show PerfStack. Like, we like to just drop right into it, because it’s awesome and really exciting. But sometimes, you don’t always follow the train of thought. So I want to build up to how SolarWinds got to the PerfStack experience, the PerfStack solution set. AppStack has been out for a while. I think that a lot of people are familiar with what it looks like. The really awesome part about it, for me, is the more modules that you have, the more data feeds into it, so that you know, when there’s a problem, which components are involved. Here I am on the AppStack screen, and I can see that I have a few applications which are slightly unhappy. Hopefully your screen never looks quite this red.
It’s very Christmas-y.
It’s very holiday-esque, indeed, it is. Certain holidays, at least. So you’ve got that, and let’s just say that this particular application, this Microsoft SQL Server running SQL is bothersome to me.
Of course it is.
It is a challenge, and I can click on that, and it’s going to go into spotlight mode and focus on just the items that are involved in this. There we go. So I can see that this application is involved, but there’s also some other applications which are involved in some way. I can see that there are some transactions I’ve set up that are also being involved, and all the way down the application stack, which is amazing, because sometimes it’s hard to know, as I like to say, that the elbow bone is connected to the wrist bone, is connected to the finger bone. Like, how do you know which ones are involved? The challenge that I have with this is that it tells me how things are right now. It’s this point in time, not what was happening five minutes ago or a half hour ago. I just know there’s an alarm of some kind on here. Normally, I would take, I would go from here and say, all right, so this is Microsoft SQL Server, that box is having an issue. Let me dig into this. And I would go to— Well, I happen to have AppInsight for SQL. So I would jump into that screen and start to dig through. I would see something like this.
One screen after another.
You know, and I’d be going through, and from here, I would start to go from, let’s say the AppInsight screen to the node screen trying to figure out which things. I’d be going back to my AppInsight screen and say, okay, so wait, what else is involved? I’d go to the storage manager, and I’d look at the storage screen, and that’s where PerfStack picks up, is let’s find a way to show that information all together. So let me go into my app screen. Here I am. I’ve already loaded up the entity. This is, you know, Microsoft SQL Server running on that box right there, and the first thing that I’m going to do actually…
Well, hold on. But when you hover over that, there’s a little entity. That lets you show all the related entities.
All the related entities, right.
That’s the sugar, right? That’s the sweet part of, for me, PerfStack. So that little diagram, that’s where you start. You’ve already got them listed here, the metrics, but the reason you have that list on the right is because you’ve already clicked on and found all the entities related to that data. So, similar to AppStack, where you see everything, the spotlight view, everything related.
Right, so it’s all related. And the first thing I do here is, I will put on the actual alert, and that will be my first graph. And the reason why is because I want to keep track of when did this problem start, when did it stop. That’s how I know that I solved the issue.
Did it stop?
Has it stopped yet? Have I fixed one thing, but there’s still multiple issues involved?
And in the last 12 hours, by default there?
Right, so you can change that to be whatever timeframe you want. We’ll just keep it at 12 hours for now. So the next thing I start doing is throwing in more elements. For example, just based on our conversation before, I want buffer cache hit ratio. I’m going to make a new graph out of that. And my buffer cache hit ratio is pretty up there. It’s 100%.
No problem then.
Nope, there’s no problem.
No, there’s no problem.
So there’s no actual problem. This is a false alarm. No, it’s not a false alarm. That’s the whole point.
We don’t have enough data.
So, if I stopped at buffer cache hit, that would be simplistic. So as you taught me, we want also page life expectancy.
Well, yeah, it’d be a good thing to go hand in hand.
Right, and I’m tired of scrolling. So I just want to put in page life. Here it is, look at that. I’m going to put it on the same graph. I’m going to map it right there. And here, we can see that something is happening here. And we would continue to build this until we had a full picture of what the problem is, and we’d also be able to start making changes to the environment and see the solutions represented here. So that is sort of the top or the first use case that we like to— That I like to talk about to people when we’re in the conferences and things like that, as far as PerfStack goes. However, Destiny had another use case that she’s really having a lot of fun with, and it makes a lot of sense. We want to give full credit to her, but this is where I’m going to start a whole new one. This is where you use PerfStack to identify that your change went well, or that your effect or the thing that you were doing went well. So, as Patrick likes to say, these are bumpers on a change control. So I’m going to make a network change, a network configuration change. So we’ll start a new project here, just a blank page, and I’m going to this time pull up a router, and get all the elements that are associated with that router.
Let the sugar happen.
Let the sugar happen. Ooh, honey, honey. And what we’re going to do— the whole point of this, while it’s loading, is that I’m about to make a configuration change, and I want to make sure that that configuration change, whether it’s a traffic shaper or, you know, a tweak to the interfaces, to change the bandwidth statement, or I’m using another interface. Whatever it is, I’m channel binding, whatever it is, I want to know that that change, A, did what I thought it was going to do; B, didn’t have any unintended side effects.
So here’s my core router. You can see I have all these other elements. I’m going to bring up main interface, and I want to keep an eye on the peak bandwidth. I also want to keep track of interface errors: errors transmitted and received, and discards transmitted and received. We’ll put that on the same graph. So now, I’ve got a running view, and obviously, you can add more statistics as you want. I’ve got a running view of the impacted elements of this device, and of my network, and now, as I make my configuration change, let’s say that it happened here, I can see that, oh look, I relieved the pressure on the bandwidth utilization. I was actually able to fix it. So I can see that my change actually had the right effect. Now, you know, the opposite, I could have made a change and immediately seen that spike. And it’s like, okay, time to roll back. And that’s where NCM comes in handy. You can just put the previous config on, and everything goes back to the way it was.
I think that’s one of the greatest strengths besides all the sugar on the left for adding all these, but also, in terms of baselines, we all have to upgrade things and make changes as we go, and PerfStack’s going to make it really easy for you to compare before and after, and very quickly share the report. So if somebody says, oh, all we did was make a change, and now everything’s broken, hold on a second. Let me share this report with you that shows all the metrics that say nothing’s actually changed. So if you are having an issue, let’s go find where the root cause really is.
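The before-and-after comparison described here can be approximated with a few lines of arithmetic on any exported metric. This is only a sketch of the idea, not anything PerfStack does internally; the function name and the `(timestamp, value)` sample format are assumptions.

```python
from statistics import mean
from typing import Dict, Sequence, Tuple


def compare_around_change(samples: Sequence[Tuple[float, float]],
                          change_time: float) -> Dict[str, float]:
    """Average a metric before and after a change and report the difference."""
    before = [value for ts, value in samples if ts < change_time]
    after = [value for ts, value in samples if ts >= change_time]
    if not before or not after:
        raise ValueError("need samples on both sides of the change")
    avg_before, avg_after = mean(before), mean(after)
    return {"before": avg_before, "after": avg_after, "delta": avg_after - avg_before}
```

A negative delta on bandwidth utilization would match the "relieved the pressure" case; a delta near zero across every metric supports the claim that the change is not what broke things.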
Right, so again, PerfStack is really good for both the in-the-moment troubleshooting and also as, again, those bumpers between changes to know that everything is stable. They grow up so fast.
Even better, SolarWinds through the years.
Aww. [Steady guitar music]
I didn’t think I could hate you more than I do right now. For SolarWinds Lab, I’m Thomas LaRock.
I’m Leon Adato. Thanks for watching. [Upbeat techno music]