Join Head Geek™ Kong Yang and Chris Paap, Product Manager of Virtualization Manager, as they walk through three methods—active alerts, PerfStack, and recommendations engine—to troubleshoot any virtualization performance issue.
Hi, I’m Chris Paap, product manager for Virtualization Manager.
And I’m Kong Yang, Head Geek here at SolarWinds. Chris, thank you for being on the show.
Thanks for having me.
On today’s show, we’re going to talk about troubleshooting, finding the root cause of any given issue in your IT environments.
And what we’re going to go through is some use case scenarios using Virtualization Manager to narrow down that radius of troubleshooting to identify what the root cause is.
Perfect. Before we jump into that, let’s talk about the three things involved in troubleshooting. Usually, it revolves around people, process, and technology, and specifically people. Now, there are some folks out there known as IT seagulls. What are IT seagulls? IT seagulls are people who swoop into your project…
And what, make a mess?
And then fly on out.
Yes, and then they take credit for any potential success. Those are IT seagulls.
Typical situation is the world’s on fire and you’re trying to figure out what it is, and you don’t know where to start, and somebody’s already saying it’s your problem.
Exactly, the blame game. Usually IT seagulls are at the management layer, because they’re obviously not on the front line and not getting the proper information they need to reduce that troubleshooting radius and actually root-cause the issue. So there are three things you can do to mitigate the risk of IT seagulls. One, stick to the data.
It’s important, because if you use the data as your backbone, you can identify what you need to focus on.
Exactly, and it eliminates the blame game. Number two, speaking of not blaming other folks in your IT department, collaboration.
It’s no longer good enough to say it’s somebody else’s problem. You always have to correlate that data across multiple pillars and multiple groups to find out if you can find common ground.
Yes, and I love how you mentioned “correlate that data,” because when you correlate that data across teams over time, you get the proper context to root-cause those issues. And with that, now you can start to attack that trouble spot. The only thing missing is a process, the troubleshooting process. A lot of times, IT pros come into the business new. They don’t have a process to leverage to start solving problems.
I see that quite a bit, especially when somebody’s new, or even new to an environment, they might have history and background, but they don’t have a systematic approach to figure out what the problem is. And what ends up happening is, because there’s no system in place, they spend a lot of time trying to find out what the problem is versus the time fixing the problem.
Exactly, and speaking to that process, it’s quite simple. There are multiple models out there, but the one that I like is one I’ve written about on Geek Speak on THWACK. It’s a simple eight-step problem-solving process, and you mentioned defining the problem; that’s step number one. You’ve got to at least state the problem you’re trying to solve.
I think the key, whatever method you use, even if it’s not the method we’ll talk about, is to keep it simple. If you have 50 steps, that’s way too hard. It may seem trivial to say, “Define the problem,” but so many times we’re trying to troubleshoot something without knowing what we’re actually trying to figure out.
Exactly, and I like the notion of keeping things simple, because along that front, you have to gather and analyze the proper information, whether it be performance data, whether it be logs, events.
Exactly, or it’s none of the above, which tends to be a common problem: you don’t have that data. How many times have I been in an environment where you think you have the data, and then you go to correlate it, like, what are my logs saying? And you realize there’s a gap in your back end. You’re not pulling the required metrics to support your thesis on what might be wrong.
Yep, you’re reading my mind, my friend. The next step is to form a hypothesis, because once you have the data, once you define the problem, you’ve got to state what your potential solution is so that you can go about solving that trouble spot.
Exactly, most downtime is spent trying to figure out what the issue is. If you can reduce that, you’re halfway there.
Exactly, and then you put that plan into action, and then you observe, again, performance data, to see if the issue has gone away. You rinse and repeat that same process, and once you find a solution, you document it. That’s the one area where IT pros tend to lack expertise and experience, or they put it at a lower priority. They don’t document solutions very well.
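As a side note, the iterative loop described here, define the problem, gather data, form a hypothesis, act, observe, repeat, document, can be sketched in code. This is purely an illustrative sketch of the process, not anything from a SolarWinds product, and every function name is a hypothetical placeholder:

```python
def troubleshoot(problem, gather_data, hypotheses, apply_fix, is_resolved, document):
    """Iterate over hypotheses: gather data, try a fix, observe, repeat
    until the problem is resolved, then document the solution."""
    for hypothesis in hypotheses:               # step: form a hypothesis
        data = gather_data(problem)             # step: gather and analyze data
        fix = apply_fix(hypothesis, data)       # step: put the plan into action
        if is_resolved(problem):                # step: observe whether the issue is gone
            document(problem, hypothesis, fix)  # final step: document the solution
            return fix
    return None  # no hypothesis worked; widen the search and start over
```

The point of the sketch is the shape of the loop: the expensive part is everything before `apply_fix`, which is exactly the part the tooling below tries to shorten.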
You get too busy. Something else happens that takes your attention, but it’s key because no one person’s an army. You need all the people on your team, especially if you’re going to get ahead of the issues and work on other things other than just putting fires out all the time. That’s going to help disseminate those issues across multiple team members.
Exactly, and what I find in our industry is that troubleshooting scenarios, especially the ones that escalate into disaster recovery issues or ones that interrupt revenue-generating business apps, are forcing functions on IT pros and their careers.
Exactly, and a lot of times in my experience, when you’re in a disaster recovery, it just is a disaster; there’s no recovery. [Kong laughs]
Well, I have a short story on that forcing function. There’s an analogy I like to use: the three envelopes. It’s usually applied to CEOs and so forth, but it’s so apropos for IT pros. Three envelopes. First envelope: when you run into a troubleshooting issue you can’t resolve, open up that first envelope, and in it, it should say, “Blame your predecessor.” It buys you some time, and hopefully enough time that you root-cause and resolve the issue.
Keep it from happening again.
Now, the second time, if you run into an issue that could get you fired and you can’t troubleshoot it within the window, that second envelope should have a letter that says, “Hey, reorganize, move yourself around the organization,” and hopefully that buys you some more time in your IT career. Now, that third envelope. So, three strikes and you’re out. In it, that letter will say, “Prepare three more envelopes, because your successor is going to need them.”
And that’s what we’re hoping to avoid here at all costs.
Exactly, so with that, let’s jump right into the demo and show an example of reducing that troubleshooting radius with Virtualization Manager.
All right, so we know we have a problem in our environment. That’s where we’re starting, and we just don’t know what the issue is, so we want to start with the virtualization summary pages. There are a lot of resources on this summary page, top ten counters, vCenters with issues, Hyper-V hosts with issues, guests with issues, that bubble those things to the surface so you don’t have to drill down as much. All of them are clickable, and you can see what the issue is, whether it’s a memory or CPU issue or something else. You can simply go through the discovery we have here, drill down until you find the VM in question that’s having the issue, and click on it. One of the best areas to go, though, is the All Virtualization Alerts tab. What that’s going to do is flag all the things that happen within the hypervisor, things that aren’t very transparent if you’re just looking at the guest metrics or the host metrics. These are things that get flagged specifically for a virtual environment.
See, I like this, and I liken it to the easy button for troubleshooting a virtualization environment, because All Active Virtualization Alerts, by default, pulls in all the most common issues that virtualization environments run into.
Think of it as low-hanging fruit, a shortcut to finding your problem, if there is such a thing. It takes your eyes to what’s critical and gets rid of the minutiae of detail that might be there. All Active Virtualization Alerts is critical because if, for instance, our problem was with the triggering object, in this case this VM, we could click on it and know we’re having a performance problem. It’s because of VM CPU ready. Now, any virtualization expert will know that means over-provisioning of virtual CPUs, which causes contention and causes slowdowns. We give an explanation for that. So we’re accounting for both the expert and the beginner-level virtualization person, who can now figure out what that means. What is right? Some people don’t know what the baseline is, so they can find out what the baseline is and the history of this alert on this specific VM, and they can scroll through it. What’s good about this is you can then drill into the VM even further, so you have the drill-down model that we do so well in the product.
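For reference, vCenter reports CPU ready as a summation counter in milliseconds per sampling interval, and VMware’s documented rule of thumb converts it to a percentage of that interval (20 seconds for real-time charts). A minimal sketch of that conversion:

```python
def cpu_ready_percent(ready_ms: float, interval_s: float = 20.0) -> float:
    """Convert a CPU ready summation value (milliseconds a vCPU spent
    waiting to be scheduled) into a percentage of the sampling interval.
    20 s is the vCenter real-time chart interval."""
    return ready_ms / (interval_s * 1000.0) * 100.0

# 2,000 ms of ready time in a 20 s window is 10% CPU ready, a level
# commonly treated as a sign of vCPU over-provisioning and contention.
```

The threshold at which CPU ready becomes a real problem varies by workload, but the conversion itself is fixed by the counter’s definition.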
Awesome. I love the fact that Virtualization Manager takes the time to explain what the root cause of the issue is and why that issue took place in your virtualization environment. It has the context of what’s being affected and what triggered the event as well. And if we step into the details of that virtual machine, what I like is that now you get into the data and the context of that data, because you can see CPU, memory, and I/O metrics along with this piece. You can take management actions. And like I mentioned earlier, it is the easy button for troubleshooting with Virtualization Manager. In fact, this dashboard also has the AppStack environment. So when we’re talking about troubleshooting radius, you really want to start with all the connected context to that particular VM, because All Active Alerts shows that the virtual machine had an issue, what host it was on, and what the issue was. But what does that virtual machine touch in terms of servers, storage, and applications in the AppStack view?
Exactly. Again, I never believe people when they say there’s a shortcut, but it is a way to cut through the detail you’re inundated with and see exactly what applications are being affected by this. Say we come through and see VM CPU ready is the issue here. If we didn’t know that and we were looking at the AppStack, drilling through, and we see storage issues directly related to this VM, then maybe our issue is not a hypervisor issue. And that goes back to pointing the finger. If you’re correlating this across teams and you don’t have the storage administrator on your team, you can go check that out with the storage administrator and dig into those details, so you know you’re not following a red herring. You’re actually digging in, and the issue is a storage latency issue or something else; it could be network. It quickly identifies that, so you can focus your attention on that area.
Exactly. So in this scenario, the alert came up and helped us quickly root-cause the issue. In this case, CPU ready. In our next scenario, we’re going to go into it blind, as tends to happen to most IT pros when a ticket comes through that says, “App is slow,” and then the blame game starts.
Where do you start? What’s next? That’s where your discovery process happens and you have to drill into finding what data’s important to you and identifying what the actual problem is.
Chris, let’s walk through this second scenario that we talked about where it’s a “new-to-you problem.” Ticket comes in; it says, “App is slow.”
All right, so we have a situation where we have no alert, we have nothing that supports it, and it’s a lot of discovery in this. Like, how do we get through that discovery and save time?
And even looking at it, I mean, there’s nothing in our dashboards.
Our easy button’s not really helping us.
Yeah, where do we go? Is this false, is it user error, is this actually real, or do we have an infrastructure problem? In the Orion Platform now, we have Performance Analysis, commonly referred to as PerfStack. It brings up a blank canvas. You can click on and add the entities you want. In our case, since we’re talking about VMAN or VMware or Hyper-V, we want to add a virtual machine. So, we choose what we want to see. Now we have virtual machines to choose from. From this, we select, and then we add the selected items.
We call this our palette, and from this, when we select it, we have metrics that are defined— or counters or events and alerts— that are defined for that VM. The same would be true if we chose a physical host or even a network node or an application.
Got you. So, we’re starting to reduce the troubleshooting radius. If you think of it as a circle where radiuses are, we started with an initial ticket that says, “App is slow” and now we’ve narrowed it down to Virtual Machine.
That the app resides on.
So in this case, a typical use case would be, let’s add CPU and memory counters and just see how they’re doing. So we add average CPU; it’s just drag and drop. What’s nice about this, the reason it’s drag and drop, is that it’s touch-friendly.
So we added a CPU.
Let’s add average memory. If you notice here, we can drop it here to add a metric to the existing counter, so we can overlap it. Or we can drop it so it correlates beneath it as well. This comes into play if we were looking at memory and CPU of the host as well as memory and CPU of the VM: we can overlap those on top of each other and see if one correlates with the other, or we can keep them separate so it’s a nice, clean graph.
And that’s great for anomaly detection if you have an event that creates a spike, it’s great to see, like you mentioned, if those metrics correlate to one another.
The idea here is that visualization means so much in terms of identifying where that key spot is. Was this a one-time blip? Has it been recurring? Because this is a time series of events.
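The kind of eyeball correlation described here (do the spikes in two counters line up in time?) can also be quantified. Here is a hypothetical sketch using plain Pearson correlation over two time-aligned sample series; the sample data is made up for illustration:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length metric series.
    Near +1: the series move together (spikes line up); near 0: unrelated."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Made-up samples: host CPU % and VM CPU % with a shared spike
# at the fourth sample, which drives the correlation close to 1.
host_cpu = [20, 22, 21, 80, 25, 23]
vm_cpu = [10, 11, 10, 70, 12, 11]
```

A value near 1 between, say, host CPU and VM CPU suggests the spike propagates across layers and is worth chasing; a value near 0 suggests the two anomalies are unrelated.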
So we’ve got memory, we’ve got CPU. Usually we blame storage, but even before that, we blame the network. So can we add metrics from storage and network?
Absolutely. Let’s add average IOPS total. Let’s add this here, and then we can even break that out into read and write. IOPS read, and let’s put IOPS write on there on top of that, and see how they correlate. So it looks like, in fact, on this machine, the IOPS load is not too high, but we know it’s a lot more read-intensive, which falls in line with most VM workloads.
Okay, and given this, as you’re adding some network counters in there, let’s add some network pieces. This now crosses multiple teams, because you’ve got the infrastructure team, which might have responsibility for compute and memory on the hosts; the storage team, who, if they don’t fall within the infrastructure team, do their own thing; and now the tie that binds them all, the network. So now that you’ve got the data, let’s talk collaboration. We talked about how to mitigate the risk of IT seagulls, and part of that was collaboration. How does PerfStack allow you to do that?
Two ways. First, the worst thing you can do is put together a whole palette of different counters that tell a story and identify the problem, and then have to recreate it every time you show somebody else. So you can do two things. You can save this as a project. New Analysis Projects is where we’ll keep it. Save it under whatever name makes sense, so when you ask somebody from your own team or a different team to go in here, they can come in, load, and find projects, and now they have access to the project you just built. They load it, and they’re seeing what you’re seeing. So what you have is the same common baseline; you’re looking at the same data.
Okay, and the second method?
The other method is, once you come in here, and if you’ve added alerts to this, you can save this out and export it. [Mouse clicking] You can now turn this into an attachment that you can then email out. It comes in very handy when people such as upper management don’t have access to the monitoring platform; they have access to dashboards instead. This is a dashboard, but it’s more of a troubleshooting and discovery function for finding the issue. Management generally likes to receive a nice package, and we can now go in there and send that to them.
Absolutely, so on the collaboration piece, you can get the proper SMEs to look at it. If you have virtualization experts, they can look at the virtualization-centric metrics, performance logs, trends, and analysis. If you have storage SMEs or networking SMEs, they can look specifically at their areas, and the teams can get together, collaborate, and essentially stop the blame game.
Yes, the idea of this is to find root cause when you’re looking at your environment. But the real value is having common ground that different teams can look at and then make decisions from. A common occurrence is you go to a DBA or a network engineer, and they don’t see the same thing you’re seeing, because they’re using a different tool, or they’re looking at different parameters of the same counter, or different time frames. This puts it all in the same set, so you’re on common ground.
Yep. I liken this to normalizing the troubleshooting experience for all SMEs involved. And when we say SMEs, we mean subject matter experts.
Exactly, it puts everybody in a spot where you’re not pointing fingers, but you’re actually evaluating the data. Going back to what we spoke of before, and in multiple different troubleshooting sessions, what you go through is, focus on the data. What do you have? And once you focus on the data, that’s a truth that you can work off of to find that issue.
Exactly, this will allow you to do multiple iterations until you get to the root cause of any given issue. So, a very powerful tool that can be customized for your specific issue that comes up in your environment.
Exactly. Okay, Kong, we’ve gone through several scenarios. The first one was using the home screen and All Active Virtualization Alerts to identify the problem; it bubbles up to the surface, the easy button, resolving what the issue is. The second one was having no contextual information and doing a lot of discovery in PerfStack until you drill down and find correlating data that may reveal the root cause. And the third one, which we’re going to jump into now, is using recommendations and the assisted functionality that VMAN has to actually resolve the problem for you.
All right, let’s jump right into one of those scenarios.
So on the home page, we actually have All Recommendations. It’s grouped at a high level by clusters and hosts with recommendations, but if you click on the All Recommendations page, it’ll give you an enumerated list of all the recommendations you have. We have two types. In this case, we have a problem; we have active virtualization alerts. That means we have a current issue, and it’s an after-the-fact fix, aka remediation. We also have, and you’ll see these in here as well, predictive recommendations, which we’ll touch on lightly, so you don’t get into this situation. So although we can fix any problem you have, we’d much rather resolve those ahead of time, before they even hit that criticality.
Good, good, and by default, it enumerates it by severity.
Correct, so critical things will climb to the top, things at a warning threshold sit below that, and informative things, like snapshots that are out there, come after that. If you click on the actual recommendation, it gives us information on what the recommendation is. But true to form, what we’re trying to do throughout all our products, aside from just alerting, is give you an opportunity to actually fix the problem. So in this case, what we’re being asked to do is decrease the CPUs on this VM.
Yeah, it looks like this VM has only been averaging 3% CPU utilization over the past four weeks. So, obviously, you’ve over-allocated virtual resources that could be used for other virtual machines.
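A rule like that can be illustrated with a small sizing sketch. To be clear, this is a hypothetical stand-in, not VMAN’s actual recommendation algorithm: size the VM so its observed average load lands near a target utilization.

```python
import math

def recommend_vcpu_count(avg_cpu_percent: float, current_vcpus: int,
                         target_percent: float = 60.0) -> int:
    """Hypothetical right-sizing rule (not VMAN's real algorithm): scale
    the vCPU count so the observed average load lands near target_percent."""
    # Observed load expressed in "whole vCPUs actually used".
    used_vcpus = avg_cpu_percent / 100.0 * current_vcpus
    # Round up to be safe, but never recommend fewer than one vCPU.
    return max(1, math.ceil(used_vcpus / (target_percent / 100.0)))

# A VM averaging 3% on 4 vCPUs would be sized down to 1 vCPU,
# freeing the other three for the rest of the cluster.
```

Real engines also weigh peak load, co-stop, and host headroom, which is why a four-week average, not a single sample, backs the recommendation in the demo.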
All right, so if we decide in this case to select it and execute on it now, it shows it’s running. It’s going to actually do the work for us and bring the VM back up in a functioning state if it does take it down, if it’s not a hot-add situation. If it were something that would stop the application, you’d want to take time and use change control to schedule it. So we have scheduling functionality that lets you set it up so you’re not up at 3 a.m. doing it. You can set that up, have it send an email when it’s completed, and then go verify.
Perfect. And I like how as you take the management action on the recommendation, it has a step-by-step process of what exactly it’s doing.
All right, and this case is a perfect example. It didn’t successfully finish; it had a problem executing, an error powering the VM back on. If we go through, it could be that it had media attached, that it was doing another function, or maybe it was the rights I have on this. But we do give informative details of what’s going on with the VM and why it didn’t power back up, and we keep a running history of that as well.
Perfect, thanks Chris.
And we’ve covered three ways that any IT professional can use Virtualization Manager in their troubleshooting efforts.
The first thing, the easy button, the All Active Alerts.
Quickly providing context of what the problem is. And then being able to drill into that problem and find out when it occurred and how to fix it.
And the AppStack view within that is super powerful.
Yeah, showing what the relation is to other environments.
And number two, when you have zero context and you have to build that case: narrowing down that troubleshooting radius by adding performance metrics from different subsets.
That’s right; you’re essentially providing the context. You’re building in context as you’re drilling down and finding what information is important to you to figure out what the root cause is. And then providing a visualization that you can share with other teams.
And lastly, who doesn’t need an assistant? Leveraging the recommendations engine so you can take management actions to help you troubleshoot and remediate issues.
Right, you can either schedule it or execute now, and it resolves the problem for you. Or, what I think is even the better option, is the predictive recommendations that keep it from becoming a critical issue in the first place.
Perfect, and folks, we will have links to troubleshooting tips as well as the VMAN 411 page in the helpful resources section, so look for those. For SolarWinds Lab, I’m Kong Yang.
And I’m Chris Paap, thank you for joining.