Video

SolarWinds Lab Episode 52: Before You Upgrade to NPM 12.1: PerfStack, Meraki Wireless, and AWS Monitoring

PerfStack is here! Like NetPath™ before it, this new tool represents another huge upgrade to the Orion® Platform. To unlock its full potential, tune into SolarWinds Lab. Join Head Geek™ Patrick Hubbard and Senior Product Managers Steven Hunt and Chris O’Brien for an in-depth look at this powerful addition to Network Performance Monitor and Server & Application Monitor.

Learn how to use the new PerfStack editor, get tips and tricks to enable team troubleshooting, and discover the secrets to integrating PerfStack into alerts. As an added bonus, they will also teach you how to configure the new Cisco Meraki wireless monitoring in NPM, and native Amazon AWS cloud monitoring in SAM.

Episode Transcript


Hello and welcome back to SolarWinds Lab. I’m Patrick Hubbard.

I’m Steven Hunt.

And I’m Chris O’Brien and I still think your promotion of this episode was over the top ridiculous.

What are you talking about? PerfStack!

That.

Okay, but look, yesterday SolarWinds released NPM 12.1 and SAM 6.4. And just like with NetPath, we added a new and very powerful tool, which we call PerfStack. But like with NetPath and AppStack, and all the other big features we recently added, you’re going to need to get hands-on with PerfStack in order to be max awesome!

Max awesome? You’re still doing it.

He’s generally excitable. Lab coat makes it worse.

Okay, how’s this? PerfStack, in many ways, is the follow-on to AppStack– the automated context engine in the Orion Platform, which allows administrators to quickly identify all the components of an application across all layers of the infrastructure stack.

But what AppStack doesn’t do is give you an easy way to compare multiple metrics, events, and data side-by-side, based on that automated context. And that’s where PerfStack comes in.

Yeah, and there’s more to it than that. PerfStack starts with the automatic associations AppStack is aware of, and extends your expert knowledge of your environment. So it lets you explore virtually any monitoring metric from your entire environment and connect data points that only you, as an engineer, can connect.

Right, you can sort of think of it as a metrics laboratory, right? It’s a sandbox where you brew up cross-system views across dozens of related metrics, and then either save them to keep an eye on them over time, or work collaboratively and share them with the teams that need them, without spamming everybody on the team.

See you can do it without shtick.

Shtick?

Shlock.

Really?

Really.

Okay, fine then. Chris, you’re the Product Manager for NPM. And Steven, you’re the Product Manager for SAM. Why don’t we do this? Let’s walk through a scenario where a network admin and a sys admin are going to work collaboratively to troubleshoot a really complex issue and try to identify the root cause. And we’ll do that using PerfStack.

Yep, hint. It’s not the network.

It’s the network.

Okay, well we’re about to find out. We’re also going to show you how to build, save, and share PerfStack views. And then I’m going to cover a couple of tricks like how to embed them in alerts.

Then, as long as we’re at it, let’s also take a quick look at native AWS monitoring with SAM 6.4.

Ah, that’s really easy to set up but we should definitely show it anyway.

And Meraki.

Chris, you can’t just say Meraki.

I think I can for all the customers that have been waiting for wireless Meraki monitoring.

Okay. Well then, this is going to be a great show. We’re going to show how to use the new PerfStack, we’re going to show native AWS monitoring for cloud, and Meraki. And Chris, PerfStack can also include cloud and on-prem, right?

Yes.

Well then, you could almost say that, from a hybrid IT perspective, this show is going to be at least a bit cloudy.

Shtick.

Okay, so we’re going to talk about PerfStack for a second. Before we do, let’s just show them one so they get a sense of what this looks like visually. And then I’m going to ask a couple of questions. So let’s pull up the SharePoint example now. SharePoint never has a problem. It’s never a trick to debug it. You have multiple layers, right? So this one– like, I’ve got my alerts running across the top, and events here. And then I’ve got all the metrics that are related to the delivery of this application across everything that normally would have been sort of AppStack metric views. But now I’ve actually got histograms and I can see that. So this was a pretty big RC, right? You had a couple of hundred…

Several hundred.

Customer orders. And then, you’re on the phone with them a lot, talking about how they’re using it. Give me a couple of the use cases for customers that are using this now.

So, a lot of the customers that I talk to, they talk about how AppStack gives them that end-to-end visualization. But what they didn’t get was the ability to see that end-to-end set of metrics, right, to solve those problems. So with PerfStack, they are actually able to drag and drop those different metrics from different systems, from the application to the server, to the virtualization layer, all the way down to the storage layer, and understand really where the problem exists without having to traverse different detail pages within Orion.

And instead of just AppStack, it’s actually the metrics they’re used to seeing.

Correct.

So it’s IOPS. It’s memory. It’s performance for the virtualization layer. So whoever is debugging with it, or troubleshooting, is seeing numbers that they recognize in terms that they recognize.

Right.

And then you had a couple too.

Yeah, well for one of the network use cases, one of the most common is still this core use case of looking at an interface and understanding utilization and errors; what conversations are using bandwidth and so forth. It’s sort of like the story of an interface. And you can do that in NPM today, but you have to pull the data together from separate graphs. We found some of our RC users were putting all of this data– like status, transmit and receive, errors and discards, and NetFlow information– in a single screen with the same shared timeline. They can move around in the timeline. So it’s been a common use case for us.

Right? Well that is really handy in terms of like a drill-down for NetFlow. Because NetFlow can identify a lot of problems, but being able to tie those other metrics to it is really important.

There’s so many slices for NetFlow.

Okay well, let’s go back here. We’ll clear this out and then you guys walk through an example of how they typically troubleshoot.

So let’s do a scenario, right? So I’m going to pretend that I’m a network engineer. Which I have been in a past life, so that’s easy for me. I’ve got a user who has come up to me and told me that WestWeb, a web server, is running slow. I don’t know what WestWeb is other than a web server, so we’ll go over to ‘Add Entities’ and just search for it. See what we get here. Okay, we’ve got a couple of machines. WestWeb01 is the one they mentioned. I’ll add those. So now, I’ve got those entities on the left. I’ll click into one of those and PerfStack will pull up for me all of the metrics that we have about that entity. So, I’m a network guy, right? My responsibility is network delivery–transit. So I’m going to look at Average Response Time.

Okay, so hold on there. That was really cool. So basically, you’re using the tray on the left-hand side almost like a toolbox. You search for an entity, and then you click on the entity. It’s going to give you the metrics that are available for that object. And then you just drag it over and let go, and it added it into that tray.

Left to right: entities, metrics, visualization. So I’ve got my Average Response Time. I’m going to pull over Average Percent Loss.

Don’t let go of it yet. So I was going to say, if you drop it into the green, it’s going to stack it.

That’s right.

And if you were going to create a new one, drop it in the blue zone.

That’s right. So you have to decide– do I want to compare these directly in the same chart or do I want to have separate charts? So in this case, I do want to compare them directly in the same chart. Looks like there’s no loss. One point for me. There’s some variation in the delay. So we’ve got a spike here up to 129. That’s a little bit concerning. Kind of going to gloss over that because that doesn’t align with sort of the time they told me the problem was occurring. And that was actually back a ways. So let’s go take a look at this time period.

So this is also time travel for the metrics.

Yes. Okay, so we use– now that’s going to update all the charts immediately. So here, we’ve actually got less variation in the latency. So that’s great. So in general, my performance of my network seems to be okay, at least from the poller to this thing. So the next thing I’m going to take a look at– because sometimes when we have problems with this office, this east location, it tends to be their WAN interface. So knowing that, I’m going to go ahead and cheat and go ahead and take a look at that. I’m not going to assume that’s fine just because of the metrics here looking good. So I’ll add some more entities here– An East WAN Router. That’s what I called it. So we’ll select that and add that to the entity palette over here. So it comes up as a node. Now I don’t actually want to see the performance of that entity. I want to look at its interfaces so I can scroll over here and click this ‘Add Related Entities’ button.

And what I love about this, in this case, it’s actually using the entities that it knows about from the node interface relationship. But if this was– and a little later, we’ll see this with an application– if this was something that was connected through AppStack, it would actually use that to connect those elements.

Yep, absolutely. So here, it’s my MPLS circuit that I care about. This is a sub-interface. So we’ll start with that. We’re going to take a look at the traffic. That’s usually where I run into problems. We’ll look at–I want a percentage. See all of this on a percentage. And I want both transmit and receive. So here’s my transmit percent. Create that in a new one. I’ll get my received percent as well. So my usage looks pretty okay. It looks like there’s a spike here. Now this is interesting because they actually told me that the problem was occurring in the time before this. Something like starting at noon– rather, ending at noon. Whereas my metrics here are starting at noon. So there’s a little bit of that time correlation going on, where it’s like this seems to be the repercussion not the root cause. Okay.

So if it’s choked out then you had a burst of traffic out there.

Yeah, yeah. Yeah. So the next thing I want to take a look at is errors and discards, because that’s what we do. That’s one of the things we check on. And now the interesting thing about errors and discards is they’re logged against the physical interface, right? Because the switch or the router doesn’t necessarily know what VLAN and therefore what sub-interface a frame belongs to, because the frame’s an error in the first place, right? So you can go and take a look at that entity. And what we’ll do is we’ll navigate over to ‘Errors and Discards.’ And we’re going to line up our ‘receive and transmit’ the same way that we did with this MPLS circuit. So we can compare straight, in the same direction. If that makes sense. So we’ll take a look at errors. I want it in a percentage. I’m trying to compare percent to percent. So I’m going to do ‘Receive Percent Errors.’ I’ll drag that on to my ‘Received Percent Utilization.’

Now if you stack charts where one of them is a metric and the other one is a percentage, you can still actually do that in a single view.

Yeah, yeah. That’s absolutely true. So it’s just sort of, how do you like to understand that visualization? So as that loads up for us here, we can see that the percent error is negligible. It’s nothing, right? It’s zeroes across the board.

So the physical interface is not a problem.

So it generally looks like my transit is fast. It’s not dropping packets. And the most common error that I see, which is either high utilization or errors on my interface, that’s not happening either. So I’m not really seeing any indicators that this slowness that the user reported to me is a network problem. Now I don’t know a whole lot about systems, but I’ve got the node right here. I’ve already found the server. So I’m going to take a quick peek. Now as a network guy, I don’t pay attention to processes and a whole lot of the deep metrics and queues and all that junk that you’re aware of. I think about usually just CPU and memory. When those are high, problems happen. We’ll take a look at CPU and memory. It’s written right here. I’ll pull over ‘Average Percent Memory Used.’ We will also grab ‘Average CPU Load.’ Right away, I see some high numbers going on. Ooh, shoot! Okay. So we’re at 100 percent, 90 percent CPU utilization for a sustained period of time. Our RAM looks high. And this actually does correlate. So this actually is– I’m going to actually get this out of the way. So we can get these a little closer.

So you’re just doing a little housekeeping, trimming that?

Yep. And so we seem to be having a problem here. This aligns with when the user reported the problem. And this network issue seems to be a repercussion of this. Maybe he’s spending all of his CPU cycles making a mess on the network. So I’m going to pass that off to my systems guy who we have right here.

Okay, but before you do that…

It makes me happy.

It does make you happy. And it’s never the network, right? But how many times do you have one where you transition that issue to someone else on the team? Maybe you have the systems team, and the first thing that they do is question the metrics that you used to say this isn’t a network problem. They’re convinced that it is. So in this case, you’d just come up, grab this URL, and put it in the ticket or throw it in Slack or whatever else. The very first thing that the other team is going to get, when they pull it up, is the evidence that it’s not the network. So it lets them go right to work on troubleshooting the actual root cause, instead of questioning whether you’re sure. Because I know that thing’s been a problem. And I’ve seen a lot of errors. You can actually see that in this example right here.

Yeah, pass the data.

All right, just pass the data with the context to show that you’ve actually ‘troubleshooted’ it.

Yep.

Yep.

Okay, so we’ll do-si-do over here and we pull this. You got a Slack notice. Click on it, and it opens up this exact same view.

That’s the great thing about it. I can actually prove that this is going to happen by taking that URL, reloading it in a different browser.

Okay.

And then.

The only thing.

There’s the data.

And then the only thing that’s different is that the drawers are collapsed.

Right. So at this point, he’s told me that– my network guy’s told me it’s not the network.

Looks like not.

And he’s even provided some evidence.

And he was kind of ‘chippy’ about it.

Usually. All right. But the one thing that he didn’t really highlight is what’s the root cause. He simply said, “Hey, I found out that it’s not the network. It’s the system.” So now, the real work can actually begin.

And he said that CPU thingy, that he doesn’t know anything about that.

Yeah

A CPU.

It’s doing something strange.

It’s a problem.

So you’re right. That is a problem. A lot of times, when we’re trying to figure out what’s going on with the response of a web server or something like that, we’re looking for some semblance of a problem.

Right. Or, in this case, maybe it’s not a problem, but it’s a signal. It is telling me that something changed.

Right.

On that system.

Right. As you were mentioning earlier, if we want to go find information that’s related to this– just as Chris was able to find information that was related from a network perspective– we can do the same thing from a server or an application perspective. We use those AppStack relationships. So all we have to do, just as Chris did, is click the ‘Related Entities’ button. This is going to go through, find all of those relationships that we know about, from the application all the way down to the end of the stack, and bring in all that information.

And of course, you get a lot more than you would with the network because you’ve got all the layers represented here.

Right. So I’ve got the server information. I’ve got virtualization layer information. I’ve got application information. I even have end-user experience monitoring information from…

WPM.

WPM, yeah.

That’s very cool.

And then everything that I have, that I’ve understood with the relationships inside of Orion– I can pull up all of those related entities.

And the group membership here. This is an Orion group, right?

Correct.

So it makes it easier if you were troubleshooting regular systems. You can almost use it as a shortcut because it’s going to be a specific subset of systems, and you can expand everything as part of that group.

Yep. All I have to do is click this button. Now I can find all the related entities associated with the group.

And custom properties might be a great way to do dynamic groups to associate systems.

Perfect example! So we know that there’s high CPU. Memory looks like it’s trending up. It may or may not be a problem. We don’t necessarily know. And we can take a look at baselines and understand kind of where we sit from that perspective. I need to dig more into the layers. I need to understand what is the root cause of this high CPU utilization. So I can come in. I can take a look at my virtual machine information. And look at virtualization specific statistics.

So this is coming from Virtualization Manager data?

Correct, correct. What we have here, from the node perspective, is Server and Application Monitor. What we have from the virtualization layer is Virtualization Manager. We come further down into transactions; that’s Web Performance Monitor. If we had relationships for storage related entities from Storage Resource Monitor, we would have that in there as well.

So if I had come into this app via sort of an open app ticket, I would have seen a mini stack view. The miniature AppStack view. And it would have had many of these same elements.

Right.

Or actually, it would have had the same elements.

The exact same elements. So I want to actually take a look at the load as it’s represented on the host, for both CPU and memory. So we can see what is the actual load on the virtualization host. We could dig into the virtualization host and see if there is someone else consuming a bunch of resources.

Right. Straining?

Noisy neighbor.

Correct, correct. In this situation, we don’t have a whole lot of overall utilization on the host. We can confirm that by coming in here, taking a look at the Average CPU. And I’m just slowly building up my palette of metrics. I can look from where we were looking at the network information. I can come through to the rest of the CPU and memory from the server. Look at it from the virtualization perspective, and then go into the host and see, ‘Do we have some type of resource constraint at the virtualization layer?’

And the other thing you’re doing here that I like, in terms of this being sort of a laboratory for experimenting with data, is that as you add new ones, you’re dragging them over to the top. So literally, ‘top of mind’– for me, I sort of think of it as ‘top of stack.’

Correct.

Where I’m debugging.

Yep. I’m not finding any ‘constraintive’ resources here, from a CPU perspective, on the virtualization host. So it doesn’t seem to be an infrastructure stack-related problem. No root cause yet. But I still have more information that I can dig into. So I can actually come into the application itself. And I can start looking at application-related metrics to understand: Is there a problem at the application layer that actually uncovers the root cause?

And in this example, you are both responsible for managing the virtualization infrastructure as well as the app. A lot of times, you might have someone else– either some of the dev team or someone on the application delivery team. So right here, you would have done what you did before, which was copy and paste this URL.

Right.

Send that on to the next person on the team. And again, they’re going to have all of the accumulated metrics that you guys have both contributed.

Yep.

To then start working on the app.

Yeah, that’s a perfect example. The network guy took a look at the network related metrics. The guy who’s responsible for the operating system and the virtualization layer took a look at that. Then he’s passing it on to maybe the application owner to understand what’s going on with the application.

Right.

So in that situation, I’m going to look for maybe connection-related information. How loaded is this server with connections? I can take a look at ‘Total Connection Attempts.’ I can look at this from both the total web server as well as the individual sites themselves.

Yeah there’s the sites. That’s cool.

So I take a look at the different sites that are having problems. And I see one isn’t very loaded at all. The other has a substantial amount of load. One thing that I can notice very, very quickly is that I have an understanding of how loaded I can allow my web server to become. And I can immediately point to the fact that we’re probably at max peak capacity here with the application itself. So maybe we need to go through and add another web server to this.

And redistribute the application.

Yeah, and redistribute, right? So I’m overloading. I can go take a look at my baseline data for here as well. Validate that I have a certain amount of connections that I’m willing to sustain over time. And find out, ‘Hey, this truly is overloaded.’ I need to make sure I rebalance my web servers.

There was an example from Cisco Live in Berlin where a customer was actually identifying a couple of things. One of them– he was using HTTP bindings to look for DDoS, right? Just the total number of connections– I’m sorry, connection attempts. Then he was actually using his HTTPS bindings for significant errors, right? So an expired cert that comes up on HTTP or HTTPS– they get an error. And then you get another connection attempt. So he was able to pop both of those out.

So I think from this perspective, the network guy and the systems guy were able to dig into this, understand where the problem wasn’t. Understand where the symptoms were. And then really, truly dig down and identify where the root cause is with the problem.

So Steve, if you would, let’s just add a transaction monitor here that’s going to give us the average response time for this application.

Okay. Let’s look at our transactions. We’ll click our entity. One thing, if you notice– I was searching previously for connections. I need to make sure that I clear that out so I can find the related information that I’m looking for. So in Web Performance Monitor, I’m looking for response time. I’m looking for information that’s coming back from that website.

And that’s a real page pull?

Correct, correct. So with that, I’m recording actual steps that I’m going through on the web page. And then I’m measuring the time that it takes for all of that to happen. So I can look at average duration. I can layer in ‘max’ and ‘min’ along with that as well. And then I can see what the time was to actually get a response back from those pages.

And go ahead and add the alert, because we’re here in an alert state. Okay, so what I really like about this– and this is something that I realized just last week, especially at Cisco Live– is that one of the problems this really helps solve is: in this application, there’s a lot of moving parts. Where’s the problem?

The problem’s right here.

Yeah, the problem’s right there. Now if you ask someone who is trying to remediate one of those root causes– let’s say it’s a LUN performance issue– they would say, “Oh, the problem is my storage is slow. Or my VM is overwhelmed. I need to reallocate VMs on that machine. It’s a noisy neighbor problem.” That’s not the problem. The problem is this: this availability issue and the performance round-trip times are what’s driving the help desk ticket, right? So from the user’s perspective, this is the problem. And I recommend, any time you create a PerfStack view, make sure you have that top-level metric of what’s driving the help desk ticket, and leave it at the top. Because everyone else on the team is going to be using all of these other metrics to remediate the individual system problems, right?

Right.

And you assume that if I address the issue that’s causing one of the subsystem spikes here, this is going to improve at my top line. But if it doesn’t– and how many times do issues actually have more than one cause?– this isn’t going to improve. That’s going to tell me there’s something else. That stops the pattern where a user opens a ticket, you say, “Oh, I know exactly what the problem is. Here’s the root cause.” You remediate it. What do you do? You close the ticket. And then what happens? It doesn’t get better, because there’s something else that’s a contributing factor, and they reopen the ticket. This allows you to really work through that, so that the teams are looking at the metrics they need to address their systems, while also looking at the thing that opened the help desk ticket in the first place.

Always paying attention to the top line.

Yeah.

Literally.

Yeah. All right now. I have to, of course, hack this. Because I have to. So let me get in here for a second and let me show you a couple of things. One of them is it would be really nice, a lot of times, if you get an alert, especially for a recurring problem, to get the PerfStack view related to that issue as a part of the alert. Really handy, right? So there’s a couple of different ways to do that. I’m going to use a different system here that’s just really super-duper simple. I’m going to come back over here to my Performance Analysis and hey, you’ll notice that this menu is simpler. So it’s actually configured vertically instead of horizontally. So I’ll come back to Performance Analysis and I’ve got one that I’ve already saved, just as an example. So then, this is literally– I just set this up a second ago. We’ll call this Lab Test 1. How original, right? So this one is, basically, CPU utilization response time and packets transmitted on an interface. So a little bit more kind of network-based. This also, I love this one. Because this is a multi-value stack chart.

You have to prove it’s not the network first.

No, you guys proved that it wasn’t the network and it was the application. I don’t have to prove that. And I do like to say that it’s not the network. Okay, so this is one– let’s just say that I’ve set this up because I’ve got one system that just throws a bunch of errors. It is doing bad things to packets and I would really like to be able to watch it over time. One of the things you’ll notice is that the URL is a little bit different up here at the top because I’ve saved it. And it’s a shorter version of it. So basically– I’m willing to bet this is like pre-saved something or other. And then that’s effectively the ID. And the cool thing is, as you edit these things, that ID, once you save it, won’t change. So you, as a team, can evolve these things over time. But what would be nice is if I could set that for a particular node. Well, here’s how we’re going to do that. The first thing that I did was I added a custom property. Now the node that this guy is on is down here. So I’m going to come over to this lab exchange hub and let’s go take a look at that thing. I’m looking at the Summary View page for this node, right? This thing is in alert state. And if I scroll down here, you’ll notice that I’ve added a custom property, in this case, which is ‘Alert Saved PerfStack.’ And look at what I put right in there.

That’s a good ID.

This happens to be the ID, which mapped to that view.

Getting very hacky.

It’s very hacky. But let’s say this is an ongoing problem, and I’m watching over time. I’m trying to catch it in the act. And I want to be able to dispatch that to somebody else. I’ve got that. That tells me that for this node, this is the view, the PerfStack, that I’d like to have. And it’s always going to be active. Now, what I need is to actually capture that and send that as part of my alert. You can probably see where I’m going with this. I’m going to go over here and we’ll take a look at our alerts. And you will notice that in my alerts, I have created one. Got to put a ‘bang’ in front of this.

“Lab Alert,” yeah. “Lab Alert 1.” Keeping that simple.

Yes, it says alert me when that frigging server next door is throwing errors. Let’s take a look at what that thing’s going to do. It’s pretty basic, right? If I walk through this thing, let’s take a look at what it’s filtering on. So this one– the trigger condition is: if the node name is equal to that, and it has an interface with more than 10 receive errors, then it’s going to fire. And what it’s going to do is set the trigger action. And I’m going to include that URL for that ID. Now, I could’ve just put the whole path in there, but that doesn’t make any sense. I just want the ID, right? So basically, I’ve said “PerfStack here,” and then I’ve got my path: UI/PerfStack/PSTK hyphen. And then right here is the variable that’s actually going to pull that value out and pop that right in.
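The link Patrick assembles in the trigger action can be sketched in a few lines. This is a hypothetical illustration, not the product’s API: the server hostname and the exact URL casing are assumptions based on what’s described on screen, and in the real alert action Orion substitutes the custom property value via an alert variable rather than a function call.

```python
# Hypothetical sketch of the alert trigger-action link described above.
# ORION_BASE is an assumed server name; the 'PSTK-' prefix follows the
# spoken path "UI/PerfStack/PSTK hyphen" plus the saved view's ID.

ORION_BASE = "https://orion.example.com"  # hypothetical Orion server

def perfstack_alert_url(saved_view_id: str) -> str:
    """Build the link to a saved PerfStack view from its ID.

    saved_view_id is the value stored in the node's custom property,
    i.e. the part after 'PSTK-' in the saved view's URL.
    """
    return f"{ORION_BASE}/ui/perfstack/PSTK-{saved_view_id}"
```

In the alert itself, the argument would be the custom-property variable rather than a literal, so one alert definition serves every node that has the property filled in.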

So you’ve done this for one node. But because you’re using that variable, the custom property, any time you fill that custom property, this alert would populate that as well, right?

Yes.

So it could mean many nodes.

It could be many nodes. And this is a simple one– to your point, like we talked about before– grouping by group. So maybe I want to reuse one. Or I had multiple applications running on one set of infrastructure and a lot of the variables are the same. But to your point, some of them might be different. So how could I dynamically do this? Ahh, well this is where it gets really fun. Remember, this is based on a URL. If I look up here, this is one I was working on interactively. I haven’t saved this one. If you look across the top, and you’re familiar with SWIS– not that I would talk about SWIS– you will notice some things that look awfully ‘SWISy’ up here. So basically: the instance of the server, underscore Orion, then the object type, then what it is. And then that’s associated with a metric that has a name like, in this case, InterfaceTraffic, dot, InMaxMulticastPackets. So you’ll notice a semicolon right here. That is actually delineating where there’s a stack, and then there are commas separating the metrics inside of that. One metric, like this multicast packet traffic, comes right up to this comma. So that is effectively one component in a PerfStack. If I know, for example, that the node interface ID is one of the things that’s dynamic– in this case it’s underscore 24– well, I can make this thing dynamic, to your point, for any alert of a particular type. I’m going to come back over here to my alert definition and I’m going to paste this guy in here. And then, to get variables, I need the node ID. So I’m going to say ‘insert variable.’ I’m going to go up here. It’s just global for this one because I am technically on an interface. But I’m going to need the node ID. And I’ll say ‘insert variable.’ And here’s my node ID, right? So that’s the node ID. It’s a SWIS entity and then a node ID. That’s going to break out just the number that we needed. So I’ll say ‘Control-C’ on that and I’ll come down here. Everywhere we see the node ID– right here.
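The URL anatomy Patrick walks through– semicolons separating stacked charts, commas separating metrics within one chart, and an entity ID embedded in each metric identifier– can be sketched like this. The entity and metric names below are illustrative guesses at the on-screen format, not documented identifiers, and the `${InterfaceID}` placeholder stands in for whatever alert variable Orion actually substitutes.

```python
def build_charts_param(charts: list[list[str]]) -> str:
    """Join metric identifiers the way the PerfStack URL does:
    commas within one stacked chart, semicolons between charts."""
    return ";".join(",".join(metrics) for metrics in charts)

def templatize(charts_param: str, literal_id: str, placeholder: str) -> str:
    """Swap a hard-coded entity ID (e.g. interface 24) for an alert-variable
    placeholder, making one saved URL reusable across nodes/interfaces."""
    return charts_param.replace(f"_{literal_id}-", f"_{placeholder}-")

# Illustrative metric identifiers, loosely modeled on the URL in the demo.
charts = [
    ["0_Orion.NPM.Interfaces_24-Orion.NPM.InterfaceTraffic.InMaxMulticastPackets"],
    ["0_Orion.Nodes_5-Orion.CPULoad.AvgLoad"],
]
param = build_charts_param(charts)
dynamic = templatize(param, "24", "${InterfaceID}")  # hypothetical variable name
```

The point of `templatize` mirrors the demo: one alert definition can emit a working PerfStack link for any triggering entity, because the only per-entity part of the URL is the ID the variable fills in.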

When this alert triggers, it will know what node is triggering the alert and then this variable will contain that and you embed that right in the URL.

Yep.

And then now, you’ve got all of the metrics that you need, for that particular node ID to solve that problem that you’re familiar with solving.

Right.

And it’s been embedded in the alert URL.

It’s embedded in the alert email, and in this case I’d also need to add it for the interface ID. You can see that over here– it’s 241. I would do the same thing I just did for inserting the node ID. But it means that I can apply this to any number of different types of nodes or applications or anything else when that alert gets triggered. How many times do we talk about reducing the friction of people responding to alerts, or reducing alert spam, or getting people to take action? If alerts are actionable, people are more likely to do something about them, and actually be happy to receive information that something about their system is bad.

Right.

Well in this case, they get what looks like a customized problem view that was raised by that alert. And also, once the alert is raised, it’s going to be in the alert itself. So even if I’m not receiving it– if I’m just going through those past alert actions like we saw across the top– I can go and look at those views in time and see everything that I need right there.

Yeah, it’s one of those unsung powerful capabilities of PerfStack that we intentionally put in there.

Anyway, it let me hack a little bit. Awesome. All right, Steven. You’re up. AWS monitoring in SAM.

So it’s a pretty cool feature.

It is a pretty cool feature. And it’s funny, because you guys have been asking for this for a while. And we said this is one of those things that is tough. Especially at SWUG, because they said, “You know, I’ve just got virtual resources, and whether they’re in AWS or in my VMware infrastructure, they ought to look kind of the same.” And I knew we’d been working on this for a while, and I had to look at you guys and say, “Uh, yes. Watch what we’re working on now in THWACK. You should check it out.” And finally, it’s available.

So now, we’ve got it in the release. It’s really, really easy to get to. You go to ‘My Dashboards.’ Look for ‘Cloud’ underneath ‘Home.’

I do like the fact that once you’ve got some time invested in a PerfStack view, it asks, “Are you sure you really want to nav away?” Because every now and then, you might not want to.

So this is our Cloud Summary page. What we’ve done is we’ve gone directly to AWS. We’ve queried publicly available APIs. And we’ve gathered information about the state and the status and other configuration information about those cloud instances. And the great thing about it is we’ve taken what we understood from what’s running in the operating system, so the OS metrics, the application workloads– we’ve married that together. We want to show that complete 360 view of what’s going on at the cloud layer, as well as what’s going on in the individual instances that people are running. And we came up with this capability.

So before, we would have recommended– or I recommended on several occasions– that you either use an agent, or maybe poll directly, and treat the OS itself– the EC2 instance– just like a regular node, and poll it as a node.

Right.

But what you’re saying is you’re connecting that with everything that’s coming, I guess, out of the CloudWatch API.

Right.

So that you can recover things that you didn’t have before, like visibility for interfaces and volumes. Performance becomes opaque when something is in the cloud.

Right, exactly. And you kind of hit the nail on the head there. We have several customers that have been monitoring cloud resources for quite some time now. So what we needed to do was be able to take and understand what they were doing with Orion today. And then merge that information with what we were going to collect directly from AWS.

Well, or even one of the things that was interesting this year was I talked to maybe a couple of dozen of you guys that actually migrated your entire Orion platform into AWS or Azure. And the thing that’s really interesting about that, and I’m going to use air quotes because I’m actually quoting someone from VM World. But he said, “Yeah, well what we do is we migrated the Orion server and then we just put a remote poller in our legacy data center.”

“Legacy data center?” Huh.

And I reminded him, like, legacy data center? You lived and breathed that data center for years. He’s like, “Yeah we moved past that. Anyways, so we just monitor it like anything else.” Anyway, that’s good.

Well good. So what I want to show really is how we present the cloud specific views in relation to the Orion detail views that you’re very, very familiar with. So when you go to the Cloud Summary, you’ll see several different resources that give you insight into your cloud environment. In the top left, you’ll very quickly see the Instance Summary. The whole intention here is this is all of the resources that you have running– all of the cloud instances that you have running with any of the cloud accounts that you’ve configured. You could have multiple configured. You could have one configured. All you really need is the right credentials from your cloud account. We give you the information on what permissions that you have to have. And then you set those, enter those into Orion, and then we will discover all of your cloud instances.

You know what we ought to do? I’ll throw those up as a note, because I had that same question before. So I’ve got the IAM role access definition. It’s only like four or five lines– to give you read-only access. And then you can set that up. And definitely do not use a direct login. Create an IAM role for this so that you can tell your cloud team that you are not involved in anything else in AWS. If you expand over here on the cloud server infrastructure– the interesting thing about that– and just break out one of those availability zones, it will actually populate this for you. You don’t need to. And depending on the visibility of that IAM role– you can actually have multiple IAM roles, so that you can cross multiple separate accounts. If you’re truly hybrid cloud, you have different accounts. You can pull all those into one aggregate view.
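As a rough sketch, the read-only policy attached to that IAM identity looks something like the fragment below. The exact actions SolarWinds requires are documented in the Customer Success Center, so treat this list as illustrative rather than authoritative:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:Describe*",
        "cloudwatch:GetMetricStatistics",
        "cloudwatch:ListMetrics"
      ],
      "Resource": "*"
    }
  ]
}
```

Attaching a narrow policy like this to a dedicated monitoring identity is what lets you tell your cloud team, at a glance, that monitoring can read but never change anything in AWS.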

Yeah, you can see everything. Many people have instances running in different regions. They’re running in different availability zones. We can present all of that information right here. And you can get a complete understanding of what are all those resources that you have running, so you can keep track of them.

Infrastructure as a service. Platform as a service is a lovely thing.

It is. It is. So one thing I want to highlight in here– again, we’ve talked about that marriage of the AWS data that we’re collecting and the data that Orion’s collecting from the operating system– I want to show you how we really marry that together. So if you notice in this resource, we have a set of metrics that may or may not be available. The reason why is there’s some data that you can’t necessarily get directly from the CloudWatch APIs. So what we can do is that any of the data that we’re collecting directly from the operating system, we can overlay that into this view. So one of the things that you’ll notice is, if there’s not a metric available, we’ll let you know, hey, this isn’t available. But you can manage this as an Orion node and then start collecting that information where available. So you see here, we have one instance that is managed as a node. We’ve collected all of that data. We have another instance that is not managed as a node and there’s some data that’s potentially missing there.
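Conceptually, that overlay works like a merge: start from what the cloud API reports, then fill the gaps with agent-collected OS metrics, which CloudWatch alone can’t provide. A minimal sketch, with made-up metric names for illustration:

```python
# Sketch of the "marriage" of CloudWatch data and Orion agent data:
# cloud-API metrics form the base, and agent metrics fill in anything
# the API can't report (e.g., memory, which classic CloudWatch doesn't
# expose without an in-OS agent). Metric names are illustrative.

def merge_metrics(cloudwatch, agent):
    merged = dict(cloudwatch)
    for name, value in agent.items():
        # Agent data fills gaps and, where both sources exist, wins,
        # because it is measured from inside the operating system.
        merged[name] = value
    return merged

cw = {"CPUUtilization": 41.0, "NetworkIn": 1.2e6, "MemoryUsed": None}
ag = {"MemoryUsed": 2.7e9, "DiskQueueLength": 0.3}

combined = merge_metrics(cw, ag)
```

The instance managed as an Orion node ends up with the full `combined` picture; the API-only instance is stuck with `cw`, gaps and all– which is exactly the difference shown in the resource above.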

And like this Chef server right here would probably be Linux. So you would be using Linux Agent to get that information.

Right. Now the other thing that’s very interesting is, if I open up one of these details, again you’ll see a very familiar set of resources if you’re an Orion user. We’ve got configuration information. Notice again, there’s region. If it’s part of an auto-scaling group, we actually provide that information there. We can see additional configuration details: what platform it’s running on, what AWS instance type it is– all of the information that’s important for you to understand what this instance is doing and running.

I use it just to identify systems and for access. Like– scroll it– I can scroll here. For example, how many times are you in the bad habit of logging in to the console– the AWS console– just to figure out what the DNS name is to get to something, right?

Right.

Well, here I can see that directly. So I can take a whole set of hands out of that console that would normally log in for even basic information, just inventory information. It’s available here, so it allows me to kind of safely delegate a little bit of that management to an additional team.

So that is another one of those intended purposes we have, right? We want to keep everybody within a view that they can do their work in, and not have to traverse a lot of different consoles to get the information they need. So again, you’ll see several different resources that are very familiar, right? We have CPU Load. We have Network Utilization. These are all very familiar resources. This is information that we’re pulling from CloudWatch. Now what’s interesting is when we’ve actually managed this as a node. Let me open this one up, and you’ll see something else kind of interesting about the difference. We went directly to the node details page here. So again, this is something that Orion users are very familiar with, in terms of node details. This is an Amazon Linux AMI machine type. You know, this is all Orion node-based detail. What happened to my cloud details? You may or may not notice that we have a Cloud tab sitting over here on the left. So just like in the other instance that we saw a moment ago, all of that cloud-related data is available– even for instances that are managed as Orion nodes. So again, this comes back to: we want to provide both sets of information together as one, so users don’t have to traverse a whole lot of different views to grab that information.

And I can correlate that in alerts and reports and anything else.

Correct. Correct. You know, anything that you need to understand about this instance that’s running so you can troubleshoot application workloads is all available to you here, right? I’ve got the cloud infrastructure as a service details layer within my cloud tab. I’ve got my node details layer here within the summary view. And if I have applications running on this, I can actually see that information as well. So with AppStack-related information, I understand what my server is. I can even understand what applications I have running on that cloud instance inside of AWS. That is, in essence, our AWS monitoring feature.

Wait, wait, wait. So don’t underestimate what you just said there. Which is AppStack, which is very powerful. We’ve been using it for a long time– now includes AWS data!

Exactly. So, as long as we’re managing this as an Orion node, we have the ability to understand.

Got it.

Again, from end to end, what’s going on there. If it’s just monitored as a cloud instance, you know we’re just pulling that API data from AWS. We don’t have the relationships established. The moment you monitor that as an Orion node, now you have all of the information that is available and those relationships that we can create with Orion nodes.

Well, especially if one of the reasons that drove you to cloud was increased availability or failover, because it combines data for multiple availability zones. It means, in that AppStack view, you could actually see what is essentially a distributed HA or DR network, where you’re using cloud to provide that capability.

Right. Now, one thing else that I want to show– We were talking about PerfStack earlier– one thing that we made sure that we provided in here was, you know, all these metrics that we are collecting from AWS. Those can help in the troubleshooting process. We also wanted to make sure that that would be available in PerfStack as well.

Yeah, because when it is the network problem that will affect your cloud-delivered applications.

Exactly. So when you’re in PerfStack, if you look for a certain type– I’m looking for Exchange applications, I’m looking for different types of devices– I look for my type. There’s a cloud instance type available here. So if I select that, now I have all of my cloud instances as entities that I can add to PerfStack. So I’ll add one of those in here. And then that set of metrics that we were able to poll…

28 metrics. Okay.

Exactly, from AWS, from CloudWatch, is now available so you can plop it onto PerfStack. I can look at things like availability. I can look at average CPU load. All of those basic metrics are available.

But what I was actually going to say: things like basic network rates, or the IOPS performance that you would normally get for storage– those are the first things that sort of disappear when you push something out to the cloud, because I can’t get that information. Well, CloudWatch knows.

Exactly.

But then, if you want to go pull that all together yourself, that can actually be complicated. So here it’s just pulling it for you from the API and making it visible again.

Right. And because we’re monitoring this as an Orion node as well, I can use those related entities and go find application data, as well as any other relationship data that we have inside of Orion.

Awesome. So the only thing that I need for this is SAM. It’s in 6.4. It’s just part of the upgrade.

And one of the great things is we have it in VMAN 7.1 as well.

Ooh, nice. All right, Chris you’re up. Mister One Word. And that word is?

Meraki.

And I have to say I spent a lot of time in the Meraki booth at Cisco Live in Berlin, so much so, talking to them about APIs, that they gave me a couple of pairs of the Meraki socks. Thanks again, guys. [Chris laughs] That was really nice. So yeah, this has been a feature that you guys have been asking for a while, and I wanted to just show real quickly how that works so that you can get it configured.

Yeah, I mean Meraki is super cool technology because it’s all cloud managed, and it makes things like zero-touch deployment really possible. So, really cool technology. The challenge for us was that all of the information we needed– that you guys need in your wireless views– wasn’t available via SNMP, the traditional, one could say ‘legacy,’ protocol for getting all of this management information. But it was available via API. So we worked with Meraki to get that information right into the interface here. So what do you do when you want to add a node? You go to the Add Node wizard, and the same holds true for Meraki Wireless gear. So here, if we scroll down, we can see the polling methods: external status, SNMP, WMI– I don’t even know what that is– and Meraki Wireless, right? This is API-based polling. So if you select Meraki Wireless, a couple of things happen right away. One is that dashboard.meraki.com is your target. So we’re going to poll to the cloud, right? The next one is it asks you for your API key. If you go to your Meraki dashboard, it will provide you with this API key– just a long string of letters and digits that’s going to act as your sort of unique identifier and your password in one. So you paste that in here. You get the organization list. It will list your organizations as you’ve configured them in Meraki, and you select one and move forward. After that, it’s pretty much the same as you normally would see. In ‘Add Node,’ you can select your APs, and from that point on, we can pull in the APs automatically as you zero-touch deploy them, thanks to Meraki.
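Under the hood, this is plain REST polling with the API key sent as a request header. A minimal sketch of what such a call looks like– the endpoint path and header name below reflect the Meraki v0 API of this era, so check Meraki’s current API documentation before relying on them (the newer v1 API uses bearer-token authentication):

```python
import urllib.request

# Sketch: an API-key-based poll of the Meraki dashboard, as Orion's
# Meraki Wireless polling method does behind the scenes. The v0 path
# and header name are era-appropriate assumptions, not guarantees.

API_KEY = "0123456789abcdef0123456789abcdef01234567"  # placeholder key

def org_list_request(api_key):
    # Listing organizations is roughly what the Add Node wizard does
    # after you paste in your key.
    return urllib.request.Request(
        "https://dashboard.meraki.com/api/v0/organizations",
        headers={"X-Cisco-Meraki-API-Key": api_key},
    )

req = org_list_request(API_KEY)
# A caller would urllib.request.urlopen(req) and JSON-decode the body
# to get the organization list shown in the wizard.
```

Because the key is both identifier and password in one, it deserves the same handling as any credential: generate it from a read-only dashboard account where possible, and keep it out of source control.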

So basically, the organization replaces what would have been the controller for thin APs.

That’s right. That’s right. So it’s all cloud controller. We’re just going to poll that thing, and the organization is the key there. So once you’ve done that– so this was actually a really interesting thing. Because we had some customers asking for Meraki Wireless monitoring. And I went and talked to them, right? And I was thinking, “Well, Meraki Wireless has some wireless things, but fundamentally the technology is a little bit different in several ways, right?” It’s cloud managed, for one. Some of the information’s a little bit different. It feels a little bit different. So I expected users to ask for a different view. But they were very clear. All of the users I talked to– it was unanimous– were very clear that they want the same information. They just want it from Meraki too. So, particularly for organizations that are moving to Meraki– and for big organizations, this could take years, right? These access points are distributed all over the US, all over the world. So you need to have your old legacy, traditional wireless controllers and your new Meraki all in one view.

In a blended view.

That’s a common theme, I think.

Yep.

We saw that with the AWS monitoring. People want to see things…

That’s right.

The way that they’ve seen them before.

That’s right.

The same thing holds here.

And the meta-story there is, with hybrid in general, we’re going through a transition where there are a lot of different environments that we’re migrating technology into. And you can’t just rip and replace and cut over.

Yeah.

You need to actually be able to evolve, because it is an evolutionary process that we are all going through together as IT.

Yeah. And evolve means identifying where you need new information and new visualizations, but also identifying where you had the right ones. So you have to use both.

All right. So what’s it look like?

So once that’s done, we just populate the views that you’re familiar with already. So if I go to Summary, I can see Meraki Networks. I’ve got my controllers right here. If I click on one of these guys, it will bring me to the Wireless Controller view. Show me my access points. Now some information doesn’t make sense to have here, right? Your average response time, your packet loss up to the dashboard controller doesn’t really matter here. So that’s not included. But your access points totally are included because that is important. Most notably, the Wireless view is populated. This is where most people go.

They understand it.

Yeah, yeah. So I’ve got all of my access points that this instance is monitoring, shown here. And we can see there’s a bunch of Meraki lab access points. And we’ve got clients. So this is one of the key pieces of new information that’s available from the Meraki API that was not available via SNMP. So we’ve got the list of clients– turns out that’s important if you want to monitor wireless. We’ve got their identification information in terms of IP address, MAC address, how long they’ve been connected, the data they’re sending and receiving. So really, a lot of the data points here you know and love, and you know how to use and act upon. So that’s great. Now there are a couple of pieces of information that are missing. So as Meraki builds out their API and makes more data available, we’ll look into adding those pieces of data as well. But we’re really excited. It’s just yet another situation where we’re moving forward with API-based polling. So we do UCS, we do ServiceNow integration, we do AWS, we do Meraki, we do F5. There’s just lots and lots of API data. We’re pulling all of this data together to show the complete story.

This is an example of why we prefer to use APIs any time we can, right? We use SNMP for network monitoring because it’s ubiquitous, and there are a lot of systems that just don’t offer anything else. But even UCS, for example, offers an API. And APIs are just much more stable. They’re richer in a lot of cases, and in this example, give us things that we can’t get any other way. The other thing that was really interesting– and it was actually one of you guys at Cisco Live Berlin who brought this up– was that he had been using the Meraki dashboard. They’ve transitioned about half of their APs over, and they’ve still got some old thick APs, while everything else is thin in their environment. And when I showed him this, he said, “Well, I know what other app I’m going to get to uninstall.” Right? So, he’s going to get away from that one last dashboard, because it’s just integrated now along with everything else.

Yep, absolutely.

Okay. There is one more thing that I would like to show, just because.

By all means.

All right. You guys have been asking for this for a long time, and I just wanted to show it real quick. What do you see? You’re looking at the Node Details page, which is normal stuff. We’ve got node details. We’ve got data, IP addresses, we’ve got our management interface. Anyone see anything new?

Where did that ‘Manage’ go?

Where did that go? [Patrick inhales deeply] Well, what is this thing? ‘Maintenance Mode.’ Ahh. We’re going to click this and drop that down. I can unmanage it and remanage it right from here if I wanted to. But better– ‘Mute Alerts.’ So continue collecting data, just do not send me alerts.

Yeah, it turns out, you know, when you’re using ‘Unmanage,’ you’re probably in a maintenance window, and you’re probably wondering, “What is my monitoring thinking of this system?” You probably need data about that system to troubleshoot problems. That’s the most important time to have data– you don’t want it all gone.

Right, you just want it to *******.

Yes, stop alerting me.

Stop bothering me.

That’s right. Or– what I really like– what does schedule do? Well, we remember scheduled ‘Unmanage,’ but now, if I want to, I can just do a ‘Mute Schedule’ for a maintenance window. Because I would want to know: is it up or down? Can I go ahead and verify that the changes I committed are working the way I expected? I want to gather data all through that process. Just don’t send me alerts, because I’m in maintenance mode.

Easy enough.

Very simple.

How did I know you were going to sneak in ‘Mute?’

Because they’ve been asking for it at SWUG. It’s finally in Orion so, tah dah!

“Tah dah” is not shtick– but it is silly.

Okay. Well, what wasn’t silly was you guys passing PerfStack views back and forth, and building and transforming it as it shifted from team to team.

The only downside is it wasn’t the network.

Nope.

In that example. And hopefully, you also got a sense for how AWS monitoring works. And I think you can see where we’re going with that.

What? That it’s integrated into the platform? It’s not a separate module, and that it’s going to participate in more and more services like PerfStack?

Something like that.

Yes.

Meraki.

Ah, Meraki. It is the new hotness.

Didn’t you go with them for your house?

Dude, I begged every which way for their sales team to come up with some sort of lab-okay pricing, and could just never get there. I am however, still super happy with my Ubiquiti solution.

Your house needs multiple access points and controllers?

No, it doesn’t. He’s needlessly over-provisioned.

Ah, that is both true and terribly sad. Anyway, we hope this episode got you ready to create your first PerfStack views and share them with your team. Now, we’re going to have a ton of information on the Customer Success Center, more webinars and training videos rolling out as well. And as always, thanks again for making SolarWinds Lab so easy to do. Your suggestions and show ideas really do drive the show. So please visit our homepage, which is lab.solarwinds.com to catch up on previous episodes that you might find helpful. And be sure to sign up for the ‘Live Lab Event’ schedule where you can chat with us IRL.

Are you allowed to use IRL?

No, he is not. For SolarWinds Lab, I’m Chris O’Brien.

I’m Patrick Hubbard.

And I’m Steven Hunt. Thanks for watching. [Bright electronic music]