In this episode, Head Geeks™ Patrick Hubbard, Thomas LaRock, and Destiny Bertucci dive into the automation of mapping within SolarWinds solutions.
That video can only mean one thing, we’re adding another Head Geek. Please welcome our newest Head Geek, Sascha Giese.
Welcome, welcome to the team.
Hi, thank you.
It is great to have you. Let’s do the honors. As our former newest Head Geek, Destiny.
I would love to.
Destiny, you’re graduating.
I know, I’m so excited, I finally grew up and graduated.
Yeah, and now we got somebody new to clean the server room.
That is right.
That’s right, and so Sascha joins us actually from our European headquarters in Cork, Ireland.
Where I started more than four years ago.
Right, and that’s, well, the accent I recognize.
That’s what I got too.
It’s totally Irish.
It’s a little bit of that, and a little bit of German.
Well, just a little bit.
Yeah, just a little.
Yes, Sascha’s been with SolarWinds for almost five years now, he’s worked with lots and lots of you especially in Europe on the phone, and we are just thrilled to have you.
I think there’s going to be a lot of opportunities to actually see you, where are you going to be first?
Oh, my first appointment would be in Dubai at the GITEX in October, I think 14 to 18.
And don’t forget, you’ll be with me, VMworld Barcelona, that’s the fifth through eighth of November.
And, he’s going to be with me at the SWUG in London November 13th, and the SWUG in Frankfurt November 15th.
So, basically, we all just get to fly around and go to Europe and visit Sascha.
That’s what I got out of it.
Is that a bad plan?
That was the whole point, we were going to…
Guys, guys. Sorry, for me, as a German, it’s important that things are in order, okay? And I think we are in a Lab episode right now, aren’t we?
Shame on you.
Welcome to SolarWinds Lab, and today we’re diving into something you’ve been asking us to spend a little more time on. Mapping. And in particular, how to do mapping automatically. Joining me today are a couple other Head Geeks, Thomas LaRock and Patrick Hubbard.
Well, I was in the area, and I already have the lab coat, so…
Yeah, but you don’t have to wear your lab coat at THWACKcamp, and that’s coming up in a couple weeks.
That’s true, but this year they are going to make me wear pants.
Okay, yeah, well THWACKcamp will be fun, so make sure you register now, especially if you like THWACK points, we’ll throw in a link.
Yeah, it’s thwackcamp.com. But, yeah, you’ve been talking to us for a long, long time, like at Cisco Live! and other events, and you have asked us over and over again to show you some of the advanced features that you might not know about, but maybe they’ve been in the products you’re using for a really long time.
Right, and I spend most of my time working with databases and virtualization, and even if you limit maps just to those two, it’s still helpful, but I really wanted to be a part of this session so I can learn more about the mapping feature for the other products.
Ah, but we’re not going to limit to just that. Network maps, application maps, virtualization maps, cloud maps, they want to know about mapping, as in how many things can we get maps of automatically?
Okay, so if you’re going to show network mapping, and I’m going to do mapping for apps and virtualization, and then Patrick is talking about…
Cloud, sort of. I mean, don’t really think of it as cloud. More sort of how do I map things with lots of moving pieces? Sort of, distributed applications. And that’s things like tracing and some of the other tools that maybe you’re familiar with, like AppOptics, but how to use them in collaboration with the Orion Platform and a couple of other things, and I’ll take a couple minutes, we’ll talk about that at the very end.
You know I’m all about automating network maps. It’s just so handy. And I’m always on board with tracking down and proving, it’s always the application and never the network.
Now, that is not true. Well, except when it is true.
We’re about to find out. So, why don’t we start where everyone always seems to: the network.
It’s the network.
[whispers] It’s not the network.
I beg to differ. And I think I can prove it. I know you came here for a how-to-do network maps. Let’s start by talking about some of the new mapping features you might not know you already have. [electronic music]
Okay, so when we’re talking about mapping, I always love looking at the NetPath itself, actually.
I never get tired of looking at NetPath. It’s amazing, being able to figure out mapping, essentially mapping the internet, or at least the part that maps to a path, really, really helpful. Of course, what I think you like the most is network maps.
Definitely, being able to visualize where the network traffic is going and being able to pinpoint into it and event correlate, that’s something that I’m actually fond of.
And what I love is, it’s inventory for me. It’s basically taking all the goodness that makes AppStack work, and pulls it up here so I can quickly navigate. But before we dive into screens, let’s don’t do that just yet. And I promise, we are going to spend all this episode doing how-to’s on how to use these features. But let’s take a minute and just talk about mapping, sort of the fundamentals of mapping, and why do we care. So, why do we care?
So, for me, it’s visualization first, then alerting. So, I like to be able to know what is connected on the infrastructure. If I’m going to help build a better network, or to even try to have things be more secure, I need to know what is relying on my network and what is going on. So, I need to know all the pathways that are in use.
Absolutely. And so, for me, what I love about the map is the visualization, again, first of all just being a data person, loving the idea that you can visualize your data. If you can’t visualize your data, you’re probably not collecting the right data. So, for us to have that and to see the related entities, ’cause somebody’s always trying to blame a database, that poor database has done nothing wrong, and these maps help me understand everything that’s related, and that way I can track down that root cause much faster.
Well, I think what I like about it is, I admit that I don’t know how everything is connected. You can’t. It’s an incredibly long tail. There’s things that you work with over and over and over again, and then there’s a whole lot of things that can break, where you’re doing root cause troubleshooting, and being able to figure out what those are, you’re spending so much of your time just trying to figure out not just how things are connected, like at this case, at this layer, the first couple OSI layers, that’s easy. Something is plugged into a thing, or it’s configured and it’s bound to a port, or it’s attached to a MAC or it’s hooked to an IP address. But especially when you get into application mapping, or how VMs are connected or something else like that, well that’s not something that maybe we always know, and so you need to be able to essentially walk in real time dynamically into things that you’ve never seen before because it’s your guide into the unknown.
Well, and for me I think, especially with the network layer itself, when everybody is out there creating applications and they’re creating DevOps, like everybody’s trying to get the information that they’re going through there, I don’t know every tool that everybody’s using, ’cause there’s free tools that are out there, there’s things that are on the network that I wasn’t prepared to actually create for, right? And so, a lot of the times when we’re looking at the mapping and I’m seeing the traffic, and then I go a step further, when I start to analyze it and look at NetFlow data, things like that, I’m like, what is this? Like, how can I see this? So, being able to stay up to date with the traffic and the flow of how things are being used, helps me to mitigate as well as troubleshoot quickly, to figure out what I need to do, or if I need to apply any new policies, any QoS, anything of that nature, to make sure that everything has the vital necessary bandwidth that it needs.
Well, I think you hit on an interesting point there which is, it’s not so much about the maps themselves. Visio is an amazing tool. And you can create beautiful maps in Visio, and in fact you can throw them into the background in a regular Orion map, and they look great. Add your objects on top of it, add applications, that is really, really cool. But when you’re really talking about mapping, sort of at a fundamental level, you’re really talking about interconnections and how things are connected to get to the point that things are maybe dynamically mapped so that you don’t have to draw everything by hand. So, it’s sort of a different way of thinking about going beyond what you see visually, in any interface, but how the data is connected, because that’s really the magic of maps. Because it’s following the way our minds work, it’s following the way the applications work, so sort of learning to think about maps not so much as an art project or a visual design project, but more an information relationship project and then this just ends up being sort of the visualization layer on top of that.
And I would add, in the world of data and databases, where you’re getting your data from today, like back in the day, you had an idea, it was there, there, and that’s about it. These days, it can come from anywhere, right? Some dev or some business user just decides to connect to the database, and they weren’t there yesterday.
Just this one time.
Just this one time. And, so data comes in, data goes out. With the mapping, this is all data that we’re already collecting. It’s not anything special or new. We’ve already had this data but we’re just putting into these maps so now I can go in and say, “You know what? At that moment in time, this five or 10-minute window I can see where that database server was being touched or touching all these other entities.” And it helps people to reinforce the complexities of their data environment, to understand how much is coming in and going out.
I think from a security side, one of the points that I like with the mapping that helps me trace things out and to go through there, is that services are making calls with applications that maybe even the application owner doesn’t know. Like, maybe I’m going to go out and I need to do an address lookup, or maybe I’m going over here and doing a credit card transaction, I need to know this for my ACLs, I need to know what the prox– where we’re going with this, what needs to be allowed, what needs to be normal, what is not normal, what’s the exception? But I don’t know that until I have some kind of a visualization of actual represented data.
Yeah, maybe you also just don’t like getting yelled at because we always end up, and it’s not IT flows downhill, but how many times do we end up responding to events on systems that really, we don’t have any real authority over?
To your example, that one lookup, but it’s going to run for three years, teams are going to change, and somehow you have to be able to find it to troubleshoot but also just document it. Like, all of a sudden, you’ve got, “Hey, we actually have a compliance issue here. We need to show what all the interdependencies and the relationships are.” Mapping is a great way to do that because it lets people maybe who aren’t as technical, or they don’t tend to think of like, “Well I know everything in this subnet is somehow related, I’ll figure it out.” Those visual representations really make it easier to answer questions in a way that makes maybe less-technical management feel a lot better that you have it under control. I mean, it’s the same reason why if you’ve still got a plotter, every now and then create a big, beautiful, complicated diagram, run it off on the plotter and stick it up on the wall, and as leadership walks by, they’re like, “Wow, our team has just got this under control.”
Except that, everything changed the minute it was posted.
All right. So then, let’s kind of dive in and touch on these points, and show how we can represent this. And let’s start off with the network. Okay, so here’s a network map. Now, how I got this is, I went to a node and on the left-hand side, you guys may not know this is even here. But this is one of the options available now, is mapping. And so, when you click onto that, depending upon the type of node that you’re in, it’ll take you to its network map layout or its design technology layout, which you guys will cover in a little bit. So, when I’m here, if I click off it by accident, a lot of you have told us that you’ve done this, you can click onto the node to pinpoint all the connections and bring the sidebar back.
Right, so the sidebar again, is whatever’s sort of gray-highlighted here, everything that’s listed in the sidebar are the things that we know about it. So, like, in this case, I’ve got a physical port list, I can actually see the physical hardware sensors that are a part of that that are reporting. And then I can see things like, I think these are IP SLA operations that are assigned to those and I can see my logical ports as well.
So, go to something more exciting. This database server. What do those entities show us?
Everything that would be associated with that. So, we can also put the drilling in, and this from your SQL Application itself, the Overall Hardware Status on the box as well you can see here, your Interface, Volume, things that are connected to there. But on this one, when we go back in here, I want to focus on the network because, obviously, me.
Because that’s what the problem is, is the network.
Exactly. No, no it’s not.
It’s not the network.
So, when we go through here, you can see that it’s actually showing you kind of what a lot of people have asked for, weather mapping. So, when people say weather mapping, it’s not the actual weather mapping, but with us talking with the THWACK community, they’re like, I need to know, how is the flow going through here. Am I having a problem with the transmitted or with the received? Is there something that’s going on with this connection, do I have it in full duplex, is it half duplex, there’s a lot of questions that get answered that a map visually…
Nothing ever goes wrong with a port, ever.
No, never. And nobody accidentally shuts it off when they’re playing around with monitoring tools, right?
How’d that happen? So, when we’re showing this here, you see the red here obviously. Now, when I hover over this, it’s telling you that this is going to be the Error and the Discards.
If you notice right below it, it says Traffic Utilization. So, we understand that at the moment, you guys are concerned about the Traffic Utilization, but based upon alerting and thresholds that we know, if there’s an event-correlated issue that’s actually happening here, we will put the alert here. So that’s how you’re getting the Error and Discards, because we’re like, “Hey, the Traffic Utilization, we have that information as well, but we really think you should see that this is spiked up and that this is a problem here.”
So, one thing that’s a little bit different here, like I did make a little joke earlier about NetPath. But NetPath is different in that it’s automatically creating all of the dynamic thresholds for what is green, yellow, or red. But in this case, you’re saying that if there’s something that’s coming up red, because this is a network view here, in this case it’s Errors and Discards, that is something that would be hitting the threshold based on this element. Now, you might not have an alert actually defined for that. It depends, if this was really sensitive probably you would, or maybe this is extra noise that you don’t need so you wouldn’t, but that way you can at least know, if I have an alert defined on this threshold, if it’s red, I would be getting an alert.
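The threshold behavior described here can be sketched in a few lines. This is only an illustration under stated assumptions: the threshold values, the `LinkStats` type, and the function names are hypothetical and are not taken from the Orion Platform. It shows a link turning red when its Errors and Discards count crosses a threshold, whether or not an alert is defined, and the breaching metric being surfaced in place of Traffic Utilization.

```python
from dataclasses import dataclass

# Hypothetical thresholds; real values would come from your own polling settings.
WARNING_ERRORS_PER_HOUR = 1_000
CRITICAL_ERRORS_PER_HOUR = 10_000


@dataclass
class LinkStats:
    name: str
    errors_and_discards: int  # count over the last hour
    utilization_pct: float


def link_status(stats: LinkStats) -> str:
    """Return 'red', 'yellow', or 'green' from the error/discard count."""
    if stats.errors_and_discards >= CRITICAL_ERRORS_PER_HOUR:
        return "red"
    if stats.errors_and_discards >= WARNING_ERRORS_PER_HOUR:
        return "yellow"
    return "green"


def headline_metric(stats: LinkStats) -> str:
    """Surface the metric that breaches a threshold, else fall back to utilization."""
    if link_status(stats) != "green":
        return f"Errors and Discards: {stats.errors_and_discards}"
    return f"Traffic Utilization: {stats.utilization_pct:.0f}%"
```

Calling `link_status(LinkStats("uplink", 14_400, 12.0))` gives a red status from the error count alone, which is the point being made: the color reflects the threshold, and an alert is a separate decision.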
Yes, and actually what we were talking about in one of our THWACKcamp sessions that’s coming up, is Optimizing Orion, and within the settings for the polling settings, you can change those thresholds. So, if you don’t have alerts, that’s where you would look, why that would be alerting, ’cause you have thresholds that would be set there. Now, the great thing about these views that I personally like is when I’m troubleshooting a problem, you’ve got to pinpoint it, right? And so, we obviously know we’re starting with this node. I’m going to click into this transaction. I’m looking at the connections between these that are happening. On the right, you’re going to notice, it is going to pinpoint us down into just these two locations. So, I can see which ports are available, which devices are on there, and then I can visually see, okay, this is the one that’s having the problem, it’s at 14.4k inbound not outbound Errors and Discards that are happening across there, right? So, this automatically pinpoints my focus on the entities that I am monitoring and where I should be looking for this.
So, it’s also going to be context-based, right? So, you think of, click on the thing to get its relationships. So before, when we were clicked on this node here, on the sidebar we saw all of the things that related, the four things it was physically connected to. If we actually come into the link here, it’s just going to be narrowed down. So, in a way, these are kind of like drill-in auto-clickdown like anything else, and that’s the way that I typically want context, is that I’ll go and look at this and say, “Okay, well, what is the port?” Well, they’re not listed in the map, but here’s the physical ports over here. So, if I click on a port then I’m going to get all of its relationships.
Definitely. And then also, if we click into that device itself, you’re going to see all the ones that are associated with it pertinent to the ports and then the transactions, the voice over IP ops or operations, everything across there.
Oh, and look, there’s an application that’s– So this one I guess is probably coming out of WPM, and it’s suffering a transaction latency that’s having something to do with these Errors and Discards.
Right, and so when we drill into that to see, since we knew that it was coming from that device, all the other entities that we are monitoring will show up on the right-hand side. So, I’m already troubleshooting, and like you were saying, you’re like, “Oh, look, there’s a problem that’s going on here, these are probably related,” now I can go to say, Network Configuration Manager or something, and I can see what’s going on, drill more into it, or make a change that was happening there.
So, before this map existed, how would somebody have discovered these relationships between these entities in order to troubleshoot? ’Cause what you’re showing me is fabulous, you’re right down to the specific device and the port and all that. How would you have done that before this map?
Oh, we’re going to get into it. But now, you were doing enough with unstructured data to think about grids, right?
So, grid data, I can have things that are associated with different objects and then it sort of spiders out, and I can come into any part of that grid map and then walk, right?
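The idea of entering a web of related objects at any point and walking outward can be sketched as a breadth-first traversal. This is a minimal sketch under assumptions: the entity names and the `RELATIONSHIPS` table are invented for illustration and do not reflect how the Orion Platform stores its data.

```python
from collections import deque

# Hypothetical relationship data: each entity maps to the entities it touches.
RELATIONSHIPS = {
    "web-transaction": ["app-server"],
    "app-server": ["web-transaction", "core-switch", "db-server"],
    "core-switch": ["app-server", "db-server", "edge-router"],
    "db-server": ["app-server", "core-switch"],
    "edge-router": ["core-switch"],
}


def related_entities(start: str, max_hops: int) -> dict[str, int]:
    """Breadth-first walk returning each reachable entity and its hop count."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if hops[current] == max_hops:
            continue  # do not expand beyond the requested depth
        for neighbor in RELATIONSHIPS.get(current, []):
            if neighbor not in hops:
                hops[neighbor] = hops[current] + 1
                queue.append(neighbor)
    return hops
```

Starting from `"db-server"` with `max_hops=1` returns only its direct neighbors, much like clicking a node and reading the sidebar. Raising `max_hops` corresponds to clicking through to the next level.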
Well, when you think about, I mean, and you’ve been working with, for those of you who have been working with the Orion Platform for a long time, you’re used to Sonar. Once upon a time, that was just sort of IP sweeps, now it’s doing nearest neighbor and it’s doing a whole bunch of other protocols, and so all of that data as it’s collected is sitting and has been sitting for a very long time inside the Orion database. So, in this case, that kind of context happened during the discovery. So, the answer to your question was, how did we do it in the old days as admins? You know. You did this with databases. You’d go draw the diagrams, and you might log in to the configuration on a router or a switch, and you’d look and then you’d look at the application definition and then you’d collect it all, and you would draw it. And to your point earlier, it would be exactly as fresh and accurate as when you hit Print.
Right, we would trace and say look for hostnames, IP addresses, client IP addresses…
Wireshark maybe, it’s a little advanced for most DBAs I guess, but yeah, a few networking tools in order to figure out all the activity that’s happening. Or, what I want to get to is, you start writing queries against the monitoring database to build a view to say, “All right, tell me who was doing what with this entity.” And with the map, what you’re doing right here, it’s point and click, and it’s like, “Oh, let me go to this thing, oh and let me go to this thing, oh and now it’s over here.” And that gets you three and four levels left, right, however you want to say it, faster than any other way you’re ever going to get there.
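For a sense of what the hand-written query approach looks like, here is a rough sketch using an in-memory SQLite database. Everything in it is an assumption made for illustration: the `connections` table, its columns, the host names, and the timestamps are invented, and it is not the schema of any SolarWinds database.

```python
import sqlite3

# Hypothetical schema and rows; not the layout of any real monitoring database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE connections (observed_at TEXT, source TEXT, destination TEXT, port INTEGER)"
)
conn.executemany(
    "INSERT INTO connections VALUES (?, ?, ?, ?)",
    [
        ("2024-01-01 09:00:00", "app-server", "db-server", 1433),
        ("2024-01-01 09:03:00", "report-laptop", "db-server", 1433),
        ("2024-01-01 09:04:00", "db-server", "backup-host", 445),
        ("2024-01-01 11:30:00", "dev-workstation", "db-server", 1433),
    ],
)


def peers_in_window(entity: str, start: str, end: str) -> list[str]:
    """List every entity that touched, or was touched by, `entity` in [start, end)."""
    rows = conn.execute(
        """
        SELECT DISTINCT CASE WHEN source = ? THEN destination ELSE source END AS peer
        FROM connections
        WHERE (source = ? OR destination = ?)
          AND observed_at >= ? AND observed_at < ?
        ORDER BY peer
        """,
        (entity, entity, entity, start, end),
    ).fetchall()
    return [row[0] for row in rows]
```

Asking for `"db-server"` between 09:00 and 09:10 answers "who was doing what with this entity" for one ten-minute window. Each further level of relationships would need another query, which is the work the point-and-click map removes.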
Well, I was going to say, the funny thing is, many of you have been actually using this for years and you didn’t know it because Network Atlas maps– You know when you go in and you create the map and then there’s the Connect Now button?
That’s doing exactly what this is in terms of logically connecting those objects. In here, it’s now just dynamic and it’s being rendered as a part of that control.
And something, as a networker, what would happen before when this would happen, like you’re saying it was an application problem as we can see from WPM, the Web Performance Monitor, is that a user is probably having slowness right now.
Oh, I’m guessing.
So, I’m getting a phone call saying, “The network’s down.” [laughs] “It’s not working, I’m trying to go here and it’s not happening.”
Or you’re looking at your page abandon rates.
Right, the marketing team is upset because all of a sudden, their kind of upsell off of people who are browsing the site, is going down.
Or you’re getting that phone call and they’re saying something, and you have a long list of things that could be the possible issue.
And then that’s the list that you’re literally multi-threading your department for. I’m saying, “Okay, you take over the database side. See if there’s something going on over there.” And it’s like, “You take care of it, write your awesome queries and see if you can come up with a tool that can help us figure this out better” for the meantime. And then for me myself, I’m sitting here going to each device that I know could, might, be a part along the chain of where they’re going, and I’m running show commands, I’m trying to get, I’m using Wireshark, I’m using things like that, and I’m using all of these tools to try to get some idea of where something might happen. Here I’m visually clicking in, I’m going down the layers, I’m seeing only the entities that are related to these devices, I’m pinpointing it in, and then, if I’m wanting to look at these, I can drill into this device, the actual interface because I’m saying, “Hey, this is the one that’s having the problems.” And the great thing that I like about it is, no longer am I still going to the device, I’m just in this one tool, I’m scrolling down here and I’m automatically looking at that Interface Config. Because before I was doing that, I was going to the device, I had to get the show commands, or if I was using configurations I was looking through it, word find, I’m like find this, find this, find this, so I can figure out what’s going on. By having this tool and being able to pinpoint it in and drill into here, I am already troubleshooting as I’m looking, I already know what’s involved and then, when I want to go into the device, I can look at the interface, the config for it only, this is huge for me. I mean, as a networker, this is like…
That’s because you actually care about how ACLs are configured.
Right? So, I’m so excited that I’m able to just drill into here and be able to see the interface config, because it saves me time. I’ve already saved time by pinpointing it, and I’m saving more time because I’m not wasting everybody else’s time to tell them to go look at things and try to drill into it. I’m visually looking here, seeing the duplex, seeing everything that’s coming across here, verifying the health, because I knew exactly where to go to.
But what I was going to say, let’s say you’re a database admin, or as many of us end up being, accidental DBA…
Right, so you are not a networking guru. And so, you believe…
I can spell network.
Yeah, and you believe…
You believe that the way that applications work best is if the pipes are as wide open as possible, right?
And only for data traffic.
And only for, well okay, but, oh, I like where you’re going with this.
I can do that.
I like where you’re going with that. So, let’s say that you see an issue, you’re looking at a map, you start surfing the application perspective, and we’re actually going to show you this here in just a second, but you start coming from that application perspective, you normally wouldn’t go to the network because you wouldn’t be able to pull the config, you wouldn’t look at traffic-shaping rules, capturing classes, anything else, so to your point, like sort of optimization, maybe QoS to accelerate database traffic.
I could do a ping or a traceroute in order for me to understand latency between two points.
That would be the extent of it.
Exactly, but let’s say you are a DBA and you’re looking at the relationships in the application map, well if you click on the physical network that’s underlying that application, all of a sudden, I can send this as a link or a screenshot or whatever else or throw it in Slack to the networking team…
And they know what to look at.
So, do you want to talk just briefly about, is there anything in particular that we need to do for discovery, to make discovery more effective? Because again, the power of this is based on what it knows about on the networking side, how they’re related. And I suppose this is also a general tips and tricks recommendation for the way that discovery always ought to be done for automated dependency mapping, alert suppression, a lot of other things. So is there anything that they ought to make sure that they take care of, as a part of that initial discovery to make sure that the database is really populated with rich data?
With rich data? So, what I would suggest is that when you add in your nodes and you’re adding your interfaces, that’s within there, there is a layer 2 and a layer 3 that it will actually grab the information from in your list resources. Now when you’re doing this, I always suggest that you click those because as I always tell people with NetFlow, NetFlow is layer 3-slash-4 information, but you miss layer 2 information. So, a lot of times, people wonder why the bandwidth is off from the NetFlow data, and it’s because it’s not seeing layer 2 traffic. And the more complex environments there are, you’re not seeing everything that’s on there.
It’s only seeing what NetFlow is reporting, not the actual counters that are part of those interfaces.
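The gap between interface counters and NetFlow totals comes down to simple arithmetic. The sketch below uses invented numbers and hypothetical function names, and it ignores counter wrap and sampling for brevity. It shows the counter-derived rate and the flow-derived rate for the same interval, and the difference that the flow records do not account for.

```python
def counter_bps(octets_start: int, octets_end: int, interval_s: int) -> float:
    """Average bits per second from two interface octet-counter readings."""
    # Assumes the counter did not wrap or reset during the interval.
    return (octets_end - octets_start) * 8 / interval_s


def flow_bps(flow_bytes: list[int], interval_s: int) -> float:
    """Average bits per second from the byte counts of exported flow records."""
    return sum(flow_bytes) * 8 / interval_s


# Hypothetical readings for one 300-second polling interval.
interface_rate = counter_bps(octets_start=1_000_000, octets_end=76_000_000, interval_s=300)
netflow_rate = flow_bps([40_000_000, 20_000_000, 7_500_000], interval_s=300)

# Traffic the interface carried that no flow record reported, such as layer 2 frames.
unaccounted = interface_rate - netflow_rate
```

With these made-up inputs the interface carried more than the flows add up to, which is why relying on NetFlow alone can understate what is on a link and why the layer 2 and layer 3 options are worth enabling.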
Exactly. So, what I like to tell people is, especially with your mapping, to make sure you’re catching everything that’s connected, and if it’s not producing NetFlow then make sure you have that layer 2 and that layer 3 routing that’s actually being checked in your List Resources when you’re adding those devices in, when you’re doing your discovery. I always like to make sure that my rediscovery is on, and it’s set by default for 30 minutes. And so, depending upon how big your network is and how much load balancing that you’re using with additional pollers or the scalability engines, that is also going to determine that number. Which we talk about also in other Labs as well. ’Cause you’ve got to think of the database guy, right? Like, don’t make him crazy.
Think, think of the database guy.
Nobody ever thinks of us.
But we need to make sure that for me, my one big tip and trick that I could say, is network standardization. I would say that is also vital because as soon as you know what the naming schema is on the devices, that can also help you locate, understand, and know what’s going on with the device without even trying to think about it.
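To make the naming-standard tip concrete, here is a small sketch of what a consistent scheme buys you. The scheme itself (`<site>-<role>-<two-digit number>`), the role codes, and the function name are all hypothetical and chosen only for illustration. The point is that a name that follows a standard can be decoded without looking anything up.

```python
import re

# Hypothetical naming scheme: <site>-<role>-<two-digit number>, e.g. "cork-sw-03".
HOSTNAME_PATTERN = re.compile(r"^(?P<site>[a-z]+)-(?P<role>rtr|sw|fw|db)-(?P<index>\d{2})$")

ROLE_NAMES = {"rtr": "router", "sw": "switch", "fw": "firewall", "db": "database server"}


def describe(hostname: str) -> str:
    """Turn a conforming hostname into a readable description, or flag it."""
    match = HOSTNAME_PATTERN.match(hostname.lower())
    if match is None:
        return f"{hostname}: does not follow the naming scheme"
    role = ROLE_NAMES[match["role"]]
    return f"{hostname}: {role} #{int(match['index'])} at site {match['site']}"
```

A name such as `cork-sw-03` decodes to a switch at a known site, while a name that ignores the scheme is flagged and tells you nothing about location or role.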
Naming conventions? No, we should name things after Smurfs. I think that, yeah, that doesn’t work so well. So, so, oh, sorry.
No, go ahead.
Okay, but another reason I really like your recommendation to regularly rediscover, and in fact, I know it seems expensive, but occasionally global rediscovery, rediscover everything, and you can set the interval on that, you don’t have to do that every night. But the other thing is, when you look at the increase in capability of the discovery engine itself, from release to release, every time you upgrade, you’re going to often get additional increases in the richness of that discovery data. So, by rerunning discovery, you’ll see things start to snap together that otherwise wouldn’t if you don’t go and rediscover them every now and then. So that’s another good reason to do that too.
I think, and you’ve touched upon it, but the point I want to hammer home really, is that this is data we’re already collecting. Or have the ability to collect.
And have been for a long time.
And have been for a long time. And we’re just now starting to get these into visualizations for everybody. So, a rediscovery, the structure’s already there, we’re already collecting this data but now we’re able to, I’ll say, make it more magical.
What I was about to say, we’re about to start talking about application mapping as well. But before, do we want to give a shout out to those who are helping the UX team?
Because the Usabili-buddy program, first, is a chance to get as many THWACK points as actually doing a beta. But when you look at this, this is again, whether it was THWACKcamp last year we talked about some of the things that were coming along down the pipe for the Orion Platform, these are a lot of these things, especially when it comes to visualization. You are incredibly helpful, your comments in THWACK are great, and especially working with the UX team, you’re looking at literally what you’ve been asking for because you’re helping us do that.
Definitely. How about we get started more towards the applications?
We can do that.
Well, if we’re going to do that, I’ve got to drive. All right, so this is an application map and the world’s most simple application map for Exchange. So, we got two servers over here, I’ve got my east server and I can see all of its attributes, because again, these are all of the things that we know about it, and I can see services, mailbox source so that like AppInsight view because this is an AppInsight monitor. Now, I don’t necessarily want to show the map with all the mailboxes, I probably don’t want to do that, but I’ve got another server over here which is my west. And if you think about what’s going on here, this is a regular sort of clustered set of Exchange servers. Now they’re running in VMs, so they really don’t know all that much about themselves, about how they’re related physically. They just know that they, they know about each other, but they don’t really know much else. And I know you probably have an opinion on that, and…
Well, you mentioned virtualization, so I’m curious.
All right. Well, I think you’ll want to talk about this in a little bit here in a second, but the thing to look at here is that everything seems to be green. I don’t have any process issues down here on this server, and I can really quickly look at it this time and I can say, “I do have an alert on this Exchange instance, okay, where’s that coming from?” But my ports and everything else, the services that are running on the system seem to be okay too. But again, as Destiny was saying before, where normally I would get information about the connection, in this case, because I have something that would qualify for an alert that’s being surfaced and applied to the top here in red, I’ve got a lot of latency there. Okay, so the question is, where is that coming from?
It’s not the network.
It’s not the network.
What else is that 88 milliseconds saying?
See, I think Tom’s point is, well, where is it occurring if it’s not occurring on the network?
It’s the age-old question.
That doesn’t mean…
Let’s get to the prove it part.
But that doesn’t mean it’s the network’s fault, right?
Oh, I didn’t say that.
Okay, I did.
You did say that. So, I’m going to click on, guess what, 88 milliseconds. And so now I can actually see the relationships between those two again. So, it’s not the context of east or the west server, it’s how are they connected? And so, if I look over here on the right now, I’ve got something a little bit different, right? These are actually the application connections. Oh, look, there’s my 88 milliseconds. Now, both of these nodes are green because those nodes are okay. But if I drill into this, I can actually see the traffic mapping between these two. So, this is coming from when you install the agent, like for those of you who’ve clicked on up under Start, Quality of Experience, being able to actually sample, it samples the traffic, not samples the traffic, it’s watching the traffic. It’s that little shim driver that installs as a part of the traffic analysis agent. I can actually now correlate where all of those conversations are. So, in this case, here’s east and west, I can see the two servers on both sides of that conversation, because remember, when I clicked on the center of that, it’s telling me how they’re related. So, they’re related elements here. And here I can actually see all my traffic going in this case from west to east, I can see the ports that it’s flowing over, I can see also the executables or at least the services that are exposing those ports and that are running that traffic, and here I can see my 88 milliseconds and I can see which applications are actually red. So now I know you’ve been trying to blame the network the whole time but let’s just assume it’s not the network and it’s the application. So, it’s the application, right?
No, it’s not the application because an application is a victim of the…
The network [laughs].
The resources that it sits on.
Especially if it’s virtualized. So, when applications have problems generally, it’s going to be about resource consumption. It’s going to be about memory, or CPU, or it’s going to be about storage or IOPS or something else, right?
So, in this case, using the map, and I’m essentially walking through the map, I went from two servers that were related by, in this case, the traffic that’s flowing between them, that don’t really know too much about how they’re topographically connected in terms of which one is on which VM host and the rest of it. But now I can take this down and say, “Oh, I’ve got an Exchange store service issue.” So now I’m going to drill into that service…
So, before you get there, I just noticed this 88 milliseconds was on that original graph, right? And I was wondering about, because there’s only one alert there, but what I’ve noticed is that this is the summary. 88 is the sum total, 70 is the bulk of it. Now, this two milliseconds might be a warning ’cause it’s a yellow and all that, but obviously this is the thing that we want to focus on. So, what I really liked was basically that hierarchy. I know 88 isn’t, like you said, this really isn’t the issue, service host to here, although a traditional tool might just be showing you that, the total sum latency. We’re giving you granularity into understanding all this. You might have to come back later and deal with these yellows for a secondary issue.
Oh, you are.
Yeah. But right now, we can focus on that red one to get started because the store service, because we know that’s the bulk of that 88. This is, for a data person, I can’t tell you how much I’m just loving all of this structure and summary and granularity and helping me drill through.
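The summary-versus-breakdown idea discussed here can be sketched in a few lines of Python. This is only an illustration, not how the Orion Platform computes anything: the 88 ms total, the 70 ms figure, and the 2 ms warning come from the conversation, while the service names and the 16 ms remainder are hypothetical values chosen so the parts add up.

```python
# Hypothetical per-service latencies in milliseconds. Only the 88 ms total,
# the 70 ms bulk, and the 2 ms warning are mentioned in the episode; the
# service names and the 16 ms remainder are made up for illustration.
latencies_ms = {
    "store_service": 70,
    "service_host": 16,
    "transport": 2,
}


def summarize(latencies):
    """Return the total latency and the services ranked by contribution."""
    total = sum(latencies.values())
    ranked = sorted(latencies.items(), key=lambda kv: kv[1], reverse=True)
    return total, ranked


total, ranked = summarize(latencies_ms)
top_name, top_ms = ranked[0]
top_share = top_ms / total  # fraction of the total owed to the top service

print(f"total={total} ms, largest={top_name} ({top_ms} ms, {top_share:.0%})")
```

A tool that reports only the total would show 88 ms and stop; ranking the parts is what points at the store service first and leaves the small contributors for later.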
So, you can blame it on the applications.
Or the network. Anything but the database.
Well, for me, coming into here and drilling into this to see what’s going on when I get the phone call, this helps me out to be like okay, now I understand the traffic and the flow, what’s being used behind that. So, then I can look at, if there’s any QOS policies or anything that may be there, an ACL that may have been done, look for anything that’s changed, ’cause that’s what always happened, what’s changed.
So that’s what I can start looking for while you guys are more focused and drilling into it.
And there’s another thing that I use these sorts of views for too, is a lot of times when you look at something, you’ll actually see the events that are correlated alongside, that’s actually raised by that. And I know we have done, if you have not noticed, that we have done a million episodes on how to minimize the number of alerts that you are getting, go back to lab.solarwinds.com, look at our archives, because if you are getting alerted to death, they are not working for you. But a lot of times, there are amazing events raised that I don’t see, that I don’t have an alert on, and maybe I don’t really care about most of the time until I’m solving it. So again, this sort of using mapping to then correlate to extra events, like here on the right-hand side, here’s all of my events that are related to these. So, did one of these raise an alert? Could it have? Or should it have? Maybe this is a candidate for one that I actually want to raise an alert on. A lot of times I’ll use that as well so I don’t really have to know until I need to know. So, I use it kind of as a Google for looking up events that are related to the things I’m trying to troubleshoot. Do you want to?
I know I sort of challenged you.
Just yes. Just drill. I’m waiting for you to drill, let’s go.
You want to talk about virtualization?
We can talk about virtualization.
All right, let’s do that. Let’s put you over here, all right.
Okay, so what we’re going to look at now is a mapping specific to virtualization.
I already said you couldn’t do that.
Yeah, you said we couldn’t do that even though you had the two clusters and there was a line between them, that’s okay. What I want to do here is I want to take a moment to just show the power of again, the data that we’re collecting and the mapping we have and so we can see all the related entities. So, what we’ll do is, we’ll start. I am focused on this Hyper-V cluster right here, so I clicked, and now I can see all of the guests that are immediately tied to that particular host. That’s all these entities over here, and I can filter, we can show, I can look at different metrics over here, I can look at Hardware Categories or Interface, Volume, Virtual Machine, or just the full list, I’ve got 57 things to look at here. There’s only one particular alert. Web02, which seems to have a problem, so that guest on this host seems to be an issue.
So that’s an application that’s having a problem with the database that’s running on a VM on this host.
It could be…
Maybe, we don’t know. We just know that there’s something bubbling to the top for that VM guest. If I’m curious about it, I might want to drill into it, but I’m not curious about it because it says Web02 and not SQL. So, we’re going to look at this one that says SQL. I’m going to click on this, and I’m going to get an idea of what is running there. It looks like it’s Linux and MySQL happens to be running. Now I’m not afraid of the letters SQL, even if there’s an m-y in front of it, so…
It’s yours, you want it.
It’s mine, yes. That’s how that works. But I can get an idea of again, the entities related to this, and now I can walk something. And I can see that Web01 tied to MySQL, and now I’ve got this thing way out, that’s just some rogue, I’m sure that’s some dev that’s just loaded something in his cube, it’s just running, he gave it some garbage name and he’s just trying to get the job done for the day.
That never happens.
Right, and now in between those two seems to be the issue, and then again, I can click like you were showing before and drill down. But back to this guy, again the power of understanding what now I can go and say, “Well, what’s really happening here?” Now, if there was an issue, the power of me being able to click and then come to say, let’s look at the application details. Now, there’s nothing wrong necessarily here, but I’m getting a lot of information about what’s happening for that particular database engine.
There is, yeah. I noticed that.
And we’re seeing at the cluster level, right? So, the app looks okay, but something else is going on.
There is something else going on here. There is a warning. Let’s click and see what’s happening.
Back to that cluster. Running VM, again walking through to get an understanding of where that bottleneck really is, what’s happening here, it looks like I have two hosts clustered together and look at that, memory used is 93%. That may be a little too high, I certainly don’t have a lot of room for growth, if an event happens on one or the other. I’m concerned or curious to know why 93% is used on A but only 77 is on B. It doesn’t seem to make a lot of sense right now. But, being database servers, I believe tied to this, so…
But you were probably also getting a recommendation. It was suggesting that you move something.
Yeah, so that’s one of the great things about VMAN in general, is that recommendation. This could be tied to a recommendation all the way back where it would prompt me to say, “By the way, you know you’re having an issue here, and we recommend that you do this.”
So, wait a minute. VMAN is going to actually show you and give you recommendations, kind of like DPA does?
It absolutely does.
It does. And one of the things though about it that’s interesting for me, is that I’m a casual VMotioner, right? I will use it on occasion, but like our shared environment that we have for Lab, right?
We stay in pretty good communication, we’re on Slack, we’re chatting. But the point of that recommendation is, I ought to just be able to do it, but I use mapping a lot of the time to just verify that that’s okay. Because my gut says, you’re recommending that I move this VM from this host to another host. I would assume that recommendation includes some knowledge of the fact that they are at least maybe in the same cluster, and that one of them is not on the East Coast and one of them is on the West Coast. But if I am casually making those changes, I use mappings a lot of time to make sure that that makes sense, that moving that VM is not going to move it suddenly into a different cluster somewhere else, where I’m going to incur even more latency at the expense of improving or optimizing memory.
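The sanity check described here, confirming that a suggested move stays inside the same cluster and location and actually relieves memory pressure, can be sketched as follows. This is a hypothetical illustration and not how VMAN produces or validates recommendations: the `Host` fields, the host names, and the 90% critical threshold are all assumptions, and only the 93% and 77% memory figures come from the demo.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str               # hypothetical host name
    cluster: str            # cluster the host belongs to
    site: str               # physical location, e.g. "east" or "west"
    memory_used_pct: float  # current memory utilization


def move_is_sane(source, target, critical_pct=90.0):
    """Check that a suggested VM move stays local and actually helps.

    critical_pct is an assumed threshold, not a product default.
    """
    same_cluster = source.cluster == target.cluster
    same_site = source.site == target.site
    target_has_headroom = target.memory_used_pct < critical_pct
    relieves_pressure = target.memory_used_pct < source.memory_used_pct
    return same_cluster and same_site and target_has_headroom and relieves_pressure


host_a = Host("host-a", "cluster-1", "east", 93.0)
host_b = Host("host-b", "cluster-1", "east", 77.0)
remote = Host("host-c", "cluster-2", "west", 40.0)

print(move_is_sane(host_a, host_b))  # same cluster and site, target less loaded
print(move_is_sane(host_a, remote))  # lots of free memory, but wrong cluster and coast
```

The second case is the one the map helps catch by eye: the remote host looks attractive on memory alone, but moving there would trade a memory problem for a latency problem.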
And that brings me to the point of switch stacks, with the full redundancy versus half redundancy. That visualization that we have on switch stacks can make a big difference if you’re moving different people around to different buildings as well. Because if it’s not in full duplex and you think you have the redundancy there, you’re going to be busy and it’s going to cost you some downtime.
And so, you asked about recommendations, here’s a big list of them right at the top. Memory utilization for this particular host reached a critical threshold and they suggest moving some VMs around. If you notice, back on that other screen, we were running 10 out of 10 running VMs, 10 out of 11 there, so I’ve lost all my headroom. The reason that B has extra is probably because of the way they’ve been sized, but you notice, again, I’m curious about this. Why 10 and 11? What’s going on? ’Cause to me, the cluster, they should be identical servers.
Pretty close to identical. If I’ve got 11 on one, and 10 on the other, did somebody move something? But then there’s a recommendation saying, by the way, you might want to move, and there could be a recommendation saying there’s something wrong with A that wasn’t at the top of the list. That happened to be a different server, but I wanted to make sure you could see that. We’ll make that recommendation, we’ll say, “You know what? You’re kind of overloaded on this particular one. You should move it.” Also, getting to root cause. Just through the map, drilling through, of course there’s Performance Analyzer as well. Click. Brings me right to PerfStack. So, I’m only a couple of clicks that I can walk that map, that visualization. I get to a node and I click, and up comes this beautiful PerfStack view to give me an understanding of all the pertinent metrics for me figuring out what’s really wrong.
But would you argue this is also a map?
I mean, it’s time-correlated histograms, but it’s based on the same discovery information that AppStack uses, that the mapping engine is using, and when you go out and expand and walk through the browser here to then go and add elements that are a part of a metric, this doesn’t look like topology to me, but this looks like mapping. That the correlation is exposing itself here as a vertical stripe of a bunch of histograms, but it’s still mapping.
I’m going to say no. I’m going to say it’s entity relationships.
But I spent so much time crafting that.
I know. But it’s entity relationship without a doubt, which is something mapping has, essentially. We’re drawing a map to give you a node graph, that’s typically what graph databases do well, nodes and edges. This is giving you those metrics, here are your nodes and edges right here, but we’re giving you I would say a graph more than a map. But I know what you’re saying, and to me, you could make that argument, that this is just another version of the map. You could just say, “You know what? It’s the same data, just displayed differently.” And what’s beautiful about that is this is going to speak to a certain segment of the IT population. There are people out there, you know what, I just need a sparkline.
And then other people in IT, they want the nodes and the edges. And they want to be able to walk through, and hover over, and click through that way. We’re giving different views of the same data so they can be consumed by different people and groups inside of IT. So, this tool has value throughout.
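The "nodes and edges" view of the same data can be sketched as a small graph walk. This is a minimal illustration under stated assumptions, not the Orion mapping engine: the entity names are loosely borrowed from the demo, and the relationships between them are hypothetical.

```python
from collections import deque

# Hypothetical entity relationships, loosely named after the demo.
edges = [
    ("hyperv-cluster", "host-a"),
    ("hyperv-cluster", "host-b"),
    ("host-a", "Web01"),
    ("host-a", "MySQL"),
    ("Web01", "MySQL"),
]


def build_graph(edge_list):
    """Build an undirected adjacency map from (a, b) pairs."""
    graph = {}
    for a, b in edge_list:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def related_entities(graph, start, max_hops):
    """Breadth-first walk returning {entity: hops} within max_hops of start."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not expand past the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen[neighbor] = seen[node] + 1
                queue.append(neighbor)
    del seen[start]
    return seen


graph = build_graph(edges)
near_mysql = related_entities(graph, "MySQL", max_hops=1)
wider = related_entities(graph, "MySQL", max_hops=2)
print(sorted(near_mysql), sorted(wider))
```

Clicking an entity on a map and seeing its related entities is, in this sketch, a one-hop walk; drilling outward is the same walk with a larger hop limit. A sparkline view would read the same graph and simply plot a metric per node instead of drawing the edges.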
And that’s the interpretation of data that we’ve had conversations about ourselves. That as long as you’re able to understand it, that’s the tool, that’s the area that you may need, but I may need something else. So, when you have one tool that’s able to display that in different ways to show that visualization, I mean, that’s what helps bring teams together even though they’re unique individuals.
Or even if it’s not a common tool, it’s just the approach of using the tool in a common way. You can explain to anyone on the team, “Here’s a way of navigating dynamic maps, and it’s going to work the same regardless of the context.” Now, as an aside, and we’ll wrap up this section, but as an aside, I do want to say, this is an example, there’s criticism that SNMP is old and busted, and it just isn’t good for anything and that’s not true. It’s ubiquitous, it’s been around forever and it is in a lot of cases, especially for networking, the only thing you can use. But talking about graph databases, one of the things that’s nice about this is, when you’re looking at virtualization mapping, we don’t have to discern traffic, we don’t have to try to put node IDs together to figure out what the mapping is. This is a lot of math involved in those network maps. But here, you’re going against the APIs for vCenter and for Hyper-V, and so it comes back as a nice dataset. So, one of those things that people always say is, they say, “Oh, SolarWinds Orion is all SNMP.” No. We’re a software company. We like APIs because the data that comes back is so much richer. So, in this case, the only thing that they need to do to be able to get the data that’s actually powering maps, AppStack and PerfStack here is, credentials for…
For vCenter in this case.
So, I hope you don’t mind me stepping on your toes a little bit there. You made mention that you couldn’t find those relationships between those virtual servers, and well, you were just wrong.
I’ve been schooled.
It’s okay. I will tell you, there is one thing I do absolutely need help with and that is distributed applications.
Ah, distributed applications and tracing, yeah, that’s a thing. Okay, well, let me show you a little something, and we’ll talk about it just for a couple minutes and then we’ll wrap this thing up.
Cool. You want to drive?
Okay, not to beat a cloud horse to death, but distributed applications, a lot of you are telling us, this is something that you’re dealing with more now and remember, that is basically, whether you call it monolithic application deconstruction into cloud native primitive services, or whatever you want to call it.
That’s a lot.
Yeah, it’s a lot. It was a lot, I actually was able to say it. It’s basically taking what was a nice, big, unfathomable thing and breaking it into a lot of other unfathomable, smaller things. So, if you are from the cloud world, if you’ve been working with them for a long time, you’re probably using AppOptics. So AppOptics is designed to be able to do, look at infrastructure, in a slightly different way than we typically do with Orion Platform, but definitely to do something which is application tracing. Which is, I don’t really know how these things are connected, so what I’m going to do is inject IDs into the actual flow of transactions from the front-end all the way to the back-end services and back, and then figure out how they’re connected, and then isolate areas where there’s choke points or something else. So, the views that I look at a lot of times, like here’s a set of services. These are sort of at that infrastructure view, and you hear us talk about Application Performance Monitoring, sort of APM instead of Server & Application Monitor, this is kind of what we’re talking about, is infrastructure plus tracing plus events plus a bunch of other things. So, in this case, these are all metrics that I’m getting back from different services. And this is a service, it’s like a hotel booking service. So, I’ve got an administrators panel, I’ve got an analyst services, I’ve got APIs that are maybe interacting with other third parties, bookings, a bunch of other things. So, I’m trying to figure out how these things are connected. Now again, we’re not in Orion here, this is inside of AppOptics, and if I go over here and take a look at the tracing view, this is an example of a single transaction. I’ll save the trouble of drilling down to it, basically I would be on a service, I’d isolate a transaction, I’d go through the heat map, and we’re going to show this actually in a session in THWACKcamp, you’ll want to check it out, of how this actually works. 
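The tracing idea described above, injecting an ID at the front end and carrying it through every layer so the pieces can be stitched back together, can be sketched like this. It is a toy illustration and not the AppOptics agent or its API: the `span` helper, the layer names, and the sleep standing in for a database query are all hypothetical.

```python
import time
import uuid
from contextlib import contextmanager

spans = []  # every finished span from this process ends up here


@contextmanager
def span(trace_id, layer, parent=None):
    """Record how long one layer of a transaction took."""
    record = {"trace_id": trace_id, "layer": layer, "parent": parent}
    start = time.perf_counter()
    try:
        yield record
    finally:
        record["duration_s"] = time.perf_counter() - start
        spans.append(record)


def handle_booking_request():
    """Simulate one web transaction passing through three layers."""
    trace_id = uuid.uuid4().hex  # the ID injected at the front end
    with span(trace_id, "web"):
        with span(trace_id, "bookings-api", parent="web"):
            with span(trace_id, "database", parent="bookings-api"):
                time.sleep(0.01)  # stand-in for a real query
    return trace_id


trace_id = handle_booking_request()
trace = [s for s in spans if s["trace_id"] == trace_id]
layers = [s["layer"] for s in trace]
print(layers)
```

Because every span carries the same trace ID and names its parent, the layers of a single transaction can be reassembled afterward and the slowest one isolated, which is the choke-point hunt described in the conversation.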
What we’re looking at here is a single transaction, I can see the data for it, I can see all the layers, like if we were looking at a regular application stack in a monolithic app, that’s what these are. In this case, these are services that are connected. But just like we were talking about before, figuring out how things are connected and then the PerfStack View that you had a minute ago, my argument that that’s actually a map. Well, if I look across here, these are all like transactions against databases, springs, the traffic that’s running across my rails, each one of these little stripes here is a database query against a MongoDB. The thing I want to call out here before I show you what this looks like and something that we’re kind of experimenting with, you see how I’ve got all these boxes across here that represent each one of those layers?
Well, in the same way that when we were looking at that PerfStack view, these are maps. So, if you think about it, that’s a map of the components that actually represents all the transactions that are happening, all of the dependent calls that actually make up this one single web transaction. So, I would argue, that this is actually a form of mapping because it’s taking what is data that is coming from monitoring or at least receiving the traces that are actually running across, it’s pulling them into this view. Now again, this is AppOptics, this is a cloud service, it’s a SaaS product, it’s running out in the cloud, but you’ve been asking initially for IIS and .NET-based applications to be able to do application tracing inside of the Orion Platform. So, something that we’re experimenting with, and what you want to do is take a look, always go out to THWACK.com, search for what are we working on and there will be details about kind of what we’re thinking about with this product. But definitely check that out. But I’m giving you a little preview here. So, here’s a node that this is an IIS server. So, I would get my regular views on this server, and I can see whether it’s performing well or not. But I’d like to be able to include traces along with everything else. Because I’m going to see performance, I’m going to see traffic maps, I’m going to see all the things that we’ve been talking about before. And if I drill into the mapping over here on the left-hand side, I can look at its physical infrastructure or network, the rest of it. But I’d like to be able to also trace the distributed components of the application that are running on that IIS server. And so, for that, this is what that looks like. And so, over here on the side, you’ll notice this little SolarWinds APM. Now, this is an add-on, this is not built into SAM. And so, what’s happening here, is this looks like the kind of thing that I would normally see. 
I’ve got status codes, I’ve got response time, I’ve got requests per second, I’ve got my methods that are actually being called. But these are actually coming out of data that’s coming from AppOptics, coming from the service in AppOptics to be able to do APM monitoring in addition to infrastructure monitoring. So, you can see here that I’m actually looking at an IIS application pool, so it’s sort of running at that IIS .NET level, and it’s pulled it into that view. Configuring this is really, really easy. I’ll save you the trouble of walking through here, but it’s Settings, All Settings, and then Integration and then run through it, it will take you to this page and then you click on Add. And then there’s a wizard that will walk you through, that will go through the process of installing an agent that it needs and actually connecting and getting the data integrated into the view.
So, we’re really interested in getting your feedback on this. Again, take a look at What Are We Working On in THWACK. We hope to get this into a release fairly soon and this is something that, there are a small number of customers that need this a lot right now. But every time we talk to you guys, whether it’s the chat around SolarWinds Lab, or like, at Cisco Live! this year it was amazing how many people, how many of you came by and said, “Hey yeah, will someone get this thing now, where we’re inheriting containers and we’re starting to actually do a lot of microservices and distributed applications, and what does that look like?” So, this is the beginning of that. We really want to know what you think about it.
Well, I think it looks great, but any limitations for this?
There are a couple limitations. This is IIS and .NET only. You don’t get all the views that you get with TraceView. So, if you are cloud native, if you are doing a lot of distributed applications, then this is actually probably going to be more appropriate for you. And that you can imagine what it will be like to integrate all this into Orion. And then the last thing is, I’ll let you come back over here because you’ll care about this. If you’re a fed customer, or you’re airgapped for security for networking, you don’t allow access out to the cloud. Because this is based on the AppOptics back end, which is a SaaS product, if it can’t reach there, if it can’t send metrics, it can’t send the trace data out to that endpoint, then this just won’t work.
Hopefully, you’ve found this helpful. It is, and it’s not a surprise, how much we all rely on mapping regardless of what we are trying to fix in operations.
That’s really true. And when different teams can use solutions to visualize infrastructure regardless of how it’s connected, it’s just a huge step up on troubleshooting, and it means that it just saves you time.
Yeah, absolutely. I mean, we’ve all been there, with those Visio diagrams that seem to be outdated the minute they even get published. When you’re tracking issues in your realm, having the ability to strategically visualize and then dive into and look for possible root causes to help you resolve issues even before they happen, well, that’s practically magic.
I couldn’t agree more.
Yeah, and of course, if you have questions, we’re here live, so just throw them into the chat box over here to the right, and if that chat box isn’t there, it’s ’cause you’re not with us live, so visit us on our home page. That’s lab.solarwinds.com. Get the schedule, and set a reminder so you’re with us live next time.
Right. ’Cause we really want to hear from you guys. All right, well, that’s all the time we have today. I’m Destiny Bertucci.
I’m Thomas LaRock.
And I’m Patrick Hubbard. And thanks again for watching SolarWinds Lab.