IT pros are discovering a smarter way to manage databases. In this episode, Head Geeks™ Patrick Hubbard and Thomas LaRock are joined by special guest, Distinguished Engineer Karlo Zatylny, to talk about the latest release of SolarWinds® Database Performance Analyzer, which includes anomaly detection powered by machine learning.
No, no, no, no, we need to fix this one first.
No, we oughta fix that one first.
No, no, we need to fix this one first because in the lifetime of hunches I’m usually right.
That’s often true, but this is the one that’s generating emails from the VPs.
Look, this alert is more unusual, right, so we have to look at this one because it could be something really weird.
It happens all the time. You just ignore it.
Hey, what you guys arguing about?
Well, we’ve got a critical issue here and we’re kind of disagreeing about where to start.
Yeah, noise can be overwhelming. I have a suggestion though, how about we use some data science to help us locate the signal through the noise?
You had me at data.
So, when you talk about data, what do you mean, Karlo?
Well monitoring systems collect a wealth of data, but you can actually use mathematics to help humans make sense of where to begin.
Ah math, and maybe some machine learning. And so we should assume anywhere there’s machine learning, and math, that’s there’s also probably APIs that we could use to go get the data.
You promise to wait till the end to talk about APIs?
OK, let’s get started. I’m Thomas LaRock.
I’m Karlo Zatylny.
And I’m Patrick Hubbard, and welcome back to another episode of SolarWinds Lab.
OK, so a long, long time ago in a cubicle far, far away, I built a thing.
Was it any good?
Well I thought so, yes. It was an alert for long-running SQL Server agent jobs.
What did you define as “long-running?”
That’s a great question. So typical monitoring systems at the time, they used a default of 120 seconds as long, but that didn’t work for me or my shop because, you know, for example, all of my database backup jobs would’ve been longer than 120 seconds, that’s just two minutes, so I needed to build something better, smarter.
So, this is where you’re going to talk about your mathematics degree?
Two degrees, thanks, and on top of that some extra Six Sigma training just to be sure, but yeah, I decided to build something smarter. So, I took the job, I took the data for job duration which exists inside the msdb database, and I built an alert that would notify me if a job was still running and if it was more than two sigmas beyond its 90-day rolling average.
Ah, that would make a smarter alert. You wanted to identify anomalies, not just something that was running “long.”
Ah, so you found a way to get the signal through all the alerting noise?
Yeah, and it was cutting-edge at the time, nobody was doing this, and that code actually made its way into DPA 2. If you go into the alerts, if you go into the alerts, in the administration alerts, you will find long-running SQL agent jobs, that’s essentially the code I was using. But that method, as good as it is, you know, it doesn’t take into account something like seasonalities.
I was going to ask you about that. I was wondering how you knew how to do that at all or if you’re filtering.
And Karlo, what do you mean by filtering?
Well, let me show you. So, algorithms are the foundation of machine learning, so today we’re going to explain DPA and its algorithms that it’s using for anomaly detection.
Karlo, that had no hype. It was just, algorithms are the key; I’m going to show them to you.
Yes, it’s available now! It’s out now! It’s the latest release! It has machine, OK enough, yes.
Sure. So, let’s start drawing some information for you guys to analyze. So in DPA, what we found is that most of our customers have data and that data is a squiggly line. And that squiggly line shows what your wait time and other data is doing. So, with our double-seasonal algorithm we’re breaking down that data into a variety of components that allows us to predict the future.
I’m sorry, double seasonal, explain that.
All right, double seasonal, so we’re taking your raw data and we’re breaking it down into components. The first component that we’re breaking it down into is the general trend, so how is your data going? Is it trending up, trending down, it’s flat. Also, we found that by analyzing a lot of our customer data, and our own internal data, that there’s a couple of seasonal components in addition to the trend. So, we also have what we’re calling the weekly trend. So generally, you start off low on Sunday and then over the week it’s a little bit higher while everybody’s at work, and then everybody goes home, and Saturday it’s low again. So we end up with this weekly trend as well. In addition to that we find that everybody tends to have daily trends. So, you run your nightly backup at night where you have a big spike and then it comes down during the morning and then it spikes up again while you have users in the office and back down when they go home. So, we have all of these different components of your data that allows us to understand how your users are using your system and ultimately what we can use to predict the future. So with these different components, we’re then able to draw a component where we’re able to look at your future data and add these components together to say, all right, what am I going to be doing at Friday, on Friday at 3:00 p.m., or Saturday at 5:00 p.m., what is my normal data look like? And with that prediction, and with some general statistical analysis, we’re able to put some bands around that as well where we have this range of what’s normal and what we would consider anomalous.
So statistical analysis, any particular formula, because this doesn’t look like k-means clustering, so what–
What are we really doing here?
An interesting thing that happens when you take raw data and you start subtracting these things like overall trend and seasonal components, is that you end up with data that looks like your old school bell curve, if you remember Gauss, and like that. So we end up with this normal distribution of the data so that we’re able to give just normal standard deviation and mean statistical analysis to the data which really helps us set standards for how many standard deviations away from the prediction could we expect your data range to be in. But, you’re saying, wait a minute Karlo, that’s not machine learning.
That’s not machine learning.
That’s not machine learning.
Well statistics, it’s a start.
Statistics, it’s the start, but we need more algorithms because we need the machines to learn. So in DPA, what we’ve done is we’ve sent out some students to start learning from your data as DPA is out there analyzing the wait times of your different databases, it’s going to look at its predictions and it’s going to say, my prediction is that spot right there, and all right, that’s my prediction. But what was your actual real data coming in as? Was your real data above that, or was your real data below that? And so, we’re going to start rating ourselves on, all right, how good was my prediction and what was this range between my prediction and what the data actually was? And so, the machine will say, oh I had a good prediction, or I had a bad prediction. On top of that, there’s other algorithms in the back that are making different predictions. So we have other algorithms that might make predictions higher or lower than the double-seasonal algorithm is doing, and they get rated as well as to how good they are compared to the actual. So if they’re more accurate, or less accurate, the machine will be able to use those different inputs and start to choose which of the algorithms are making the best predictions and then start to switch which of those predictions are doing it best and actually start to use the better predictions, and that is machine learning.
I like that analogy of students, right? Because we think always in terms of data, data, data, but in this case it’s the agent, the algorithm is a student that’s actually analyzing, so it’s that cooperation of the student plus the data.
And this is happening in the background right now for the latest version of DPA, the students are actively doing this learning right now, but the switching between algorithms isn’t there yet. We’re just–
We’re just gathering the data, we’re doing the math.
Yeah, we’re doing the math right now. We want to understand the confidence of what we’re going to actually be releasing in the future and so right now we just have the students out there learning. We’re also gathering data through the SolarWinds Improvement Program in order to understand what our customers are seeing. So we’re going to be doing some learning here at SolarWinds headquarters, to understand how our algorithms are doing and how are we going to be able to make them better in the future.
So the stuff on the left side of the screen here, this is really what goes into the anomaly detection part, is that correct?
OK, so and that’s, the last thing you’re drawing, the Gauss, you’re, I think what we have for define that something, if it’s critical, if it’s more than three sigmas away?
Correct, yeah, today we’re doing two sigmas away for warning, and three sigmas away for critical.
So three sigmas, that’s 98%, so if you’re within six sigmas of the standard deviation then you’re, basically we’re saying, 98% of everything should fall in there. So if you’re outside of that, that’s an outlier, that’s the anomaly.
And we’re able to make that confident because of the methodology that we’re using, and it’s got some mathematical soundness behind it.
Well what I like is, you know, the first thing you had over here, was basic linear regression. Am I trending up or down over time, right? That’s everybody’s that’s done capacity planning, that’s always the question. But what I like is you had this drawing here of, this was for the week?
For the weekly, yeah.
And this was for the daily.
Because a lot of tools end up giving you just that view of, not just, not just a tool, a database tool, but I’m just saying in general. Like whenever I had to talk to somebody that ran a SAN it was always an average, it was always well here’s this and this is what it looks like for a long period of time. I’m like, no, I’m having trouble at this moment, and they’re like, well, nothing’s showing up in our reports, because it’s all been smoothed out.
So they’re never getting that level of granularity that a database tool will give you. So this is almost a blend of, we have some granularity going on, and we also have some smoothing. And we’re getting a better idea of, is this really a problem?
Right, and that was key with working with our Data Science Team, was us looking at all of the different algorithms that are available for today, and today’s machine-learning realm, and saying, all right, which of these algorithms is best fitting our customer’s data? So, there are algorithms that are just general smoothing, or using one level of seasonality for prediction, but by getting the overall trend and two levels of seasonality, we’re able to make the best predictions. So in our testing, this algorithm ended up being the best in being able to predict the most datasets’ future. Now there’s definitely other sets of data that will not be able to be predicted accurately, but because of our testing in-house we found that this algorithm is actually one of the better ones for doing that.
So what I love about this, getting into DPA mostly, is especially if you’ve integrated with Orion, right, this will show up as a, in PerfStack, it’ll be there, and it’ll be an anomaly. So if you’re a network admin, or just a server admin, but you’re using Orion in general, you’re going to be able to answer that question, is this really a problem? You’re going to have it on the dashboard and you’re going to know, yeah, I really need to tell the DBA to do some work this time instead of just thinking you should blame the database, because most people do. You’ll know if it’s truly a problem to look at because it’s an anomaly, or if it’s just, performance has always been horrible. You’re going to have that insight like that, instead of just hoping the DBA is going to finally fix it. Like you’ll be able to say, no, no, it’s truly an anomaly, you should go to work now.
And not to oversimplify it, but part of it too is, you know, it’s to solve that problem of creating these ever more specific alert conditions, and thresholds, and the rest of it, is that these are learning thresholds that are also able to project into the future. So it’s, it’s not, it’s a different approach, but it’s a little bit analogous to NetPath, right, where it would be impossible to set thresholds on the internet so the data science there is calculating what is red, or yellow, or what is normal, right?
So again, it’s to get away from that, that expectation that you’re going to create these rules with ever greater levels of specificity that are going to finally figure out what the problem is.
Yeah, and that leads to a great point that we’re moving this into DPA first because we found, through our own research, that the database is such a critical component of many applications. We are running computers, we are running networks, just so that we can run these applications and the database is that standard application that everybody has that when an anomaly happens there it tends to flow upstream. The website is now slow because of the database and all of these other components are slow if the database is suffering. So by putting anomaly detection in DPA first, we’re actually setting the frame for our ability to bump that out into the other products so that we’re able to have anomaly detection across the board and give you the full power of machine learning throughout all of our suite of products.
So what I love about all that is a lot of times you hear companies, machine learning, AI, there’s a lot of hype. And when you’re–
Just a little bit, teeny.
Just a little bit, but when you start asking questions about, AI can be nothing more than, if then else, statement, because technically that fits the definition of what AI is. So when you hear companies say AI, machine learning technology, sometimes it’s just rules-based recommendations, it’s not really, we’re using math. We have real math going on here, we have actual data science happening, and if and when this gets into a broader set of products, we’re going to get that signal through the noise to you. And I am just so excited about all of this.
Wasn’t part of that, I mean, it’s the answer—it’s the unintended answer to the unexpected question, which is step one, get all of your data, collect all of the data.
Then data science is the first answer to, OK, we have too much data and it’s impossible to figure out what’s actually meaningful here. And then once you can actually get that into a form that can be digested, that machine learning is a technique that’s going to be applied on top of that.
Right, and the advantage of machine learning and AI, again, is that it builds on top of each other. So once you have one algorithm, its output can then be used by another algorithm and you allow the machine to make those choices of which algorithms it’s going to use to build on top of each other, and then you have this system that is tailored to your specific environment, to your specific needs, because the machine is actually making those decisions for us based off of the mathematical principles that we’re bringing.
Do you find that it’s able, that the breadth of the data that’s collected actually really contributes to the ability for machine learning to find unexpected insight?
Absolutely, so the more data that we have, the better able we’re to, we’re able to look better for those anomalies and for those patterns, and so the more data that we’re able to look at the better we’re going to be able to be at finding that needle in the haystack, or finding the root cause. So the direction is absolutely for us to go to the world where we’re looking at all of your data and just ultimately finding, you know, what happened first, what was the key change within your environment that actually caused this trigger of poor performance, or loss of connection, or whatever.
That really odd root cause that you would have never expected because it’s the relationship of the database, virtualization, and storage.
So the power here is the simplicity as well. Because you’re talking about all the data being collected, data is the fuel for this machine and the machine learning. So you want the right data, this is the important part, because a lot of times people will just collect everything, collect all the metrics, surely there’s some insights in all of this. But the thing is, if you’re not collecting the data to help answer the question, in this case I need to know if my wait-response time is an anomaly or not. OK, do we have the data for all the waits? Well yes, we do.
What, you care about wait?
So we have the fuel for this machine, that’s not always the case either. There are times where you’ll look and say, look you got way too much data and you’re actually not getting the answers you need out of this. It’s almost too many variables for you.
So this, a database wait set, it’s like the perfect machine-learning experiment, if you will, that you want to do because you know what data to collect, you know what data to analyze, you know what data to feed into the machine in order to make things a little bit better. So it’s not just this, hey, you know, throw everything under, into here, and get an answer out of it, it’s a very specific thing.
So it’s sort of, why gravity, right?
Like in this case of something that’s really hard to diagnose, lots of why, why, why what, lots of theories and trying to actually sort that out, you’re saying that that’s what should attract machine learning to where there are lots and lots of why’s and lots of disagreement about what the cause could be.
Yeah, and ultimately, we’re able to use the machine learning to help you find that useless information as well. So the machines are actually looking at your vast amounts of data and saying, actually this one’s not useful. So we have–
Yeah, there’s no correlation.
This one isn’t as relevant.
So we have this idea of golden metrics that you may have heard of, these are the important things to look at. We at SolarWinds have that experience, we’ve been monitoring for years, we know what things to look at. You know, we want to look at things when they’re slow, what makes things slow? We know those key performance counters that let us know that information, so that’s where we’re focusing our machine learning on as well as that we’re able to filter through all of that vast amounts of data and only focus on the key points like database wait time.
So I think the way I’ve been doing that is hyperparameters, is that similar to what we’re, is that the method that we’re doing for that, or?
Yeah, hyperparameters and just being able to do comparative analysis on the different time series, so we’re actually–
Able to do deltas and differences between different times series to say, how far apart are these different time series from like a Cartesian sort of standpoint, or a Pearson coefficient standpoint? And with that, we’re able to say all right, these ones are significant, these ones are insignificant, so you take the machine learning on top of the expertise that we bring as monitoring experts, and that combination is really what will drive us towards the future of even automated responses where we’re able to learn that this configuration change caused the problem within your environment, we see anomalies after that configuration change, do you know what you should do, you should go back and change that configuration to something, or Joe Bob, he made the configuration change you should go ask him why he made that change.
Because Joe Bob is the anomaly?
Yes, often you have an anomalous employee within your system.
Think of the world you’re in, though. What you’re talking about now is a world of desired-state configuration as well on top of this where you’re saying, oh you made this change and I can tell it was a bad thing, let me go back. Cause right now desired state is more like, oh, I just see you’re not in the right configuration thing. Imagine if, I know you’re not the same, but you might be better, or if you get worse, I’ll roll you back. This is, that’s, you’re going to change how we do IT operations.
Yep, absolutely, that’s the goal.
And Karlo, you said something interesting, which was, you’ve looked at a lot of customer data and, you know, many of you are regular viewers of SolarWinds Lab are also heavily involved in THWACK, our user community, and our UX programs, and RC programs. So you did work with a number of customers with a lot of different types of data, right?
Yeah, fortunately our customers at SolarWinds are really friendly and they’re able to share anonymous data and we’re able to take that anonymous data and run our different data science and machine-learning algorithms again, so a big thanks to our users out there for sharing your data with us. You have made the machine learning in DPA, and other products, much, much better because of it.
Oh, hey, before I forget, and I love just looking at graphs and dots and lines and all that, some of the lines are straight–
And some are, and they’re orange, this is all great. But you know what we really should show them is how this looks inside of Orion, we should show them PerfStack and DPA.
because it’s really complicated and it’s really hard to understand so.
No. It’s not that hard.
Even you’ll understand it. OK, so what we have, what Karlo has in front of him here is Orion and we’ve integrated this particular database server, right, using DPA IM, right, we’ve integrated DPA and Orion together. And so now you’re going to get this lovely new Wait Time Status inside of your PerfStack. So if you’re building a nice PerfStack Dashboard report and now you’re wondering, hey, tell me something about the actual database server here. I want to know if there’s an anomaly with regard to the Wait Time Status?
Yeah, and that’s–
The beauty of it.
No problem. So we tried to make this, per Patrick’s comment earlier, as complex as possible. So what we have here is the different statuses of the database as it goes throughout time and here you see there was a red status and a yellow status. And these are the anomalies that we’re detecting. These are the key indicators that you’ll just be able to come in, look at this chart and say, where’s it’s yellow this is where we’re at that two-sigma threshold of warning, and where’s it’s red that’s where we have detected a critical anomaly that we believe is atypical for your environment.
So there can be some anomalies sprinkled all along the green line, but these are the particular ones that the machine learning has identified are important?
Exactly. And so what this allows you to do is quickly find a point and time on the PerfStack chart, of what an anomaly, or when an anomaly happened, and then ultimately what other, what other things were going on within your environment.
So normal, if you look at this top, the Total Instance Wait Time, that you’re looking at on the top right, you see you got bigger bars, and usually tell people, big bar means bad, right so, but here’s the thing, I’ve got an anomaly, I’ve got something red on one of the shorter bars. Something was different at that point in time, but if you look at the overall trend, it would be hidden.
Yep, and that’s the beauty of the machine learning is that it’s learned that low is normal for that point in time, of that part of your season, or that part of your day, or that part of your week, and because it was expecting low, it sees high, that’s anomalous. So an anomaly might be triggered on a bar that’s not as high as the rest of them, but because it’s strange for that particular time, that’s what makes it the anomaly.
Now how does it look in DPA?
Excellent question. Let’s jump over to DPA and take a look at some of the great charts within Database Performance Analyzer that show you the anomalies as well. So here you see a classic DPA Wait Time Chart that you are all familiar with, with the different queries and how much wait time we’re seeing. And we’ve added below a new chart that shows you what those anomalies are and when they’re happening. So zero in this chart is basically what we’re predicting. We’re predicting a specific value for this particular point in time. And then the green is how much above or below you are for that particular time and if it goes too high above, or even potentially too far below, that’s an anomaly for your point in time. So here you see within this particular bar green ends at this point, and yellow is then a small portion of that, so we have a fairly small standard deviation, and then anything above that is going to be red. And that width changes throughout time. So you see that each of these yellow and red bars can happen at different levels depending on the time of day.
But this is showing me an aggregate, so this is all the queries. So basically out of 14, 15 hours of wait, you know, 10 of it was normal, an hour and 42 was in the warning range, and three, you had three hours’ worth of queries that were anomalous.
So that could be one query, it could be a hundred queries.
And that’s what’s going to point you to dive into the detailed data and actually find out what was going on at that point in time. Was it one particular query? Were you running hundreds of little queries that added up to your wait time, and that’s the finger pointing that we’re trying to do here.
So there’s a ton of deviation data here. So why not use whisker charts, for example?
That’s a great question. So we tested a bunch of different user-experience charts with lots of the users out there in SolarWinds land, and this was the feedback that we got, that this was the easiest to understand. That you come in, you look at this chart, there’s red, that means it’s bad, there’s yellow, that means it’s almost bad, green is good. So this chart actually tested best with our users.
And I notice you have, this green one way low, but it’s not red, it’s not, I don’t think we do critical for things that are really low.
Correct, for this version we decided that we’re going to allow everything that is too low to just be green. That when your database is performing OK and there’s less wait time, that’s not as critical as higher wait time.
So it’s definitely in our future where we could start showing too low of predictions are anomalies, but today we are showing only high thresholds as the true anomaly.
No one has ever complained that anomalies are low–
So that’s OK.
Nobody, nobody’s ever complained that the database or the website is too fast.
I’m going to move off it just for a second just to show, but you see on that day it looks like maybe somebody turned the server off. So, yeah that could be, that’s a different anomaly, but this is, this is where were heading with the tools is the understanding of what’s really causing it. So in the future we might see this and then be smart enough to say, hey, something’s really different here, was the server even on? And we have all the data through all the different products so we’re going to be able to get to that point of giving you that intelligence.
When the first time I saw this I was almost, I don’t want to say disappointed, but, I’m, you know, usually deviation and kind of calculation methods, and the algorithm that’s producing the data are presented as part of the chart. And at first I was like, well this is just green, yellow, red, it’s to draw my attention to there’s a problem, and I thought about it, I’m like, no, if you put all of the deviation data, and everything else up here, and then you use that data to figure out it’s an anomaly, that’s not machine learning, that’s human learning.
So the whole point is that it should be able to produce, at a glance, it is telling, it’s doing the work of telling you that that thing is an unusual circumstance that you need to pay attention to.
Remember what I always say, DPA gives you charts that even managers can understand.
We like managers.
I’ve yet to have anybody in a meeting just say, I don’t understand what the bar chart is communicating to me. So what you said is true, people, especially data-science people, they want the formulas, they want all the math and all that, but this isn’t built for necessarily a data scientist to use, this is built for a network administrator to say, I really need to call somebody, or it’s always been that way, or why is the server off?
Or I just want to solve it and go on to the next.
I just want to solve it and I have to move on to the next one, right, it’s bar charts.
OK, there is clearly a ton of data here and it would be nice to be able to get at it, right? And DPA and the Orion Platform have had APIs for a really long time so what I’d like to show is how to actually combine this data into something that’s useful, and to use some automation to be able to get this kind of data automatically from DPA maybe if you weren’t already monitoring it.
Does that sound cool?
Yeah, take over the driver seat.
OK, so I’m tempted to call this segment, if it had a title, Two APIs And A Command Line Tool.
Oh, maybe not.
OK, or maybe not, all right, but, so here we’ll sort of set this up. Here I’m inside of SAM and I’m using the AppInsight for SQL and I’m monitoring a SQL Server, right?
And I’ve got all of my metrics that are coming back.
And you can tell that some of these detailed, some of these metrics have a lot more detail, and mostly you can tell that it’s integrated with DPA because over here on the left side it says, DB Performance monitor, right?
So it knows about it, it’s passing data back and forth, and then over here I’ve got the detailed use. If I had I drilled into the native DPA console I would see the detailed information and the main thing that I’m going to be trying to solve is a wait time issue, right?
So here’s the story, let’s just say that you have 5,000 databases.
I have 5.000 databases.
I have 5,001.
Or some number n of databases that you would like to monitor occasionally with DPA, but in general, as long as they’re behaving, if you’re not messing with them and their resources aren’t changing, you can kind of leave them alone and just let SAM monitor them, right? So maybe you have a developer who’s working on it, or you’re transitioning that database from one place to another, or pushing it to the cloud, where you need to have monitoring. So the story here is, I’m monitoring a series of nodes inside of SAM and one of the things that I’m not doing is DPA monitoring on all of them, right? because I’m trying to kind of manage the number, the number, the number of instances that I’m watching with DPA. So that machine that we were looking at here, the one that’s expanded on this, when, for our queue box, is just a regular machine. But I’ve got this other one out here, this WIN-99 queue, and I really, I know SQL Server is running on that box, and I’m monitoring it, but let’s just say maybe I go into 90% CPU utilization for an hour. What I’d really like to do is start monitoring that in depth with DPA, because I really do want to figure out what the problem is, because the database can give that to me. So there’s two different APIs here. So one, I mean we’ve talked many, many times about the SolarWinds SDK, and working with SWQL and SWIS, but DPA also has an API underneath it which can do magical things. So here’s the story, we’ll set this up, so we have a, we’ll say if a server that is running, that we’re monitoring, is running SQL Server, and it pegs out for 90% for more than an hour, I want to automatically apply, add that instance to DPA. So I’m going to do that with the database. I’m going to do that with database, I’m going to do that with the API. So the first thing to note is that when you are inside of DPA it’s a little bit different. Remember we have, for the Orion Platform, it’s SWQL Studio, there are thousands and thousands of elements on hundreds of different objects that you can work with, or dozens, depending on how many modules you have installed. But the point there is that Orion, the Orion Platform is pulling all those things together. It is the thing that is the glue that allows you to connect all of those disparate data sources. But in the case of a database the API can be a lot simpler because everything is inside the database for the most part. The database knows what’s going on, but it’s pretty much a black box compared to everything that you’re separately monitoring across your environment with the Orion Platform which makes it easier to get at it. So the trick though is how do I glue those two things together, and how do I do it in a way where I’m not just, every time an alert fires, going and applying DPA monitors to everything. So here’s how we do that. So the first thing is, you’ve seen SWQL Studio, I’m not going to cover that again, and I’m not going to, this is not a programming class by any means, and so I’m going to try to get through this really quickly. I just want to show you what’s possible. So I’m looking here inside of DPA, so instead of using a SWQL Studio, I’m going to go to Options. And the first thing you’re going to notice down here, well, I’m going to scroll down to it is, Management API Documentation. So let’s click on Management API Documentation and we go to Swagger.
What is Swagger?
You’ve got some swagger.
You’ve got swaggers now.
Swagger is a miraculous self-documenting API technology, well, not completely self-documenting, but for you, it means that you’re not looking at documents back and forth and you can run things in real-time. So the cool thing about this is it can do things like this Access Token API thing, this is OAuth, and then there’s a Management API, which is what we’re going to be using here in a second. So I’ll show you how you can walk through and see how this works instead of, you’re not using SWQL Studio, right? So the, if I’m going to click on the Management API here and you can see that there’s sort of five main categories of calls that you can make. You can get data about annotations, you can manage the monitored database instances, you can register new databases, and then there’s API logging as well. And the ones that we’re going to focus on is actually Database Registration, and Database Monitoring.
So the way that this works is it’s based on a token, because with the SWIS API, it’s a username and a password, in this case it’s a token and it’s using OAuth. So the first step is I need an access token.
So remember the DPA’s API is based on a refresh token so that’s something that you get out of the UI or you ask an administrator to give you and then you can access it. So I’m going to expand this first, rest API, so essentially, it’s saying, Access Token is in service, I have several different things I can do on it, in this case it’s a POST, right? So I’m going to say, Try it out. And then it’s going to ask me for my refresh token, which we just grabbed from, the administrator sent it to me. And I’m going to say, Execute. Here’s the Curl that executed to go get that, the URL, so that I can confirm it, and then here’s my access token. And so the first thing to note is that the API is going to work for you just like it is in Swagger. So whatever you do here in terms of testing will also work when you implement it regardless of the language that you’re using for that script.
Now, I’m sorry, there’s a timeout there, 899, that’s default, 900 seconds for every access token, or?
It is never going to expire, well let me rephrase it, that’s a great question. So there’s two elements, there’s the refresh token which you can set to expire, or not expire–
If you’re concerned that somebody’s using it, you can kill the refresh tokens and then they’ll have to get another one. But in this case, that’s the default I think for–
The access token.
So the access token has a lifespan?
It has a lifespan.
The refresh token may not, but the access token will?
And that’s not a part of DPA, that’s just OAuth.
Right. So I’ve got my, I’ve got my token and it’s going to automatically save that for me.
So then the next thing is we’re going to come to our specification for the Management API, this is the part that’s being provided by the DPA REST. And for Database Monitor, for example, if I wanted to get information about all the database instances, the information that’s available is a part of that, a part of that API. So then how do I use it? So there’s a couple of great examples that are, that are out there, and this one actually comes from a THWACK user, right? So PowerShell of course, our friend, so let’s take a look at a couple of these. This first one here, this dpa-api, that’s going to go and get information about the databases that it’s monitoring, right?
For that instance that’s right.
OK, so we’ll just execute it and see what that looks like.
LS will not execute it.
That’s true that wouldn’t. So let’s try again and actually execute that first one and you can see here that the first thing that it did was it used the access token provided. So we’ve got that stored somewhere, hopefully in a secure place. We use the access token, the refresh token, you get the access token and then the script went out and hit the API, the same one that we saw, the same URLs back in Swagger, and then here’s all the databases that are running, right? That it’s monitoring and then if there’s information about them. But now you were talking about annotations the other day.
And that one is much more interesting. So let’s take a look at this annotations one here.
Yeah, so annotation was this great feature that we added to DPA years ago where essentially if there was a spike in performance, a change, you’d want to say, oh yeah, but, you know, we pushed something to test then, or whatever, you just want to make, you want to make an annotation inside DPA so somebody knows what the spike was really all about. It’s basically writing a note to future you, right?
That’s how I always put it. So that’s, annotations though is some of the information we can get from the new API.
That’s right, so that was added in this latest release. So here’s the script that’s actually going to go out and do the same thing, and look at that.
Right, so if you were wondering, you know, when did I do that, I know I had an annotation can I just quickly see all that, yes.
Um-hum and pull it into another system.
So let’s combine these all into one single thing. So remember, Swagger’s your friend, you’re not looking at a separate tool and documentation, it’s all together. It’s especially handy if you’re running this in Azure for example, or an iCloud instance. So the way that we’re going to do this is, I’m going to fire an alert and I’m going to call something that’s going to do something magical, right?
But I want to make sure that one, am I already monitoring it, is it already monitored by DPA. Two, does it have SQL Server on the box at all, and then if it does is that already monitored by an instance once we know some things about it, like its IP address. Like we should probably check with both DPA and Orion if we’re already monitoring it. Then if not, then we’re going to apply monitoring, sort of make sure that it’s safe, right?
OK, I’m with you.
So we’re going to tie all those things together. So we’re going to start with an alert because that’s what’s actually going to tell us that we’ve reached a certain CPU threshold. And we’ll just use the Summary view, I love the Summary view; it’s just so useful.
It lets you know.
It lets you know. So this was the Auto Deep Monitor SQL with DPA, right, so it’s going to, I’m just kind of going through the description, the rest of it, it’s going to run once a minute and it’s got a Trigger Condition, which is if the CPU load is greater than 90%, and the Trigger Action is going to be, run this safe-add dpa executable. That’s not a script, and who wrote that?
But it’s safe, it says it right in the name.
It is, but it’s also Go, so we’re going to show you how that works here in a second. But if you click on what this, on how this is configured, and we are going to go down here into the Edit action, you can that it’s executing an external program, I’m not automatically resetting it and, because if that has happened you’d want to know and be able to clear that yourself. So it’s going to execute that executable and then I’m passing in a command line parameter, which is NodeID and I’m using the Variable Insert to just add that. So that’s a SWIS query that’s getting the NodeID and plopping it in there and then it’s executing that executable. And it means, do you see passwords? Usernames?
It’s probably in the script, right?
It’s not in the script because that would be naughty. It is in the code.
So this is actually Go and before that, let’s go back in that script we saw before, the one that goes out and gets, this is the PowerShell script that we, that we executed first, right, to go and get the database instances that were being monitored. This one, wow, that really actually does have the refresh token in it.
It kind of does have that refresh, that’s a hard-coded SA password.
It came right off of THWACK.
So thanks appreciate that–
That’s good, so here you can see the sections, right, and so again, PowerShell, we’re going out, we’re getting the API token that we need.
Then using that, we pass that in, then execute the query, to get the database that’s monitored information, that then calls this invoke method and that returns the data. And one thing that the author did here that was really handy is they put these little snippets of what the data, what the data should look like.
What it should look like, yeah.
That is just really handy and very, very good form.
Comments are always valuable.
Right, but so we’re following that same form, right, we’re going to get the token and we’re going to go out and get the data. It’s a little bit different in our Go program, because we’re going to do a couple of things, right? So the first thing we’re going to do is we’re going to say, is, go get all the instances of DPA, and then do we have a related instance, because we’re passing this again, again from the alert that Orion is calling, right? So we’re passing in the alert ID and whether or not it’s a Debug. I definitely recommend wherever possible that you use in this case flags for Go, used named parameters for any other scripting language in PowerShell because, you know, parameters get mixed up and you won’t remember what works and what didn’t later on. So it’s for nodeID, does it have a DPA instance? Because we can ask Orion that because it has the relationship. You remember before, the little icon with a database button on the left-hand side. So if it’s already being monitored we don’t want to add monitoring, right?
So then the second thing is does it have SQL Server running on it? So for that we’re going to do, we’re going to get a process list, didn’t know you could do that, did you? And then after that, if we find SQL Server, then we’ll go and ask, hey based on this IP address are you sure you’re not monitoring? And then if not, add the instance. So what that looks is, I’ll run through this a couple of times. But the magic between this, the bridge between the two, is that process list. We need to know whether it’s actually running on the machine or not. So rather than creating another script, you know, inside of Orion there are all of these executables and other things that are deep down in the Program Files Directory. Well, there’s one in here that’s kind of handy. And so this is Program Files, SolarWinds Orion, APM, is RealTime Process Poller. Now I suspect, you can tell me, is this the back-end of the pop-up RealTimeProcess viewer on the front page, it may well be. But you can use it from the command line too and it’ll go out and pull all the processes off the box. So here’s how that works. The code actually executes, go get the list of processes that are running, and then it runs through them, it’ll iterate through that, and if it finds SQL Server, then it’s going to return true. If it doesn’t, it won’t. All right so we’ll come back over here to our main, I’m going to go ahead and start this. I’m going to use, I’ll use the Debugger and then we can just walk through it. And we’ll run this and see what happens in our, in our console. All right, so we went out, we got the bearer token that we came back for, right? We went and, this is, this is a SWQL query.
It’s running on the Orion API, and then we looked through it and it says, ah, yes, look at that, we do in fact, for this server, have a known, a known instance of DPA and we got that because the SWQL query lets us do that. Remember we have, we have those dots, those dot-joined values so you don’t have to do joins, so this is, I’m selecting from Orion.DPA.DatabaseInstance, but I’m getting the ApplicationReference.Application.Node.NodeID, right?
And again this is documented inside the Orion API. So this is, so that’s what Orion gives us. So then we’re going to go out and we’re going to get our process list. So this time we’ll run it again, but we’re going to do it for this other node. So we’re going to run this again out of our Debugger. You can see that we’re getting the process list, so it didn’t find it, so now it’s actually using that command line argument to go get the list of processes. Here’s a list of all the processes that are running, and sure, SQL Server is running on that node, right? So again, I’m just reusing the get token every time because if you occasionally use it it’s easier just to go get it unless you’re doing it all the time, because you don’t have to worry about it expiring. So now I’m asking DPA, hey are you monitoring it? Here’s all the data that came back from DPA about monitored instance. And it says, oh, you know what, you’re right, that’s not a server that DPA knows anything about. So I’ve determined that SQL Server is running on the box using the Orion API. I’ve figured out that it is definitely not being monitored individually by DPA, and then it went out and then added the monitoring instance by essentially setting a set of parameters in this, this DPA class, and it’s passing in a couple of things. A lot of these for SQL Server are ignored, so I’ve kind of trimmed it down. So this one is basically saying I’m adding a SQL Server, the hostname is being passed in as a part of that, we collected through that API. And then you asked before about credentials, I’ve just got a little class here, the secret class which knows how to go out and get secrets that are stored on that machine so that they’re encrypted, and nobody can get access to them. So that’s about it for the API. You know it’s kind of interesting to think about the driver behind the decision to release this in SolarWinds products now and then the difference between what IT executives expect, or maybe just believe, and what technologists actually need.
Yeah right, I was surprised at the reaction when we demoed this at Tech Field Day, one of the delegates tweeted, SolarWinds was the first vendor he’d seen that actually explained how machine learning relates to operations, rather than being a marketing term.
But really, is that a surprise?
Don’t talk about the IT Trends Report.
The IT Trends Report 2019 and what was the top trend of interest for executives?
Hmm, AI and machine learning?
OK, that’s true, but the survey includes admins and execs, so the real message this year, to me anyway, is that it’s time to learn more about AI, and ML, and even deep learning, and understand how it might apply to your day-to-day operations rather than planning to add machine intelligence to everything that you do before New Year’s.
But didn’t it say something about containers as well in 2018?
That’s true and that might actually be the most valuable bit. The Trends Report has been running for several years now and it’s more valuable viewed over time than in any single year. So, for example, we watched containers go from, we haven’t heard about it, to it’s bleeding edge, it’s over-hyped, and now you all are telling us, well, yeah.
Yeah, it’s just one more thing that you do.
Yeah, but containers didn’t go away.
No, they went from an executive priority to just another tool used by IT, not just developers in cloud.
Right, because measuring future versus reality over time, lets us diff out admin needs versus executive interest, and also lets us call out periods of disconnect. It’s also global so that we can see the differences between regions.
Well, it’s data, and it’s all about the data.
Yeah, so check it out and let us know if you find it useful. It’s, it-trends.solarwinds.com.
Yeah, back when we started doing it hyphens were not shameful.
Now I blame DevOps. We’ll put the link in the notes. And I don’t know if I’m happier to see anomaly detection added to DPA, or that machine learning might be coming to other SolarWinds products in the future?
It’s going to change how we do IT ops, right, but in a good way. I’ve been wondering, do you ever think that data science, in general, will solve the real questions that is facing humanity?
You mean like world hunger, or global warming?
Not just those, but I’m also, I’m concerned about things like, is this a picture of a cat or croissant, or what about is this a Chihuahua or a blueberry muffin?
So, would you settle for knowing what kind of dog you are?
Awesome, OK well then, let’s wrap with that. And just as a reminder, if you don’t see the chat box over here to your right, it’s because you’re not with us live, so swing by our homepage which is lab.solarwinds.com, check out upcoming show dates, and bring your questions. Is that about it?
Yes sir, for SolarWinds Lab, I’m Thomas LaRock.
I’m Karlo Zatylny.
And I’m Patrick Hubbard and thanks for watching.