SolarWinds Lab Episode 38: Get Your Monitoring Squeaky Clean for 2016

It’s the end of the year and you have used up your vacation (or didn’t get your request in as fast as the rest of the team) and you’re thinking about trying to catch up on social media or online groups that have been defunct for 16 months. Why not put that time to better use by running a few quick scripts, clearing out some dusty old data, checking long-forgotten configuration parameters, and maybe running a wizard or two? Join Head Geeks Leon Adato, Kong Yang, and Patrick Hubbard as they count down the ways to tune up SolarWinds, from log files to disk drives to database tables and more.

Episode Transcript

Okay, Mr. Clean, I think you've missed your calling. But seriously, spring cleaning's not for another three months. Four, since he's from Cleveland. Okay, but IT people know that December marks the dark days of downtime, the time when there's year-end freezes, there's drained budgets, people are trying to use up that last, little dribble of vacation. And what that means is that in most IT shops there's not a creature stirring, not even a—well, a trackball. Oh, trackballs—remember the Kingstons? That was like golden tee. So you figure this is a good time for people to check under their keyboards for spare change and questionable AAs? I mean, you never know, these might still have some charge in them. We can try them. Let me see your mouse for a second. That is not important right now kind of. It's a good time to literally clean off your desk, but it's also a great time to figuratively clean up your act. So, you want to show folks how to clean and tune their SolarWinds environment for a better 2016. Yeah. I like it. Well then, I think we need to kick this off. Welcome to SolarWinds Lab. I'm Kong Yang. And I'm Leon Adato. And I'm Patrick Hubbard, and as always, we're really glad to have you with us again. I think you guys have this under control, so I'm going to step off for just a— No, you don't. You have found some sneaky way to get off this set the entire season, and don't think it hasn't been noticed. Well, it's also possible that I'm just letting you guys do all the hard work and then I go play with new UI and Cloud at GitHub. Well, as the "OG", the Original Head Geek, people appreciate your perspective. So, for this one you are staying put. And GitHub is part of your job description now? It's a little integration code for Collectd for Win and Librato. You like me. You really like me. Okay, I wouldn't push it, Sally Field. But that reminds me, if you really like us, let us know in the chat window that's over there off to the side. 
You can also ask questions and let us know what other topics you'd like to see us cover in future episodes. And if you don't see a chat window, that means that you aren't watching us live. To do that, head on over to lab.solarwinds.com and sign up for reminders of upcoming episodes. Look at how you do. You guys make me so proud. Still pushing it, Sally. Well, as the guest today, Patrick, if you had to do some how-tos on Orion and NPM, what would you do? Okay, now you're pushing it a little bit. Yeah. If it was me, I'd cover a couple of things. First, tidy up your database, right? That's going to be database maintenance, including verify that it's working, make sure that your syslog and trap messages are getting into the database, and probably export a couple of raw queries with the get SWQL link. Yes, and would you like some wine with that cheese, Mr. Swiss? Always got to pick on the API guy. Yes. Second of all, I check my hardware and make sure that you have your polling balanced and that you haven't outgrown your initial resources. That happens a lot. Yeah, a lot of times we think we're going to poll N number of elements. Then management calls and says, "Hey, I just bought another thousand." or whatever. And now I have 2N and 10N number of elements. Make sure that you've got enough hardware as you grow. Then the third is review your configuration. Are you running your maintenance? How much NetFlow data are you keeping? Are you doing logging, and are you filling up your database with debug details that you don't need anymore? Like after resolving a support issue. Or doing a ton of customizations. Exactly. Then the last thing is manual maintenance and housekeeping. I think sometimes our customers will actually run it twice a year, but certainly every year at the end of the year. That's going to be run Permissions Checker, install your upgrades, grab diagnostics, and then make sure that your baselines are good. What an insightful list. Let's do all of that today. 
Yes. So what made you think of all that? I don't know. We asked THWACK. All right, so what are we going to start with? We're going to start with the big guns. We're going with database maintenance and cleanup. Database maintenance. I love database maintenance, especially because it's so easy to see if it's actually working or not. Really? Yes. In fact, we did an episode not too long ago talking about logs. We'll put the link down here below. We'll give you the shortcut to this directory here. Here I am in ProgramData/SolarWinds/Logs/Orion. What I'm going to look for is my... SolarWinds Debug. First what I've got to do is actually select into this box. SolarWinds Debug, but I'm going to take one because I happen to know that's the one from when I just ran. We'll open that thing up. And here's our log. I'm going to hit Ctrl+End, scroll down to the bottom, and I can see right here it's telling me, as long as my logging is set to info, that my database engine completed database maintenance and when it finished. And I should say that this system actually has all the default logging levels, so unless you've been changing things (and we're going to talk about logging levels in a little bit), you should always look for Database Maintenance Complete. If you don't get that, then you know that there's a problem. Right. So then, the next thing is how did I get that? Either it's going to run automatically, or I can choose to run it manually. Running it manually is pretty easy. You can either go out and search for database maintenance, if you want, from here, and run it right from there. Or I can click on the... you put a penguin on top of the Windows button? Of course I did. [Laughter] In this case, I've got it right here in my startup. I'm just going to say Database Maintenance, click Start, and it's Database Maintenance Complete. It really is nice and quick in post.
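The log check described above can also be scripted. The following is only a minimal sketch in Python: it assumes the log is plain text and that the completion entry contains the phrase "Database Maintenance Complete" as quoted in the episode. The exact wording and layout of the real log lines are assumptions, and the function names are hypothetical.

```python
from typing import Iterable, Optional

# Phrase quoted in the episode; the exact wording in a real log is an assumption.
MARKER = "database maintenance complete"


def last_maintenance_line(lines: Iterable[str], marker: str = MARKER) -> Optional[str]:
    """Return the last line containing the marker (case-insensitive), or None."""
    found = None
    for line in lines:
        if marker in line.lower():
            found = line.rstrip("\n")
    return found


def check_log(path: str) -> Optional[str]:
    """Scan one log file; the path is supplied by the caller."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return last_maintenance_line(handle)
```

If `check_log` returns `None` for the most recent debug log, that is the "you know that there's a problem" case from the discussion, provided logging is still at its default level.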
Of course, for you guys it may take a little bit longer, obviously, if the database is… Real database, real data, just a little bit. Yeah, but hopefully you saw some of the things flash by and let you show what the stages of that maintenance were. Uh-huh. And again, that's going to then show up here in the log to let you know it completed. The next thing, when we're talking about database maintenance, is actually something that sometimes happens automatically, sometimes doesn't— it depends on how your database was set up— which is rebuilding indexes. Some DBAs and people who set up databases don't have automatic re-indexing turned on, some do. That I'm leaving to your database team. But there is a command, or a set of commands, you can run to automatically re-index. I'm going to jump over here, and we actually go on to the database itself. Yeah, Leon, re-indexing is always important. I remember my days in TPC and SPEC and benchmarking, and re-indexing would always get you that performance. After statistics runs. Because how many times do you make changes to indexes and it sort of improves performance a little bit? Then you go to bed, come in the next day, and all of a sudden now, it's an order of magnitude faster because statistics has run and it will change the execution plan based on the statistics. Great point. For people who are not terribly familiar with databases in general, this isn't something you're going to run every hour. You're actually not going to run this when something's broken. This is normal maintenance. Okay? But if you've got a large database, or you've added a lot of devices recently… Or the database maintenance took a really, really long time, longer than you remember. Right, or it didn't complete. These are all reasons to run it. There's lots of different re-indexing routines out there. I have one up on the screen. We've got a link to this in the show notes. It's just a series of SQL commands that you run on the database server. 
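To give a feel for what a re-indexing routine does, here is a hedged Python sketch that turns fragmentation figures into `ALTER INDEX` statements. It is not the script linked in the show notes. The 5% and 30% thresholds are a commonly cited rule of thumb and are assumptions here, and the table and index names in the usage note are hypothetical.

```python
from typing import NamedTuple, Optional


class IndexStat(NamedTuple):
    schema: str
    table: str
    index: str
    fragmentation_pct: float


def bracket(name: str) -> str:
    """Quote an identifier T-SQL style, doubling any closing bracket."""
    return "[" + name.replace("]", "]]") + "]"


def maintenance_statement(
    stat: IndexStat,
    reorganize_at: float = 5.0,   # assumed threshold, tune for your environment
    rebuild_at: float = 30.0,     # assumed threshold, tune for your environment
) -> Optional[str]:
    """Pick REBUILD, REORGANIZE, or nothing based on fragmentation."""
    if stat.fragmentation_pct >= rebuild_at:
        action = "REBUILD"
    elif stat.fragmentation_pct >= reorganize_at:
        action = "REORGANIZE"
    else:
        return None
    target = f"{bracket(stat.schema)}.{bracket(stat.table)}"
    return f"ALTER INDEX {bracket(stat.index)} ON {target} {action};"
```

For example, `maintenance_statement(IndexStat("dbo", "Nodes", "IX_Nodes", 42.0))` yields a `REBUILD` statement for that hypothetical index. The generated statements would still be reviewed and run by whoever owns the database, with good backups in place.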
I'm using SQL Management Studio. There's nothing more to do except hit F5 and let it run. The only thing you're going to see when it's done is Command Completed Successfully. Or you might get errors, in which case you have to work them through based on what you're seeing. That's it, but it is something you're going to want to do, depending on your environment size, quarterly, monthly, but at least once a year. Our support people also say that there's a multitude of ills that can be solved just by running this one thing. And that's actually a really good question. We're talking about getting in here and messing with the database, and a lot of times we would say, "Hey, you really don't need to do that," or, "It isn't recommended." Do you feel like this is something that everybody ought to be doing all the time, or is this something that a more experienced user might want to do, or someone who maybe is new to Orion but just really is careful about making sure that they're performing maintenance? All right, so there's a few things, and I'm going to channel Tom a little bit here, in terms of... wait a minute. Tommmmm. Bacon. Much taller now. Okay. [Laughter] First of all, you've got to know that you have good backups. You've got to know that you can restore from those backups. As long as those two things are true, then I'm going to say that, first of all, no, it's not required to get into SQL Management Studio. It's not required to manage the database. Our regular routines will keep things relatively healthy. However, if you want to be in the cool kids' club, if you want to get just the most out of your environment that you possibly can, if you're a tweaker, if you're an overclocker, if you've got that mentality... Or you're doing a lot of dynamic configuration. You're doing a lot of discovery, adding and removing devices that are being monitored. Right. This is one of those things.
This re-indexing is one of those things that you can do just to make sure that you are getting every erg of power out of your environment. Not required. And it makes you feel clean and shiny. Right. It absolutely does. So this is one of those things where I would say that, for most monitoring professionals, getting comfortable with SQL should be on your bucket list. You know, in our benchmarking realm we would call that optimizing beyond Pareto, beyond the 80/20 Rule. Very good. Yes, so beyond Pareto, looking... to Pareto and beyond! [Laughter] The next thing I want to bring up in terms of optimizing is trap and syslog. Yes. This is something we talked about a couple of episodes ago when I had Jason Ferree on, so I'm not going to go over all of it again. You guys can look it up. That's one of the reasons to go to www.lab.solarwinds.com: to look at past episodes. However, it's worth checking to make sure that the trap and the syslog messages are being cleaned out. Because they're just so efficient. Okay, so there's a really great story that Bill Fitzpatrick gave me just a little bit ago that I have to share with everyone. So shout out to Bill. Which is that syslog messages in a database are like you bought a dozen eggs, and then you got 12 shoeboxes, and you took each egg out of the carton, put it in one shoebox each, and you put all 12 shoeboxes in your fridge. That's about as space efficient. Syslog messages are very small and very tight and very compact, and the amount of space the database has to reserve for that message is very large, so you want to make sure that you're clearing those things out, that they are being cleared out as often as you can manage, whatever your settings are. We'll talk about how to check those settings in a little bit. But when people wonder, why wouldn't I just leave them there? What's wrong with them being in there? Because they're chewing up a bunch of space that they're not making efficient use of. Except they could potentially, right?
Some of them can be very, very long. They have to be able to accept the maximum length message, but most syslog messages are short. That's why it's inefficient. Yeah, exactly. All right, so checking settings— this is one of those things that it's easy to overlook. Things just kind of take care of themselves. I do this—it depends on the environment, but if I have a lot of modules installed and I've got more than a few hundred elements, I usually review this at least once a month. Right. And if not that, at the end of every quarter. But I do try to make sure that I do that. There's a lot of things that you can tell just from the engine load status. The way to get to that for all of this polling information is going to be in Settings, so I'm going to say Settings, scroll down here to the bottom, and Polling Engines. I've got one polling engine installed in this environment here. This is going to tell me everything from whether it's the primary or not. If I have multiples, they'll just be listed one after the other. I think you actually did one at THWACKcamp not too long ago, where you broke this out into an Excel spreadsheet to be able to do load balancing between them. Right. Let's throw the link for that up in this episode as well. That was really great to balance between load balancers, to use your agents as effective load balancers. Correct. But in this case, it's going to give me versions. It's going to remind me where it is, and that's really handy if you have multiple pollers, to be able to figure out where they are. You'll also see if you're over your count in terms of licensing you'll get a message here that will tell you, "You have 8000 elements. You only have a license for 5000," or whatever it is, so that you know, because those aren't being monitored. Even though you have them in there, they aren't being monitored. The other thing is polling completion. 
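The licensing message mentioned above comes down to simple arithmetic. This is a small illustrative Python sketch, using the 8000 and 5000 figures from the discussion. The function names are hypothetical and this is not how Orion itself computes or reports the number.

```python
def unlicensed_elements(element_count: int, licensed_elements: int) -> int:
    """Number of elements beyond the license, which would not be monitored."""
    return max(0, element_count - licensed_elements)


def license_message(element_count: int, licensed_elements: int) -> str:
    """Build a human-readable summary of license headroom or overage."""
    over = unlicensed_elements(element_count, licensed_elements)
    if over == 0:
        return f"OK: {element_count} elements within a license for {licensed_elements}."
    return (
        f"Over license: {element_count} elements, licensed for "
        f"{licensed_elements}; {over} are not being monitored."
    )
```

With the numbers from the episode, `unlicensed_elements(8000, 5000)` is 3000, which is the set of elements that sit in the system without actually being polled.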
That's important for a couple of reasons, one of which is that if you're not able to get all the way around the circle then that's data that you haven't collected in that polling period. That's an issue, but also if you have down nodes. That's going to get worse because now the polling engine has to... It's going to increase the frequency to identify when they come back up again. Right. Actually, we've got that report from THWACK that shows... you want to show it? There's a report, and I'll just show you that it's downloadable from Content Exchange. We're going to put the link in the show notes. This is what it looks like on a very small system. You can see the polling engine name, whether it's primary or an additional poller, what its uptime is, its element count, and so on. But it adds a few other things for insight. For example, universal device pollers, which have an impact on the polling engine, but they don't factor in to the element count. Right. I think this nets out where you have issues more clearly than anywhere else, because it's going to basically break it down by nodes, volumes, your SAM tests, and overall SNMP failures one at a time so that you can figure out where your issues are. If you've got a whole bunch of SAM issues, that may be a single application where that system is offline. So that may not indicate that there's a larger issue or an ACL issue with SNMP, for example, where you might have several hundred nodes represented by only several hundred polls. And the other thing here, 2.91 hours, is that when you installed this? Because I'm thinking that you basically did a quick discovery and got this running in less than an hour or two. I might have done it very quickly, but no, I had to do some regular updates and patches just to be ready in time for the show. There was a reboot, but I was prepared before this morning.
I've seen you go from zero to installed in less than a half an hour, so I'm just watching to figure out when you get really, really fast. Right. Exactly. Once again, these problem columns in the report are useful simply because if you have a large number of problems it means that your polling engine or engines are working really hard to figure those out along with all the normal stuff. Okay, so for polling, another one would be when am I running my maintenance? Yes. Well, that's going to be in Settings, where everything else is, so I'm going to go to my Settings page, and I'm going to scroll down here to Thresholds & Polling and Polling Settings. Of course, the reason it's there is because— It's all the settings for your poller? That's right, all in one place. It does what it says and says what it does. This is going to give us things like overall default polling intervals. It's really handy to check. And you might have been doing debug, or maybe you had a bunch of outages related to systems that you're trying to monitor with a higher frequency and you just, oops, forgot to decrease them back down to the defaults, or to something that was normal for your environment, it's one thing to check. If you suddenly see that you're polling every five seconds, then it'd probably be something you'd want to back off of a little bit. Well, there's one more thing there. There's a button that will reapply. Because if you have individual devices that you've increased the polling because it was in a problem state, this is where you can push those levels right back again. And I've had some clients where that's a problem, where somebody got really customization happy, and this is a way to get it all back to normal. Also, think about your polling timings. Do you really need to poll data from your discs every three minutes? Probably not. Fifteen is probably good in most environments. Some it's not. I'm not saying it's for everybody. 
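The three-minute versus fifteen-minute question above is easy to put numbers on. This is a rough Python sketch of the arithmetic only. The volume count in the usage note is a hypothetical figure, and real polling load also depends on what each poll collects.

```python
MINUTES_PER_DAY = 24 * 60


def polls_per_day(interval_minutes: float, elements: int = 1) -> float:
    """Polls issued per day for a given interval across a number of elements."""
    if interval_minutes <= 0:
        raise ValueError("interval_minutes must be positive")
    return MINUTES_PER_DAY / interval_minutes * elements


def reduction_pct(old_interval: float, new_interval: float) -> float:
    """Percentage drop in poll count when moving from old to new interval."""
    return (1 - old_interval / new_interval) * 100
```

For a single volume, a 3-minute interval means 480 polls a day and a 15-minute interval means 96. Across a hypothetical 1,000 volumes that is 480,000 versus 96,000 polls a day, an 80% reduction in work for the polling engine.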
But this is where you can see where your global settings are and reapply them if you need to. Right. But now, if you're using Storage Manager, then you can actually poll less frequently because what you're really trying to do is look at traffic spikes that would indicate you were filling up a drive. Correct. And then you would spike on a threshold for a write activity to that disc, that it was actually filling up a little bit more in real time instead of polling it. Then the other thing here is Archive Time. Under Database Settings right here, Archive Time, that is when it runs. That is the database maintenance that we were looking at before. Do you ever leave that default? Again, it depends. I know that you want me to say, "No, I never leave it," but 2:15 in the morning might be a good time for some organization. Hey, it's not midnight. It's a little smarter than that. Right, if I set a default; however, if you know that you're running— Backups at 2:00. Backups, defrag, antivirus, anything that's going to conflict with this on your database server or on your polling engine then this is your chance to sort of shift that around. Pick a different time. Yeah, it's funny because I kind of go back and forth with do I run this first and then run my backups. So, for me, it depends on when my statistics run. By default, SQL runs statistics right before it runs the backups. Sometimes people turn it around, but they'll usually have it at the same part of the job. In that case, going back to the point earlier about re-indexing, then you would probably want to do this after you do your re-indexing to get full advantage of running statistics again. So earlier, we were talking about syslog and trap. There's another thing on this same screen that we want to do— just scroll down a little bit—which is how long those messages are retained. I believe the default is seven days. Some organizations need to keep them for longer, some don't. 
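To see what the retention setting means for table size, here is a back-of-the-envelope Python sketch. The seven-day figure is the default as recalled in the episode, and the daily message volume in the usage note is a hypothetical number chosen for illustration.

```python
def retained_rows(messages_per_day: int, retention_days: int) -> int:
    """Approximate steady-state row count for a message table."""
    if messages_per_day < 0 or retention_days < 0:
        raise ValueError("inputs must be non-negative")
    return messages_per_day * retention_days


def rows_saved(messages_per_day: int, old_days: int, new_days: int) -> int:
    """Rows no longer kept after shortening the retention period."""
    return retained_rows(messages_per_day, old_days) - retained_rows(messages_per_day, new_days)
```

At a hypothetical 50,000 syslog messages a day, seven days of retention holds about 350,000 rows and three days holds about 150,000, so shortening the window drops roughly 200,000 rows from the table.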
A lot of organizations look at trap and syslog as if three days have passed since a trap or a syslog message it's over; there's nothing I'm going to gain from keeping them. You either got it or you missed it. Right. I'm either dealing with it or I'm not dealing with it. Once again, it takes up a lot of space in the database, so one way to keep your database working efficiently is to maybe reduce that number, or at least consider that number and see what's there so you can set that along with all the other archiving and when the aggregation happens—summarization aggregation on this screen. That's right, and then the one last thing we talked about before is your NetFlow storage is actually pretty easy to manage here because it fortunately is no longer in the database. It's actually out in Flow Storage. The feedback from you guys on how that has improved performance— I mean, forget what it did for NetFlow, which was pretty amazing. Just what it did in terms of hammering the database and really reducing the need to run maintenance as often has been amazing. That one again here—I don't think we have NetFlow installed, but when you're at your Settings page, you'll go down to NetFlow Settings, and then you can actually get the poll information for your Flow Storage engine. That will tell you how often it's doing its folding for data retention and everything else. How long you're keeping it and how often it's being summarized and how often it's going to long-term storage. Again, that's an organizational decision. Right. Some organizations want to keep NetFlow data for a really long time, and some are like, "An hour, that's all I need. That's all I want." But since you're just throwing storage at it now, not database, it's a lot easier. Throw it out on its hand and then make that decision after. Exactly. All right, logging levels. [Explosions] Wait, are we safe in here? Yeah, I think so. But just to be safe, I got you a hard hat. [Laughter] Thank you. 
Why do we have hard hats? Because we're expanding the building again, and so they're set up right overhead, and so you're probably hearing some of that on the mic. Anyway, any time we hear something, especially something really long like that last one, we'll just pat the hard hats. We'll just acknowledge it. And notice that it's only one hat, so only one head geek is going to survive. We're all going to duck under it. [Laughter] But you asked the question, "Am I safe?" so I can guarantee you'll be taken probably. Right. Okay, so logging. Set logging to verbose noises. Logging, the main thing is what is normal. If you've been experimenting, maybe you were experimenting with logging, or maybe you were working on an issue you're trying to resolve; something like a polling issue where there's a lot of latency. And maybe it's related to network or something else, and you've looked at a THWACK article, for example, that's talked about how to adjust logging, in case you've never seen it, and I think we covered this like a year and a half ago. We actually covered it a couple of episodes ago. A couple episodes ago, yes. Definitely check out the logadjuster.exe. The LogAdjuster is really, really handy. There's a couple of different ways you can run it. There's actually a way you can run it from the command line, so you can have a script to turn things on and off. So if you want to toggle a little bit more easily and not use the GUI, you can do that. This is what it looks like. For each one of these things, actually for anything inside of Orion, you'll notice that you can set the value of each one of the log levels. And it will also tell you the number of files and how much you have out there. So again, even if you don't make any changes here, it's really handy because you can scroll down here and say, "Wait, I've got 100 megabytes of database maintenance? Wow, that's kind of a lot." You can also adjust those as well. You can set debug for an entire section right there.
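The "how much do I have out there" check can also be approximated outside the LogAdjuster. This is a minimal Python sketch that totals file sizes under each subfolder of a log directory and flags the large ones. It does not call LogAdjuster or any SolarWinds API, the folder layout is an assumption, and the 100 MB default simply echoes the figure mentioned above.

```python
from pathlib import Path
from typing import Dict, List


def log_sizes_by_folder(root: str) -> Dict[str, int]:
    """Total bytes per immediate subfolder of root; loose files count under '.'."""
    base = Path(root)
    totals: Dict[str, int] = {}
    for path in base.rglob("*"):
        if path.is_file():
            parts = path.relative_to(base).parts
            key = parts[0] if len(parts) > 1 else "."
            totals[key] = totals.get(key, 0) + path.stat().st_size
    return totals


def oversized(totals: Dict[str, int], limit_bytes: int = 100 * 1024 * 1024) -> List[str]:
    """Names of folders whose total size exceeds the limit, sorted alphabetically."""
    return sorted(name for name, size in totals.items() if size > limit_bytes)
```

Pointing `log_sizes_by_folder` at your own log directory and passing the result to `oversized` gives a quick list of places where a forgotten debug setting may be filling the disk.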
But of course, your favorite button is what? Reset. Reset Defaults. If multiple people on your team have access to the SolarWinds poller, or you've got a couple of people working on problem issues, probably the support guys are going to be changing logging levels. So every once in a while, every quarter, every half, every whatever, you want to go back there and just hit that button, get it all set back to regular. That way you know that you're not chewing up more disk space than is absolutely necessary. That's right. And be fearless with that button, and the reason is that if we make changes from update to update, where we really think maybe that ought to be an error or an info or we change the status on it, that's always going to be the current reset to default value. So even if you haven't done that for a while, when you push that button, that's going to be the recommended log levels for everything in your system. So what you're saying is reset to default will be the out-of-the-box SolarWinds best practices. I would definitely say that. I don't think it could be articulated any better than that. The other thing you're probably going to want to do is run the Permission Checker. That is a really handy tool. We talked about it, again, a couple episodes ago with Jason Ferree, and it was a surprise to me. I didn't realize that if you've had somebody install a module or a patch under their user name, as opposed to local administrator on the box (and this is my little commercial for always installing SolarWinds tools as local administrator, not domain administrator, not Joe the administrator, local administrator). But if you haven't, then some of the directories or files might not have the permissions they need. Permission Checker is the thing that will put it back. That's right. And usually it's not that your Orion will be down. No.
It would be a weird polling error, or you'd have most of your data would be working, or maybe one particular subset like CBQoS, or something else where there's a particular module, a particular DLL or something else that's actually driving that functionality where the permissions are messed up. In this case, I just ran it, and you can see how fast it is. If you click the repair button, it's going to set it back to defaults for you. It is just a really handy tool. Again, if for some reason you're not set standard and everything's not running great, you might want to just check with Support before you hit repair, but 99 times out of 100 this will clear it right up. The last thing—this is my own little PSA— Public Service Announcement— is you want to make sure that you've installed all the patches, all the upgrades, that you're current. Now, current is not necessarily about the glitziest features. Current means that the security patches, the fixes, the processing, the performance improvements are all in there. So it's not something-- When we put out a hot fix, it is generally not just for people having a problem. They're generally for the whole environment. You want to read the list. We'll put it on THWACK. But generally speaking, it's a good practice to be into. Be upgraded and be current. And the only time that you might want to hold back on an upgrade is if you're working with Support on a particular issue. Yeah. If there's something unique to your environment, they've given you a hot fix, and you're working with them, don't upgrade anything else until you work through that issue. But the other possibility might be— and I think we talked about it also in one of the other planning episodes— is you might want to stage several upgrades together, or you might have more of a DevOps approach where you actually are doing upgrades on a certain date whenever software is available. I kind of go back and forth on that. 
You can say, "Well, I'm going to wait until the end of the year and do all of my upgrades," or do you say, "I'm going to do Orion Core with NPM first, and then I'm going to wait a little bit, and then I'm going to upgrade my SAM, and then I'm going to upgrade other modules"? The great thing is, out on THWACK, there are a couple of best practice guides about how to do that, also multi-version upgrades, and that's a great thing to do at the end of the year for planning. Let's say you've been ignoring your emails from the customer team and Support and you are now four versions behind. Ahh. It happens. I know. There are a lot of you guys that just run for years without making a lot of changes, and it means that you're missing features and you really should probably upgrade. It goes back to that uptime, right? If it ain't broke, don't fix it.