Home > SolarWinds Lab Episode 65: Orion Database Maintenance

SolarWinds Lab Episode 65: Orion Database Maintenance

In this episode, Head Geeks Thomas LaRock and Destiny Bertucci will explore the exciting world of database maintenance for your Orion installations.

Back to Video Archive

Episode Transcript


Hello and welcome to this edition of SolarWinds Lab, where Destiny and I are going to talk about maintenance. That's right. Wait, we're doing a lab on licensing? No, no, no, no. Nobody wants that. Today we're going to talk about something much more exciting. Database maintenance for the Orion installation. Yeah, okay...and how is that more exciting? Come on, what could be more exciting than the rebuilding of indexes and updating of statistics and the amazing world of database backups? I could think of lots of things more exciting than those things. Like a nap would be more exciting than those things. A nap? Hey wait, have you been watching those videos of me pretending to be a DBA? Hmm. I can explain everything. Yeah, okay. Just introduce the episode, Napster. Hi, I'm Thomas LaRock. And I'm Destiny Bertucci, and today on SolarWinds Lab, we are going to help you understand how to handle maintenance tasks for the Orion Platform database. That's right. It's the most exciting database episode ever. We're going to talk to you about the things you're going to want to do prior to installing Orion, the things to do immediately after the install is finished, and then we're going to show you some things that you need to do on a regular basis in order to keep your Orion databases running smoothly. I can't wait to see that. You were the one who showed me. Okay, quiet down...they don't know that. Oh. [Static sound effects] Okay, let's start at the beginning. I want to talk about all the things that we want them to do before they even install the Orion Platform. Okay, well that's right. I've seen many customer support tickets that could have been avoided just by taking care of a few things prior to the installation. A few bad choices now can really cause a lot of pain later. Yes, absolutely, I mean you could find yourself trying to rebuild indexes and working like a hamster in a wheel, and performance still goes nowhere. Exactly, and an ounce of prevention... Is worth a pound of bacon. I knew he was going to say that. Yeah, because it's right there on the teleprompter. Anyway, let's talk about the things that we want everyone to review before they install the Orion Platform. Okay, we're going to talk hardware, storage, and user permissions. Permissions? How are user permissions related to database maintenance? Well, just you wait, you'll see. First, let's start with something easy. Let's talk to them about how to make sure they have enough storage for their needs. Right, storage and space. That always seems to be an issue. Mostly for customers that decide to start monitoring all the things without understanding just how much data that may mean. It's not just a collection, though. It's a combination of collection along with retention that can really kill your space requirements. Right, so you want to size your database accordingly right from the start. Now, as a DBA, I'm a huge advocate for creating my database with all the space I know I need by end of life. But that is really the way things were when we lived in the physical world. Now it's a virtual environment. And that means we really only get to size for our needs right now, with the idea that we'll grow later. Yep, and that's what happens. But still, it's a good idea to be able to tell your server team, "Okay, we need this much now and we're probably going to need "this much more in three months from now," and you keep that dialogue going forward, letting them know you'll need more in advance of the time that you actually run out of space. Okay, so how do we help our customers arrive at this guesstimate for their sizing needs? Well, the official version is this: [clears throat] There is no fixed formula to determine how much space to allocate for the Orion SQL database. We recommend a 30-day evaluation of your environment to observe the growth of your database and then adjust the retention settings. Okay, so you can guess the size by knowing the size before you install. Yes...and the database size depends on the number of the following: Number of elements you are polling, how much traffic is going through these devices, the retention setting for your statistical polling, syslog messages, if you receive them, and how much of them, trap messages, if you receive them, and how much of them, and then you also have retention settings for all of these as well. You've got the top talker 95 percent, DNS, IP address lookup, wireless, universal device pollers, you've got hardware health, and other modules, and I'm pretty sure we're going to show a link that we could shorten this to its course. That is a lot for them to consider. Yeah, so here's the best guess we give. If you have a full SLX with NetFlow, syslog, traps, and seven days retention, then allot anywhere from 10 to 50 GB to the Orion Platform database. Because generally speaking, if you have an SLX and NetFlow, that means you're monitoring a lot of interfaces, not necessarily that you're... the NetFlow database, because it's not, but just you have a lot of interfaces that you're probably trying to keep track of which takes up space within the Orion Platform database. Right, and so again, as a DBA, I'm going to put my DBA hat on and I'm just going to say, if I hear from 10 to 50, five-zero gigabytes, I'm going to say, "Give me all 50 to start," right? Because I know that if my database is being backed by spinning rust somewhere, I want as much contiguous disk storage as I can get. And you want to do that in order to minimize fragmentation from growth events and the indexes for tables, etc. Right, and like you said, this is why we... The advice we give is, "Let it run for a week "and then adjust from there." So, by itself, that advice really doesn't help as much as you might hope. But if you, say, allocate your 50GB, then monitor and adjust from that as your starting point, that's actually not a bad approach to take in the absence of, you know, there's no knowledge about your particular environment. Okay, enough about the size of the database. Let's talk about hardware. Right. Didn't we do a Lab about storage for Orion? Yes, we did. Lab #48, the Database Dilemma. And that's where I made Leon cry. Oh my. Such good memories. Okay, that sounds like a lot of fun...like a lot of fun... but let's bring us back to the focus here and tell us the result of that episode. Alright, so what we did in that episode was we took the Orion Platform, the database, and we put it on different RAID levels, right? And then we said, "Okay, now we have "these different RAID levels, let's run "one of the biggest workloads that we can think of." And what was that? That was the configuration wizard. Yeah, that can be a bear, there's no question. So, that's what we did. We had a server that was all RAID 5, we had a server that was all RAID 10, and then one all-flash array. We ran the configuration setup. Now, I don't want to spoil the result, but I will tell you that RAID 5 was the worst performer. Which makes total sense because Orion is almost a pure OLTP workload. That's write-intensive, so any RAID configuration that has a write penalty will perform worse than a RAID configuration without a write penalty. Exactly, yet we still have customers that come to us, they open up a support ticket, they complain that they're having storage performance problems, and then everybody seemed surprised to find out that, "Hey, maybe RAID 5 wasn't the best option." So, save yourself the headache later, configure your storage properly from the start. And Orion isn't as memory- or as CPU-intensive either. You can see the recommendations that we make, and we go out of our way to talk about the size and RAID. Not so much about the memory and CPU. Okay, so let's talk about that last item for pre-installation, user permissions. Right, so there are times when Orion wants or needs to interact with the database, and they don't have enough permissions. This is often due to those nasty DBAs not giving us the correct logins that we asked for in the first place. Hey now, look...you're the security geek. You should appreciate the principle of least privilege. I certainly do. Alright. And I'm not asking for more than what I generally need. In fact, we even list out the specific permissions needed. And this isn't just for the install. It's the permissions we need for ongoing maintenance. You know, that thing we're supposed to be talking about here today. Yes, of course. You can see those permissions listed completely on our support site. And I've seen the Orion Platform install have issues with maintenance due to the lack of sufficient permissions. So take care of that prior to the install and have less maintenance headaches later. And just bribe your DBA with some bacon if necessary and get those logins created correctly the first time. And again with the bacon. Don't judge me. [Static music] Okay, so we've installed the Orion Platform. Things are running, it's collecting the data and metrics, and it's consuming the data, and it's presenting dashboards and all the Belgian Whistles. Wait, did you just say Belgian Whistles? Yes, but don't worry about that right now, let's focus on what happens next. We want to help them understand what steps to take immediately after the install is complete. Okay, I like to do this by breaking things down into two groups: Tasks internal to the Orion Platform database, and tasks external to the Orion Platform database. Yeah, that's a great way to divide and think about those tasks, because those are sometimes two very different groups or people. True, but we do have a large number of Accidental DBAs out there that end up doing all the tasks. Still, it's helpful to think of these tasks as internal and external. So let's talk about the tasks that Orion will do itself, internally, and the tasks that need to be done externally. You mean by an experienced enterprise-class DBA? Well, if you have one of those around, sure. Otherwise, you'll do. Okay, good, let's break them down. First up, the external tasks. And the most important one of those: database backups. Yeah, you're going to want those at some point in time...get it? Yeah, I got it. And you're right, those backups become important because they are needed for your recovery plan. You can't have a recovery plan unless you have a backup. And as a DBA, if you can't recover, you can't keep your job. So you're going to want to make sure that those backups are being done, and this is for all the associated databases that you may have. Yep, and the Orion Platform may have more than one database, depending upon what products you have installed. You are going to want to make certain that your backups are actually taking place. So if you have a DBA or a server team member that's already in charge of the backups, you want to verify with them that these databases have been added to whatever backup process is currently in place. And if you're the Accidental DBA and you need to do this for yourself, you're going to want to configure probably a maintenance plan inside SQL Server Management Studio. And one caveat about this is to make an effort to avoid having the backups taking place during or even near the same time as the internal maintenance. Okay, so let's shift to those internal tasks. So, what tasks are really handled by Orion itself? Orion does a bunch of work itself. There's no need to have these configured externally. The Orion Platform will do this out of the box. But you can configure the timing of the maintenance. Okay, so let's just take a look. Alright. [Static music] Okay, so the standard stuff that most DBAs would configure in a maintenance plan - things like rebuilding indexes, updating statistics, backing up a database, you know, those are things that... Orion can do some of those things, like just out of the box. But there's some extra stuff that Orion's going to do for you, which is why we recommend that you use the internal maintenance. Yes, because the same place you set the timing for the maintenance to run is also where you configure retention. And we've already talked about the role retention plays for performance. And the KB article doesn't list the details, but here's what the Orion internal maintenance does for you. One, it will clean up the orphan nodes that you have there. Important, yep. And it will prune, archive, and roll-up data according to retention settings. Yep, you mentioned that. And it will update statistics for tables and indexes. Right. And it will rebuild and organize indexes as needed. Okay, so let's show them where it is. Alright, so we're in the settings page and we're going to scroll down to the actual Polling Settings because that's of course where your database is going to be. Well of course, naturally. When I looked at the screen, and I wanted to know where the database maintenance tasks would be, I thought to myself Polling Settings. Exactly, and so if you scroll past those Polling Settings, that is where you're going to see your actual Database Settings and the things of which we were talking about. It was actually almost above the fold if you have a big enough screen. Correct. And so the archive time is, as you can see, out of the box, is at 2:15 a.m., and then the checkbox is for the index defragmentation. Now, if you're an upgrade person, you may want to come in here and make sure that you have that checked because if it wasn't there before in the database, it might not be there. Just a good place to see to make sure that you are doing those statistics and index updates. Now, archive time, though... To me, archive would mean something more like a backup. But what does archive really mean to the user here? This is when it's going to perform all the tasks that you see here. So that's your roll-up tasks, your indexing, your pruning of your nodes, things of that nature. The four that we were just listing, this is when that starts. Yeah, so pruning nodes isn't listed here, but that gets done. Correct, yes, because that's...that's done on the backside of there, so any of your orphan nodes that are within your database, that is how you prune them correctly out of the database. Okay, and so what else is happening on this screen for us? So you can see that you have your database retention settings that are set up there. So you have your hourly, and...or, your detailed, your hourly, and your daily. So if you have things such as compliance reports that you have to keep detailed information to check out anomalies, or if you're applying patches and you need to make sure you verify the information for more than seven days, this is where you're going to increase it. Okay, what's this interface baseline calculation? So that is when it gets done for the baseline. That is a load on your database, so that's why we offer you, do you want to do it every day, or just, you know, certain days and times? Because remember, the time for all these starts at 2:15. The more accumulation, or the more that you have in your database, the longer this can take as well to be completed. As we see here, indexing is for 3600 seconds, which is about an hour. But it can go up to 50,000 which is 14 hours. 14 hours of index maintenance...hmm. I mean, is that a problem? You know, I'm not going to say it's a problem... I'm just going to say... Yes. So, 14 hours. If you have to wait 14 hours for your index jobs, you know, defragmentation to get done, then you really got to look at what your retention settings are, how many elements you're really monitoring for. Down here, the biggest ones, right? Syslog and trap messages. You see there, default of three days. So I mean, here is where you would come in and start making some changes and see if you can change... because likely, if indexes are taking that long, it's because of the size of the database. So the first thing you look at is not, "Hey, can I just truncate some tables?" It's, "What am I retaining? "And can I really make some adjustments here?" So, a lot of the database maintenance stuff happens here. And, as the Orion administrator, this is where you really want to do a lot of those internal tests. Definitely. If the DBA insists on doing index rebuilds or whatever, you have to let them know that Orion itself is going to be doing a lot of tasks, alright? And you've set it up at 2:15. You also want to coordinate with the DBAs, though, because you want all of this to happen and then you probably want to have a backup done. And the reason you want to do that is if you ever have to roll back to that point in time you don't want to wait for all this maintenance to have to happen again, right? So consider this consider this your end of day. At 2:15 this will run. You should have an idea of how long it would take, and then you should be able to say, "Alright, DBAs," or whoever's doing the backups, "Could you start my backups at around say 4:00 a.m.?" Or whatever the time would be. Because then, for you, that would be a nice, clean way, if you ever had to restore, place to start. You wouldn't have to redo your maintenance, right? And I think this is a really good way for people to understand you don't want to overlap those two functions. The DBA really wants to do index maintenance. Trust me, we do. We want to do the index maintenance, and we want to update your stats, and we want to do all the backups, and we want to control all of that simply because we want to make sure it's getting done. And if somebody comes to us and says, "Hey, was this done?" I need to have a list of it, and I only have a list of it if I'm the one doing it. So, it's important to have that communication so that the Orion administrator says, "No, no, don't worry about the index stuff, "we're supposed to be doing that internally. "If we have performance problems, then we'll have to go "and look at it together later. "But I don't want you to have a duplication of efforts "in your environment, having people trying "to do the same task, essentially, twice in one night." That makes complete sense. [Static music] Okay, so we've taken care of the stuff before the installation. And we installed Orion and looked at stuff to do immediately after the install, and we've even danced with the devil that is the DBA. So what's next? Okay, next is the ongoing monitoring of your Orion Platform and environment. Oh, you mean the standard performance tuning and monitoring that any good DBA enjoys doing each and every day... for the remainder of eternity. Well, sure, there's that option. But I also mean looking at logs, searching for errors, and diving into the database to make certain that we aren't overloading things. Sure, okay, that sounds great. Where do we start? Alright, well, let me show you. We're going to actually go into... C:, ProgramData, SolarWinds, Logs, Orion. For some of you, you probably have been in here, but if not, you'll want to go to this View and make sure that the Hidden Items is actually checked, or you're not going to see the ProgramData folder. Okay. Now, once you get there, I'm going to go into... the actual logs and then Orion. And what I'm going to search for is swdebug. That's going to actually bring up the S W maintenance log, so that we can actually see if it was completed, what the fragmentation was before, the tables that are coming across, and then what I like to do is I scroll all the way past all that great information, and I automatically look for... Have these objects been completed? And so if there's an error, an error would appear here? Correct. And you can also...so how long did this take? For example, it says 2:16, so when did this start? 2:15...that was the default. The default was 2:15, right. This is a very small database. Small. Tiny. So it took only a minute and a half and then went through all the tables. You're getting some really good information here. Yeah, and it'll tell you each table that it went through with the indexing size as well as how it rolls into the data and everything that comes across there, so... It's pretty intense on the information that it puts out through there. You can get more information if you go to the log adjuster because you can increase it verbose, debug mode, so that you can actually see the queries that are coming across there, things of that nature that comes across. Awesome. Something that I like to tell people, though, is before you start an upgrade is look at your maintenance logs. Make sure you don't have any known gotchas that are happening, your maintenance is completing, so that you know that your rollups are going through and that you don't have, like, a table that's just huge because it hasn't been rolled up. And then that's going to cause a problem when we're transferring data or creating new tables that it'll come across. Yep, good point. Alright, so the next thing that I like to look for is kind of monitoring the growth of things, or how things are going across there. So we want to check, like, your table sizes and things of that nature. Okay. Alright, so I'm going to go through here. And we are going to grab the Database Manager. Alright...and is that the name of it? Database Manager? Yes. Alright. So also, as you can see here, we had the Database Maintenance. Some people just key in here, since it's there, and will use this before they do the upgrade. Okay. Run the maintenance to make sure it goes complete before they start an upgrade, and that just starts that archive that we were talking about. You can start it during the day, or however, you're wanting to do it. Okay, so Database Manager is a tool that gets installed with Orion, and it's a way for you to kind of browse through the Orion database itself. And we're going to look for things like size of the table. Right, and so you can hit Add Default Server, and what that's going to do is use the credentials that are behind that we're using. Okay. So, since they're encrypted now, things of that nature, it's just an easier way for it to come in. Alright. So once it's there, you can see the databases that are present, we can actually right-click that, hit the Database Details, that's going to pull out all the tables and how, you know, what the size are, etc. And so a lot of this information you could also get from SQL Server Studio... It's just that... Not everybody might have... a management studio installed in their...on their... server, or whatever. I guess what my point is...we've made it easy for you to use whatever tool's at your disposal. Right, and what you have access to because sometimes the DBA's not going to let you to have access to the SQL Management Studio, or to be on the same box, so...just one of those things to try to look for. So if we go over to the tables, we can then see the index size, which is key, right? Like, we can actually make sure that it's being adjusted accordingly. And then we can also see the database size also, so we can see if we need to adjust accordingly and base our tables based on trap, syslog, anything that's coming into there that could be high. Right, okay. And then it tells you how much unused space is, and it tells you the row counts. So these are all information that you would get out of the SQL Management Studio. That's right. But it's more of a, like you said, it's a user-friendly, for somebody with SolarWinds, Accidental DBA also, for you to come in here and to at least know that, hey, I can access it because this also verifies your rights too. Oh, good point. Alright? And then, something that we can also do is check our polling settings, because when we're doing this, a lot of times people have additional polling engines. And so when you have those out there, a lot of people will let people actually discover and add new nodes in, but not explain that you may have multiple pollers. So it defaults to the main, you know, your main Orion Platform, right? So you're going to have, if they don't change it to, say, poller one, poller two, etc., then... I have ran into this a lot, you will have, you know, 3,000...you know, 30,000 different types of nodes that are on monitor one and you have, like, nothing on your additional polling and you find out that you have all these network discoveries that are going on and you've got to hone that in. So to check that, we can go into the web console. And we will see the settings page once again, you're familiar, and we will go down to the polling engines themselves under the Details. What that's going to allow us to see is the information of how much is the weight of the poller. So, how many elements is it actually polling if you have SAM, additional things like that on there, it's going to show you their weight count, etc. So this is going to show you the engine status of this one, and it's going to say that there's 254 elements that is on this, and it's at three percent of its same application poll rate. The hardware rate is zero percent. But here's the key one: This is the actual polling rate for your NPN. So it's at two percent from what you can tell there. And the more that you have down here, like this is the additional poller here, it tells you this same information. How much the total element count is, as well as the elements for your volumes, what you're monitoring, and it segments them out. So I like to tell people, like, this is a good way to load balance. If you go into your managed nodes and you hit the polling engines, then it'll, if you say change polling engines, it'll give you kind of an element count there. Okay. Doesn't tell you the weight of, like, how much is on there, what are they being polled for. So that's why I like to come to this page because then I can actually see the, kind of, the health of that poller. Because it may only have a thousand nodes on it, but then you find out, well, it's also doing all the SAM functions on there, and so its weight would be at 99 percent, but this may actually be at four percent polling. Yeah, that'd be very valuable information to have. Understanding how the environment is actually collecting the data because you want to know, if you feel that it might be a bottleneck, I mean, it'd be valuable to know it's the sampling rate that is actually higher. So this is a great page to... How often would you check this? Daily, weekly? So, I usually check this depending on how many people I have in there, who has rights to do discovery. If it's just me that has the discovery ability, then I'm the only one that's checking it and I will check this before I do a discovery so I know where to place and how to break up my actual discoveries themselves to bring in nodes. Okay. If there's more people that are using these, and if you've already done your initial discovery that's going out there, then I would say check it about once a month. If you're noticing issues, though, or if you are seeing that your polling engines are getting behind, this is not at 100 percent for your completion, and you're noticing some lag, then you need to, kind of, increase that, and you need to also be aware of who has discovery, because we have seen a lot of times where passwords which, I know you know I would hate it, that were getting shared, and then you see accounts that are being used and they don't understand, they're just trying to hurry up and put their stuff in and they think they're helping you, but really they could be offsetting your poller and your envisioned plan. Okay, do we have any bonus material we can share? We can show you the Data Gap Analysis tool. Data Gap Analysis. It's pretty intriguing. I like all of those words. You do? Alright, so when we go to that one, we actually go into the Program Files, SolarWinds, Orion. So we don't have to go into that...the Program Data one, that's hidden. And you look at the applications that are set up, and here is that tool. There's also a Database Response Time too, and that actually tells you the response back and forth from where this is at. So if we open this, I always run these as an administrator, because sometimes they won't actually run. And so we were talking about this one, on how the errors that we can look for, if there were the response time, the thresholds, the minimum polling completion... And what was your take? My take was, this was wonderful. Like, this is the type of analysis I'd want to see into the inner workings of the application that I'm actually trying to administer, in this case, Orion. So, when you get this information, and you can just see, so here's starting the analysis of CPU load detail. And you get an idea of if there's low polling completion. Basically, you're looking for these errors, right? That's what we're looking for, up in the upper left, it says error. This is just an easy way for me to go through and figure out, you know, "What is it that needs my attention, "if anything, right away?" Definitely. So you can select all of these, look for errors, and then figure out what the next steps are. Exactly...and the main point that I like from this is that when you start seeing those data, those gaps in data that's coming across there, and you feel like, you know, where your polling completion is actually there, you're looking at your database tables. You need to see, okay, is it gaps, is there something network-related that's going on that we're missing things? Are there things that are, you know... Is the unreachable, unreachable? Is that poller only? I need to make sure the database is okay. This is a way for you to be one-on-one with the database and say, "How fast can you get me information?" Because that's how fast we're reading it, right? And that's also how fast we're able to write it, to come across there. So when you make these executions, you're actually seeing, like, is there anything that's wrong with these tables? The interface traffic detail table is a larger table size, right? So that's going to kind of give you a little bit more information that's going to come across there. And so you can see here, we got no errors, right? Right. Basically, this, to me...this is a tool that's understanding is the problem inside my database or not? And that's an important bucket to understand. If you're going to be the administrator for Orion, and you're in charge of database maintenance, or just the database in general, this is like... Number one question, this is the bucket. Is there a problem with the database, or is there a problem somewhere else? And this data gap analysis tool is going to help you figure out if that problem is in the database. And this is what I like with the interface traffic. When you do the detail itself, it'll tell you, because we put the statistical information that's in there in the database of, like, what's its complete polling rate. And so you can see that on the side, it completes 99.99999 percent of the time, and this is the information that's been there since this has been within your database. So if you start seeing this and it's saying 50, 40, 30, things like that, you need to start looking into the information. Play around with this tool, I mean, like, this is here and it's going to help you out so that you can kind of self-diagnose, self-help yourself, and kind of be one step ahead with what's going on within your database. Alright, thanks. [Static music] I told you database maintenance was going to be exciting. Okay, you know what, it was exciting in its own way. What way is that? A weird way that only a DBA could understand or love. You're just jealous that I am America's Most Exciting DBA. Okay, that's not a real thing. And nobody wants their DBA or database to be exciting when it comes to this stuff. A boring routine is just fine. Fair point, but I'm keeping the title. Oh gosh. SolarWinds Lab, I'm Destiny Bertucci, And I'm Thomas LaRock, America's Most Exciting DBA. [Music] [Record needle drag] Still not a real thing. [Music]