Monitoring Insights for Database Managers - SolarWinds TechPod 057

Episode Transcript

Announcer: This episode of TechPod is brought to you by THWACK.com, the SolarWinds community for IT pros. That’s where you’ll find the Monitoring for Managers forum, where we look at monitoring from a manager’s point of view. Join the conversation at THWACK.com/m4m

Leon: Databases are one of the fundamental areas in IT that cause never-ending friction between the folks doing the work and people managing those critical teams. The reason for this is largely rooted in the opinion of practitioners that if management would just learn a little about the tech, they would understand why the requests were so essential. Meanwhile, leadership spends a not-insignificant amount of time wishing those same experts would learn to explain what they’re doing in plain English, or at least ways that didn’t induce immediate narcolepsy.

Leon: This episode of SolarWinds TechPod is part of our ongoing miniseries on Monitoring for Managers. I brought along my fellow Head Geek and database aficionado Kevin Kline. Together, we’re going to break down the essential things about databases and database monitoring that a manager might need to know without all the sleep-inducing techno babble. Kevin, welcome to TechPod.

Kevin: Hey, Leon.

Leon: Okay, so before we dive into the main topic, normally this is where we do some shameless self-promotion, we let our guests talk about the company they work for and the kind of work they do and the stuff that they’re doing on the side, so are you working on any special projects or you got some appearances coming up or anything that you want folks to keep an eye out for?

Kevin: Well, yes indeed. I have two conferences that I’ll be appearing at throughout the end of the year in person, actually. So, that’s a difference. SQL Live! 360 in Orlando and then also DEVintersections will be in Vegas in December of this year, so I’ll be speaking at both of those events. Then in terms of special projects a couple interesting things to mention. One is, I’m actually doing my own webinar series, a three-part series on building persuasion, credibility, and leadership skills. So if you’re interested in this topic today, I think you’d probably be pretty interested in attending those webcasts as well, and then finally, I am working actually, sadly, I am working on two simultaneous book projects right now.

Leon: Oh my gosh. Because sleep is for the weak.

Kevin: Yeah. And I am weak, I’ll tell you. So in early January, I believe “SQL in a Nutshell: Fourth Edition” will be coming out, my best-seller. Woo-hoo.

Leon: Yay.

Kevin: Yeah, and then probably some time in Q1 of next year, a new book by Apress, Professional Database Migration for Azure SQL.

Leon: Great. Okay, and thank you for not having the “…for complete morons” or anything like that. Okay, wonderful. Well, let’s see. I am not 100% sure what my end-of-year travel schedule looks like. It has been pretty much since the end of the before times that I haven’t gone anywhere. I might make it out to, I think re:Invent, but that’s still up in the air, and that’s about the only travel I know of and as far as side projects, I actually do like my sleep occasionally, so I’m not coming out with anything big soon. However, I would hawk my book that I wrote a couple years ago now which is The Four Questions Every Monitoring Engineer Is Asked. It’s good for any time of year, but it’s especially geared for the sort of springtime and the Passover season, but it’s not really a religiously based book, but you can pick it up wherever fine ebooks and manuscripts are sold.

Leon: So okay, with that out of the way, I want to dive into our main topic. Again, monitoring and database monitoring for managers, and I want to start off with looking from let’s say the management view inward toward the team. What do you in your experience, what do you think non-technical managers who lead engineers need to know so that they can be better leaders?

Kevin: Yeah, and this is a critical question, particularly today in the age of the great resignation, where so many people are either a) refusing to come back into the office and because of that they’re looking for new employment, or b) they just need a change and they’re moving on. So if you’re not implementing or acting upon some of the better practices as a non-technical manager of technical people, they’re going to be unhappy and they’re going to start looking elsewhere.

Kevin: So one of the first bits of advice I would encourage people to implement in this situation, and it’s funny because I was a technical person promoted into management over technical people and it took me years to learn this lesson. But the lesson is stop interrupting us. So, what do I mean?

Leon: I love it. It’s so true.

Kevin: Isn’t it? Right. But I learned this lesson probably in the late 90s, early 2000s, and later on, lots of reports started to come out and talk about this concept of flow, right? Where it takes … If we context switch between going to a team meeting and then we’re back to writing code or we’re back to doing some work with our networks or as an admin or something, it takes us a while to get in the flow of things and it could be five minutes, it could be twenty minutes. But it takes a little while, and then you get into this really good place where you’re getting lots of work done, you’re really in the zone. You’ve lost track of time, and then boom, comes the next interruption.

Kevin: So in my own case, what I had started to do just because my predecessor had done it this way was we held our team meetings, whatever, like if it was a review or if it was a status meeting, we always held those at 10 a.m., and so that meant you come to work, you have a cup of coffee, you chat with your colleagues for a few minutes, and then you start to work, you get about 45 minutes in and then you’re interrupted to go to the 10:00 a.m. meeting. You come back, get to your desk, get another cup of coffee, start to get into the flow, and then it’s lunchtime.

Leon: Right. You missed the part where you bag on either the project or the guest speaker or whoever was in the meeting or whatever they were talking about. So that chews up a few more, like the walk from the meeting. Also is then you get your cup of coffee, you bag on them a little bit more, and then like you said, it’s lunchtime and where did the morning go?

Kevin: Yes, exactly, and although the amount of free time that I was providing to my team was essentially the same, when I learned that I was reintroducing that need to get into flow, that time for context switching, what I realized is move the meeting in such a way that it abuts another breakpoint in the day.

Kevin: So once I realized that I was killing everybody’s productivity, all of our meetings happened either just before lunch or just after you came back from lunch. So that in the morning, yeah, we didn’t get any extra time from the 10 a.m. meetings. We still had a cumulative two and a half hours, but because it was the meeting and then to lunch, there wasn’t that context switch that happened, and everybody was able to get a little bit more work done at a higher productivity, a little bit more in the zone.

Leon: You know where I see this now? It’s really interesting that in the developer world, I’m seeing a lot more tools where the information, the monitoring, the alerting, all that stuff, is happening, is bolting into the IDE, into the developer’s environment itself, so that they never have to leave the code environment to see how the code is operating or to see that error message or to see that request, that pull request or that change request or whatever it is, it’s all happening within. Same idea, this is where I get my work done. This is where I’m the most valuable for the company. Every time you pull me away from this, you lose money, you lose features, functionality. Like you said, that flow time, so everything you can do to keep me in the place where I am the most useful is a benefit to everybody.

Kevin: Absolutely. Yeah, and in fact, previous years of our IT professional surveys that we have done, and that other organizations have done, have kind of reiterated this lesson learned. Please just give us enough time to get some work done.

Leon: Right, right. So what else? Okay, so that’s the big one. What else is on the list of things that non-technical managers ought to know?

Kevin: Non-technical managers really struggle with the issue of level of effort, and so a lot of times, if you’re a consultant, you probably experience this on a daily basis where you talk to a client about the new app you’re going to build and they want the carrying capacity of an oceangoing cargo freighter. They want the speed of a jet aircraft and they want the comfort of a limousine, and it’s like, we know a) that’s impossible, these are contradictory sort of requirements you’re asking for, but b) how long does it take to make an airliner or a cargo vessel? It takes years in some cases, depending on the kind … So one of the things that happens so often that always kind of sets my spider senses tingling is when a user or a manager says, “That shouldn’t be too hard. Let’s add a field to this form, right?”

Leon: Yeah.

Kevin: Yeah, and it’s the kind of same thing in which when someone … Just they don’t have an awareness of systems. They tend to think of life as a series of discrete objects and interpersonal interactions. The kind of thing where you meet somebody and you’re like, “God, that person must hate me,” and you’re not thinking about the fact of his wife is getting treatment for breast cancer and he just spent all morning with her, she’s exhausted, he’s exhausted. They’re hopeful, but life is hard right now, and that’s just kind of his face, right? Or you look at a form where you’re doing data entry and like, “Well let’s just put a little extra field there,” and you’re like, “Yeah, but that also means documentation. That means QA and testing. That’s got to be part of the project plans. That’s got to have all kinds of analysis in terms of downstream and possibly upstream implications.”

Kevin: So one of the things that, lessons learned from the world of agile and development is you don’t estimate in kind of a natural looking curve of effort. They use something called a Fibonacci sequence, which is you start with two numbers and then add the two previous numbers to get the next number in the sequence. So your first level of effort is one, and then one, but then your next level of effort after that is two, and one plus two makes three, then the next number is two plus three is five, then eight … And then it gets bigger and bigger very quickly.
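
Editor’s note: As a rough illustration of the estimation scale Kevin describes, here is a minimal Python sketch. The function names and the round-up rule are assumptions made for this example and do not come from the episode; teams differ in how they map a raw guess onto the scale.

```python
def fibonacci_scale(count):
    """Return the first `count` Fibonacci numbers, starting 1, 1."""
    scale = []
    a, b = 1, 1
    for _ in range(count):
        scale.append(a)
        a, b = b, a + b  # each new value is the sum of the two before it
    return scale


def round_up_to_scale(raw_estimate, scale):
    """Pick the smallest value on the scale that is >= the raw estimate.

    Hypothetical rule for this sketch: estimates never round down.
    """
    for points in scale:
        if points >= raw_estimate:
            return points
    return scale[-1]  # anything larger is capped at the top of the scale


if __name__ == "__main__":
    scale = fibonacci_scale(8)
    print(scale)                        # [1, 1, 2, 3, 5, 8, 13, 21]
    print(round_up_to_scale(4, scale))  # 5
    print(round_up_to_scale(9, scale))  # 13
```

The widening gaps between values are the point: the bigger the piece of work, the coarser the estimate is allowed to be.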

Kevin: So it’s the kind of thing that, “Oh all right. Well maybe on this particular form, adding a field is not a big deal because the database already has that field added.” But before long, “Oh, let’s add this new feature set,” where we’re going to have a whole new set of pages on our dashboard and the SaaS portal, you’re not talking about a single day. You’re talking about quite a bit of time, and so a lot of managers don’t really understand the actual level of effort and I think this is one of the reasons why it’s so important for you as a manager to try to build warm relationships in which you can trust the feedback you’re getting from your team, knowing that they’re not steering you wrong.

Leon: Right. I also think that another element of agile is the idea of story points, that … And it doesn’t have to be … Like story points is a very ephemeral concept, right? You could ask your technical team to give you a pile of Monopoly money, and then every time you ask as a manager for a feature or for a deliverable or for a project, they will charge you, they will tell you “That’s going to be another $1,000.00, hand it over,” until you realize that all the money in the pot is gone and you’ve spent it. But whatever it is, whether it’s story points or pickles or I’ve heard all sorts of creative ways of defining it. Having a way of quantifying, of as a team, you as the manager, them as the individual contributors, defining the level of effort so that you have a visual of … Because on this project, adding a form is a one dollar item. But on this form, it happens to be a $300 item.

Leon: A long time ago when I was part of a team that was installing phone systems and people said, “Well how much is this going to cost?” I said, “The first phone we install in your office is going to cost $15,373. The second phone we install is going to be $5.25 for the handset.” Well why is the first phone so expensive? Because we have to put in all the cabling and the server and the rack and the switches and all of that stuff. You can’t have even the first phone without that one.

Kevin: Right.

Leon: So it really … I’m just installing a phone, but the context matters, and working with your team to help get a trustworthy, meaning that you trust, to your point, you trust that they’re telling you accurate information, they’re not sandbagging, they’re not exaggerating because they know you’re going to ask for more later, whatever, and at the same time, they trust that you understand what you’re asking for, and that when you say, “No, I do need the $1,000 item, I really do,” that they can trust that you understand what it is that you’re asking for and you mean it. So, those are all big things.

Kevin: Just as a sidebar to that too, I find that in general, IT people tend to be really exuberantly optimistic. Even the most pessimistic or introspective and introverted IT person, when you ask them, “how long will it take to get this done?” you’ll see them cock their head and get that dreamy expression of deep concentration on it, and what they’re also doing is they’re thinking, “Well, if I was in a perfect world, and all I had to do was work on this project, huh, I bet it would only take me four hours to get that done.” So they come back and say, “four hours.” But then you’ve got to back off of that and get a little bit higher perspective and say, “Oh, well you know what? You don’t have four hours of uninterrupted time very often in your day, Mr. dotNet. So, this might take two or three or four calendar days, business days, for you to actually get around to getting all of that stuff done.”

Leon: Right.

Kevin: That sort of thing, so although we often don’t think of ourselves as being optimists, the way we just think about our processes is when we have to give an estimate, we tend to do it under the best of conditions, but we don’t live under the best of conditions.

Leon: Right.

Kevin: So you can just get some grossly, wildly inaccurate estimates because of that.
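
Editor’s note: The gap between a “four hours” estimate and “two or three or four business days” can be sketched as simple arithmetic. The figures for uninterrupted hours per day below are made-up assumptions for illustration; only the four-hour task comes from the conversation.

```python
import math


def calendar_days(focused_hours_needed, focused_hours_per_day):
    """Convert a best-case estimate into business days.

    focused_hours_per_day is the uninterrupted time a person realistically
    gets in a day (an assumed number, not a measured one).
    """
    if focused_hours_per_day <= 0:
        raise ValueError("focused_hours_per_day must be positive")
    return math.ceil(focused_hours_needed / focused_hours_per_day)


if __name__ == "__main__":
    for focus in (2.0, 1.5, 1.0):  # hypothetical focus time per day
        print(focus, "focused hours/day ->", calendar_days(4, focus), "business days")
```

With these assumed numbers, the four-hour task lands at two, three, or four business days, which is the range Kevin mentions.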

Leon: Yeah, and even when we’re estimating based on not optimal but nominal, meaning the normal conditions, just the ones that every day, your everyday isn’t actually every day. Your everyday is probably one of the better moments of the week, but there are a lot of really bad ones along the way and a lot of distractions.

Leon: The funny part is, I live on both sides of this divide. When my wife asks me how long it’s going to take to get a particular thing done or I’m working on some improvements in the house or whatever, she will routinely double and add half again to whatever timeframe I tell her. “How long is this going to take?” “I think it’s going to take me two hours to do.” “Okay, that’s like six and a half. Okay, good, all right.” Whatever it is, she does that all the time to me. Meanwhile, one of my good friends, Dennis, will tell me, “Oh, we’re going to do blah blah blah blah blah,” and I’d be like, “Okay, now Dennis, I want you to think about that again, but remember that there is gravity and oxygen. We’re doing this on earth with our normal rules of physics,” and he’ll stop and be like, “Oh, right, okay. So then we can’t do this,” and you can see him rewriting the entire project in his head because he was doing it in some bizarre fantasy world that exists only between his ears. But yeah, it is amazing.

Leon: Another thing that I would love for non-technical managers, specifically with database, databases and database monitoring, is to recognize that even if you have a really solid grasp of monitoring in a regular application, IT operation sense, you know what matters. CPU, RAM are high among them, there’s other things too. For databases, it’s an entirely different world. The things that matter for databases don’t matter in any of your other situations, and the things that matter in the other situations are really not as important for databases. So I would like you to take a minute and talk about that.

Kevin: Yeah, that’s an outstanding point, Leon. How many times have you opened Task Manager to do troubleshooting and then had to sit back a little bit and think about, “Okay, so what’s Chrystal doing on my computer at the moment? Does she have locks on those files? And how about Liz? What’s she up to?”

Leon: I have never thought that. I have never, ever thought that.

Kevin: Yeah, exactly. But that’s the name of the game with databases. Databases aren’t just an Excel spreadsheet writ large. Databases are an extremely elegant but very sophisticated and complicated way to allow thousands of people to simultaneously work on the same mission critical data to your system. Probably the systems that are responsible for millions of dollars of revenue for any sizable organization. So, huge problems are introduced into database systems not because you’ve got a CPU problem or a RAM problem, but because it’s a multi-user highly concurrent system, where you have to make sure that a transaction never disappears into the ether. A transaction handling data is either fully committed and fully recoverable at any moment, but also fully reversible as well. That’s the whole reason relational databases were invented back in the late 70s, early 80s, was mainframes would lose track of you, your data, all the time, and although we’re peasants and we don’t have millions of dollars, when you lose track of our money, we’ll grab the torches and the pitchforks and we’ll storm the castle, right?

Leon: Right.

Kevin: So people said, “Okay, we’ve got to figure out a way to make sure that this doesn’t happen.” That’s the kind of thing you have to consider in monitoring databases. CPU is real, memory shortages do happen. So you need to monitor those kind of standard bare metal infrastructural components. But you also need to watch things like all of the different kinds of wait statistics that are accumulating. Locking and blocking are massive issues. We have other kinds of issues unique to databases like deadlocks. So, there’s quite a bit in there that is unique to databases and also as part and parcel of that, how good your skills are with designing and writing for a database can determine whether you have a lot of that or a little of that, and so that’s why it’s also really important for managers to know who’s the best devs on one of our application teams who can write good SQL as opposed to great JavaScript or what have you. Because they are not identical skills.

Leon: Right, and this is where monitoring becomes not a cost center and not a warning system, but a development tool that will help you build better code because okay, let’s say that your best dev didn’t work on this particular aspect of the application, and so you’re seeing a lot of deadlocks or locking or waits associated with a particular function. You can go back and say, “Look, we’re seeing this. Let’s review the SQL that you’re writing, that you’re using in your code. Oh, I see. Let’s put that into a query plan, let’s see how it executes, and let’s start to play around with it. Let’s rearrange it,” and we realize that there are ways to make that same code run more efficiently on the database side which makes the application run more efficiently, et cetera, everybody’s happy. But you know that from monitoring, not from necessarily great code discipline or a super duper IDE that has context-based auto-complete or your best ability to copy and paste from Stack Overflow or whatever. Like monitoring really can be an improvement tool, so yeah, I love it.

Leon: The same thing goes for that connection between storage and the database which I think a lot of people overlook. They think, “Yeah, storage, storage. Like whatever, it sits there, what’s the big deal?”

Kevin: Right. Yeah. So many people tend to think of storage as just a simple commodity, and it’s actually … I do a presentation called Top 10 Mistakes that DBAs Make, and #10 on the list is thinking about storage only as volume of data that you can put your databases into. But storage is so much more than that. It is also the speed, the central keystone of performance because of latencies and things like that. It’s massively important. And then there’s all kinds of other aspects of data that typically a DBA or a dev is not going to think of because we have that one hard disk that sits there in our workstation but when you’re talking to the SAN administrator, the SAN administrator is usually thinking about their storage, which becomes your storage, in terms of cost. So you’re like, “I need a 50 gigabyte allocation on the SAN so I can deploy our new database. We only think it’s going to be 20 gigabytes by the end of the first year but I’ll have room to grow.”

Kevin: So the SAN admin says, “Hey, great, no problem,” and then they go behind the scenes and give you three disks in a RAID 5 array, which as it turns out, RAID 5 is not good at all for write-heavy workloads. It’s quite a bit slower, and it’s okay for read-heavy workloads, but for write-heavy workloads, RAID 5 writes a parity bit for every bit that it puts on the disk, so it’s doing a heck of a lot more writes to get the same amount of work done.
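
Editor’s note: To put rough numbers on why RAID 5 struggles with write-heavy workloads, here is a minimal Python sketch. It uses the commonly cited rule of thumb that a small RAID 5 write costs four disk operations (read old data, read old parity, write new data, write new parity). The per-disk IOPS figure and the workload mixes are assumptions for illustration, not numbers from the episode, and real arrays with caching will behave differently.

```python
RAID5_WRITE_PENALTY = 4  # rule-of-thumb disk operations per small logical write


def effective_iops(raw_iops, write_fraction, write_penalty):
    """Estimate the logical IOPS an array can deliver for a read/write mix.

    Each logical read costs one disk operation; each logical write costs
    `write_penalty` disk operations.
    """
    if not 0.0 <= write_fraction <= 1.0:
        raise ValueError("write_fraction must be between 0 and 1")
    cost_per_logical_io = (1.0 - write_fraction) + write_fraction * write_penalty
    return raw_iops / cost_per_logical_io


if __name__ == "__main__":
    raw = 3 * 150  # three disks at an assumed 150 IOPS each
    for label, writes in (("read-heavy (10% writes)", 0.10),
                          ("write-heavy (70% writes)", 0.70)):
        print(label, round(effective_iops(raw, writes, RAID5_WRITE_PENALTY)))
```

Under this rule of thumb, a workload that is entirely writes gets only a quarter of the array’s raw IOPS.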

Leon: Right. Now I’m going to pull the conversation back, because remember, this is for non-technical managers, and while we could spend all day talking about the write latency and the cost of 12 and things like that, the point is is that as you’re speccing out an application as a manager and the budget is coming into play, you’re probably … You have already sprung for as many CPUs as the team has asked for, you have already sprung for all the RAM. Do you really need a flash array which is three times more expensive in terms of interoffice cost or however you do your money? Isn’t RAID 5 good enough? I mean the storage is there, and you’ll have it and the reality is no, it’s not. That storage in a lot of cases is your price versus performance fulcrum, and the thing as a manager to recognize is to make an informed choice, to understand that that choice of where the storage goes isn’t just about dollars, all storage is not created equal, and it’s not only just about how fast the disks are, because again, as a consumer, I could say, “Well, I’ve got a 3600 millisecond read time, or I’ve got a…” Whatever it is.

Leon: No, it’s not just about the read time on the disk itself. It’s about again how it’s allocated, is it allocated all at once or is it expanding storage? That can take a little bit of performance hit too. Is it on flash, is it on something like that. And that you purpose, you select the right storage for the purpose. That to a manager is the important thing.

Kevin: So you mentioned the price-performance fulcrum and this is even more important today now that we have a lot of people, if they’re not already in the cloud, they are considering it, right? And adoption for using the cloud is going way up. So one of the things I’d encourage everyone to remember, this goes back to Leon’s point about monitoring isn’t just about when things break. Here’s an example. If you move to the cloud and you’re so excited that you’re going to be saving all kinds of money, remember that the cloud has built-in latencies on top of all the other latencies you get from the systems themselves. That’s because it’s on the internet far away from where you are sitting at your desk. So let’s say it’s got a … By default, it’s going to have a 30 millisecond latency, just to go across the wire. So now, you have systems that are much slower than they were before because you opted for the cheapest kind of cloud storage that they have. It’s still hard disks, let’s say for example.

Leon: You chose poorly.

Kevin: Yes, you chose poorly, and so the thing that a lot of people discover much to their chagrin about being in the cloud in a way that they often don’t think about on-premises is that you can get the performance you need but it’s a balancing act. You’ve got the fulcrum on which price is balancing on one side and performance on the other, and you can turn that dial in the cloud to up your performance, but boy oh boy, watch the dollar signs just rolling by like Bugs Bunny’s eyes in the cartoons when he hits the jackpot on a slot machine. It really gets expensive fast, so I’d encourage managers, if you ever have a part to play in setting requirements, don’t just set functional requirements for your apps. Set performance requirements as well.

Kevin: So a lot of the time today, users think if the website is slow, that’s the same as the website is down, and so spending a few extra pennies at the get-go, at the onset, could save you a lot of recrimination from your users later on who are like, “This is barely usable.” You’re like, “Look, you have to wait one second for the screen to come back and respond.” They’re like, “That’s too slow.”
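
Editor’s note: A performance requirement can be checked with back-of-the-envelope arithmetic before anything is deployed. The sketch below uses the 30 millisecond network latency and the one-second response time mentioned in the conversation; the number of database round trips per page, the per-query time, and the 1 millisecond on-premises latency are assumptions for illustration. It also assumes the queries run one after another rather than in parallel.

```python
def page_response_ms(round_trips, network_latency_ms, query_ms):
    """Time for a page that makes `round_trips` sequential database calls."""
    return round_trips * (network_latency_ms + query_ms)


def max_round_trips(budget_ms, network_latency_ms, query_ms):
    """How many sequential calls fit inside a response-time budget."""
    return budget_ms // (network_latency_ms + query_ms)


if __name__ == "__main__":
    # Hypothetical page: 25 sequential queries at 10 ms each.
    print(page_response_ms(25, 30, 10))  # 1000 (cloud, 30 ms per trip)
    print(page_response_ms(25, 1, 10))   # 275 (on-prem, assumed 1 ms per trip)
    print(max_round_trips(1000, 30, 10)) # 25 calls fit in a one-second budget
```

The same page that felt fast on-premises lands right at the one-second mark once every round trip carries the extra latency.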

Leon: It’s forever. It’s forever.

Kevin: Yes, exactly.

Leon: And I also see this as a function of who built the cloud environment in the first place. When you’re talking about on prem and I’ve talked about this before in other venues but when you’re talking about on prem, there is a team of people who have been building the environment. The network engineers who understand how data is flowing and how the firewall rules and how the traffic patterns are going were all deeply intrinsically involved, and even if it wasn’t for this project, the environment itself was architected in terms of network, and the storage likewise, the storage environment. The first three arrays are specifically for these kinds of applications and then the storage administrators have these and these permissions are on these kinds of LUNs and things like that. And then the developers come in and they explain my application is going to do X and Y and Z and therefore all the other experts, storage, network, security and so on get together and they say, “All right, we’re going to give you this security and this storage.”

Leon: You see who’s involved. With cloud native applications, a lot of times the person building the application, the developer, is the one actually going and doing the clicking in Azure or AWS, and the problem with that is that they’re using a set of menus and they’re not even sure what the options necessarily mean or the downstream impact of it. So my advice to the managers listening is that even though your developer probably understands everything they need to know about how the application is working, that doesn’t mean you shouldn’t get other experts involved to just give a look over on what is the security you put on that S3 bucket? Oh, it’s wide open. Yeah, we won’t be doing that. That’s a bad idea, or whatever it is. Have those experts, the network engineers. I know it’s not your on-prem network, but they still have input to how your AWS-based application should be built and architected from a network standpoint.

Leon: That takes us to another storage-based topic, which is backups. Let’s talk about those for a little bit, and I’m going to start off with a quote from our friend Karen Lopez. She’s been on TechPod and she’s been on other SolarWinds shows frequently and she recently tweeted out something that I think is just really, really well put and it’s concise. “You don’t need backups. You need restores.” And that’s the emphasis I think that we both want to make is that people put an unnecessary amount of importance on the back … “Have we been doing our backups lately? Let’s make sure we get those backups. Let’s grandfather them in, let’s make sure that we put them over here.” It’s not the backups you care about, it’s the can you restore it later. That’s the part you actually want and that’s the part that too many people don’t even think about at all until it’s too late.

Kevin: That’s so very true, and I experienced this firsthand back in the day. SQL Server ironically, and a lot of databases are this way. They will let you back up a database that is corrupted and then you pull out what appears to be a perfectly normal looking backup file, you go to reapply it, to restore it, and SQL Server or the other database platform will say, “Sorry, I can’t restore this corrupted file. Go back to your last good backup.” You’re like, “Well when was that? When did this problem happen? I don’t know.” The first time it ever happened to me was in the mid 90s, and I had to go back a couple days to find one that worked, and oh, it was just a massive heartache. So it was at that time that we actually started to build in a practice that you have to first do all your preventative maintenance checks, corruption checks and things. If it passes, back it up, but if it doesn’t pass, there’s no sense in backing it up because you can’t restore it.

Kevin: The next thing I learned was because the ability to restore is what’s critical, not the backup itself, we actually started to build in restore tests as part of our process. So on a regular basis, we would not only take the backups and take a look at the backups, make sure they work, but we would restore those to secondary servers and make sure everything worked fine. This was a big improvement for us and one of the surprising benefits of that is that the next time we actually had a real failure, the kind that gets the CEO alarmed and the CIO is pacing in your cube, which we all love that, right?

Leon: Oh yeah. It helps me work, I’m so much more effective when I have a senior executive looking over my shoulder.

Kevin: Exactly.

Leon: Yeah.

Kevin: So that enabled me to actually say to them, “I believe that this will be 45 minutes and we’ll be done.” It was actually the deputy CIO in my case and his name was David and he said, “How do you reckon that?” I said, “Well, I did a restore test over the weekend and it took us 43 minutes. A little bit of data growth, so let’s say 45 minutes.” He was like … He was surprised, but he was impressed. He’s like, “Okay. I’ll check back in 45 minutes,” and he left my cube, and I thought at that moment, “I’m the most powerful DBA in the world.”
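
Editor’s note: The process Kevin describes (check for corruption, then back up, then prove the backup restores and note how long it took) can be sketched in Python. Everything named here is hypothetical: the three callables stand in for whatever your database platform provides, and the linear scaling with a 5% data-growth figure is an assumption chosen to match the 43 and 45 minute numbers in his story.

```python
import time


def verified_backup(check_integrity, take_backup, restore_to_standby,
                    clock=time.monotonic):
    """Run check -> backup -> restore test.

    Returns the measured restore time in seconds, or None when the
    integrity check fails (in which case no backup is taken).
    """
    if not check_integrity():
        return None  # no sense backing up a database that cannot be restored
    backup = take_backup()
    started = clock()
    restore_to_standby(backup)  # the actual proof that the backup is usable
    return clock() - started


def estimate_restore_minutes(last_measured_minutes, growth_fraction):
    """Scale the last measured restore time by data growth since the test.

    Assumes restore time grows linearly with data size.
    """
    return last_measured_minutes * (1 + growth_fraction)


if __name__ == "__main__":
    # Stand-in callables; a real job would call the database tooling here.
    seconds = verified_backup(lambda: True, lambda: "backup-file",
                              lambda backup: None)
    print("restore test took", seconds, "seconds")
    print(round(estimate_restore_minutes(43, 0.05)), "minutes")  # 45 minutes
```

The useful by-product is the timing: a recent restore test is what lets you quote a recovery time with confidence.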

Leon: But it does bring up an important point for managers to understand about restores, because again, we’re not talking about backups, we’re talking about the restores, that’s the part you want, which is RTO and RPO, and those are the two things that again as a manager you want to, it’s not just a matter of saying that we need to back things up or we need to be able to restore, it’s about how fast and how much. Could you just take a minute though and dig into that RTO and RPO concept?

Kevin: Yeah, yeah, and this is even more important now for those of you listening who may be moving to the cloud and I’ll explain that in just a minute. So RPO is recovery point objective and RTO is recovery time objective. So basically what that means is recovery point is how much data are you willing to lose? So let’s say for example you’ve got a conference scheduling system that is running on PostgreSQL, and it goes down. Well what happens if you, how bad is it if you have to re-key the last hour’s worth of data into that system?

Leon: Yeah.

Kevin: Yeah, exactly. If people –

Leon: Yeah, it’s not happening.

Kevin: They’re like, “You mean the conference scheduling system went down for the whole day? Whoop-dee-doo.” So you wouldn’t mind if you lost an hour or two or even multiple hours. On the other hand, if this database system actually generates huge amounts of revenue, you might not even want it to go down for two minutes. So in that case, with this very, very valuable system, you would have not only a recovery point objective, you would also want to start factoring in high availability, so that you have a secondary standby server ready to go at any moment.

Kevin: So that’s how much data are you willing to do without if the system goes down. Recovery time objective is how much time are you willing to allow the team to have to work on it to get it back up and running. So let’s say, like our standard back when I ran the DBA team was that we wouldn’t lose more than 15 minutes of data and it wouldn’t take more than 15 minutes for us to restore if the system crashed. So that is the recovery point objective of 15 minutes, and the recovery time objective of taking no more than 15 minutes to get back up and running.
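As a rough illustration of the two objectives Kevin describes, here is a minimal Python sketch that checks one outage against an RPO and an RTO. The 15-minute targets come from his example; the `Objectives` class, the `check_incident` function, and the timestamps are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Objectives:
    rpo: timedelta  # maximum tolerable data loss
    rto: timedelta  # maximum tolerable time to get back up and running


def check_incident(last_backup, crashed_at, restored_at, objectives):
    """Return (data_loss, downtime, rpo_met, rto_met) for one outage."""
    data_loss = crashed_at - last_backup  # work done since the last recoverable point
    downtime = restored_at - crashed_at   # how long the system was unavailable
    return (data_loss, downtime,
            data_loss <= objectives.rpo,
            downtime <= objectives.rto)


# The standard from the conversation: lose at most 15 minutes, be back within 15 minutes.
standard = Objectives(rpo=timedelta(minutes=15), rto=timedelta(minutes=15))

# Hypothetical outage: last log backup 15:20, crash 15:30, service back 15:42.
loss, down, rpo_ok, rto_ok = check_incident(
    datetime(2021, 6, 1, 15, 20),
    datetime(2021, 6, 1, 15, 30),
    datetime(2021, 6, 1, 15, 42),
    standard,
)
print(f"data loss {loss}, downtime {down}, RPO met: {rpo_ok}, RTO met: {rto_ok}")
```

The point of keeping the two checks separate is that they fail independently: a fast restore from an old backup meets the RTO and misses the RPO, and the reverse is just as possible.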

Kevin: Again, this is the kind of thing that if you don’t regularly test it, you won’t know if you can meet those requirements. Now one of the reasons it’s so important to think about this a lot and plan for this if you’re doing a database migration to the cloud is that one of the ways that we as database people would enable this to happen within those time limits is we would do things like … We’d have the full database backup from yesterday sitting on the server, and then on top of that, we would take let’s say every 15 minutes, we would take a transaction log backup, so that we could … If the server went down at 3:30 in the afternoon, we could restore yesterday’s backup right from the local disk, then all of those transaction log backups up to the most recent 15-minute increment before the crash, and we’d be back in business in no time at all. But in the cloud, you have to pay for all that extra storage.
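The restore sequence described here (one full backup, then every transaction log backup up to the last one before the crash) can be sketched in a few lines of Python. This is a simplified model, not a real restore tool: it assumes log backups run on a fixed interval starting from the full backup and that every one is intact. The function name `restore_plan` and the timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def restore_plan(full_backup_at, log_interval, crashed_at):
    """List the backups to restore, in order, plus the data lost.

    Assumes log backups run on a fixed interval starting at the full backup
    and that every one of them is usable.
    """
    logs = []
    t = full_backup_at + log_interval
    while t <= crashed_at:
        logs.append(t)
        t += log_interval
    recovered_to = logs[-1] if logs else full_backup_at
    return full_backup_at, logs, crashed_at - recovered_to


full, logs, lost = restore_plan(
    full_backup_at=datetime(2021, 5, 31, 22, 0),  # hypothetical: last night's full backup
    log_interval=timedelta(minutes=15),
    crashed_at=datetime(2021, 6, 1, 15, 37),      # hypothetical crash time
)
print(f"restore full backup from {full}, then {len(logs)} log backups")
print(f"last log backup: {logs[-1]}, data lost: {lost}")
```

With a 15-minute log interval the data lost can never exceed 15 minutes in this model, which is how the backup schedule backs up the recovery point objective.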

Kevin: So we’ve got a situation where in some of our servers, we’d keep two or three days of backups even in case there was a logical failure. Like they had rolled out some new changes to the software and that’s what messed things up, and so we have to get around the logical failure. So we’d go back two days before the deployment. Well, if you put all of those backups onto your cloud services, you’re paying three times as much as you were before for storage.
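The "three times as much" arithmetic is simple enough to show directly. The sketch below assumes one full backup per day, ignores transaction log backups, and uses a made-up backup size and a made-up storage price; real cloud pricing varies by provider and tier.

```python
def monthly_backup_cost(backup_size_gb, days_retained, price_per_gb_month):
    """Cost of keeping one full backup per day for `days_retained` days."""
    return backup_size_gb * days_retained * price_per_gb_month


SIZE_GB = 500  # hypothetical full backup size
PRICE = 0.02   # hypothetical price per GB per month

one_day = monthly_backup_cost(SIZE_GB, 1, PRICE)
three_days = monthly_backup_cost(SIZE_GB, 3, PRICE)
print(f"1 day: ${one_day:.2f}/month, 3 days: ${three_days:.2f}/month, "
      f"ratio {three_days / one_day:.0f}x")
```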

Leon: Right.

Kevin: So you –

Leon: It may be worth it. It might be worth it, but you can’t assume it’s worth it.

Kevin: That’s right.

Leon: And as a manager, you have to recognize what it is that you’re picking. You can have a really frugal dev or DBA who says, “Oh, there’s no way they’d want to pay for that.” But leadership may say, “Oh no no no no. We absolutely are willing. Those are dollars well spent.” It could be. But if you don’t know what you’re picking …

Kevin: That’s right. Don’t go into it in ignorance.

Leon: Right, exactly.

Kevin: Those two things are very important, and in fact, now you’ll see in … There’s for example a really outstanding excellent set of PowerShell scripts for SQL Server DBAs, it’s called dbatools.io, and its creator has been on some of our other sessions in the past, Chrissy LeMaire, and as one of the modules in there, in PowerShell, it does exactly what took us months to figure out of pulling a backup, restoring it elsewhere on another SQL Server, and giving you the time that it takes to do that. So now you can build kind of a trend analysis of how long it takes for you.

Kevin: So this is now kind of a well-accepted industry best practice. But on top of that, there’s kind of an overall framework for RPO and RTO, and that is set within your SLAs, your service level authorization or service level –

Leon: Agreements.

Kevin: Agreement.

Leon: Mm-hmm.

Kevin: And there are even some people now who are going a step beyond that, SLO, and that I don’t even know what that means.

Leon: Service level objective maybe?

Kevin: Objective, that’s what it is. Yeah, exactly. So the SLA is kind of the letter of the, the contract to the letter, and the SLO is the contract to the premise of what you want to achieve. We as a business have these goals and objectives for the SLO. For the SLA though, just to illustrate the point that it’s kind of to the letter of the agreement, the first time I ever had to sign one of those, I was very, very worried. Because we didn’t really know and we didn’t have any empirical evidence to support that we could meet the needs that were laid out in the SLA, and so we cheated. So SQL Server –

Leon: Like you do.

Kevin: Yeah. Like everybody does. It was mutually satisfactory cheating but what we did is SQL Server … And many database platforms allow you to create different … Kind of a hierarchy of objects inside of your SQL Server database, and for example you might have one filegroup that’s called primary and another filegroup that’s called secondary, and you put some of the tables of your database into the secondary filegroup, knowing that that is actually an older SAN that is hard disk based rather than the newer one that mostly is going to give you different kinds of really fast cache. But everything in that secondary filegroup is kind of lukewarm data or even cold data. It’s the financials from two years ago that yeah, that’s used in some trending, but you don’t really query it regularly at all. So we kind of did the equivalent of that in which we put the tables and the indexes that were used most often into one group that would come up right away, and then those that were used less often, usually like for month-end reports and things like that, that went into another group that took a little longer to restore outside of the balance of the SLA. So that’s how we were able to get things moving to satisfy their needs. But –

Leon: Very nice.
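The idea of bringing the most-used data online first and the rarely used data later can be modeled very simply. The Python sketch below is only a toy timeline, not how SQL Server actually performs a staged restore; the restore times, the SLA value, and the function name `availability_timeline` are all hypothetical.

```python
def availability_timeline(filegroups):
    """Restore hottest data first; return (name, minutes_until_online) in restore order.

    `filegroups` is a list of (name, restore_minutes, priority) tuples;
    a lower priority number means the group is restored earlier.
    """
    timeline = []
    elapsed = 0
    for name, minutes, _priority in sorted(filegroups, key=lambda fg: fg[2]):
        elapsed += minutes
        timeline.append((name, elapsed))
    return timeline


# Hypothetical restore times: hot tables and indexes are small, month-end data is large.
groups = [
    ("secondary (month-end reports, old financials)", 35, 2),
    ("primary (most-used tables and indexes)", 10, 1),
]
SLA_MINUTES = 15  # hypothetical SLA for the core system

timeline = availability_timeline(groups)
for name, online_at in timeline:
    status = "within SLA" if online_at <= SLA_MINUTES else "after SLA window"
    print(f"{name}: online at {online_at} min ({status})")
```

Restoring everything in one pass would take 45 minutes in this example; splitting it lets the part the business actually uses day to day come back in 10.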

Kevin: Like you said, at the time we were doing that, there weren’t really many monitoring products or tools out there, and so that’s when I as the DBA began to think about, “You know, what would really be great would be to know empirically based on my performance data, can I meet these SLAs? What is normal behavior?” Because how can I truly figure this to be abnormal if I don’t know what normal is?
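"Knowing what normal is" usually starts with a baseline. A minimal Python sketch, assuming a short history of one metric and using a mean-plus-standard-deviation rule as a stand-in for whatever a real monitoring product does; the sample values, the three-sigma threshold, and the function names are hypothetical.

```python
from statistics import mean, stdev


def build_baseline(samples):
    """Summarize 'normal' as the mean and standard deviation of past samples."""
    return mean(samples), stdev(samples)


def is_abnormal(value, baseline, threshold=3.0):
    """Flag a value more than `threshold` standard deviations from the mean."""
    avg, spread = baseline
    if spread == 0:
        return value != avg
    return abs(value - avg) / spread > threshold


# Hypothetical query response times in milliseconds from an ordinary week.
history_ms = [120, 118, 125, 122, 119, 121, 124, 117]
baseline = build_baseline(history_ms)

for reading in (123, 180):
    print(reading, "abnormal" if is_abnormal(reading, baseline) else "normal")
```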

Leon: Right, right. All right, so I want to pivot away from the things that we wish managers would know about. I think that covers a few really good areas and talk about some of the struggles that we’ve heard managers, specifically non-technical managers who are working with database monitoring folks or database experts, what are some of the struggles you’ve heard out in the field from those folks? Now you mentioned one earlier and I’m just going to bring it back again, how do I know that my team is telling me the truth? How do I know if they’re BSing me or not? So what do you say to that?

Kevin: Yeah, this is where I think good team dynamics, friendly relationships, really, really come into play, and sometimes people are stressed out and they’re going to fudge the numbers, but if you are a cohesive team, that helps a lot. People are generally going to try to be as honest with you as they can. Of course they may be overly optimistic or that sort of thing, but the other thing that in talking to a lot of leaders who have gotten to a place where they feel really comfortable with what their team tells them, they do a couple things. One is they kind of … They have some element of play together. There’s something about the way they’re working together in which they enjoy each other’s company. So I’ve heard of, in fact I recently talked to one team leader who during the pandemic they weren’t able to go in person to any kind of user groups or things that they had previously done that were enjoyable, so they decided as a group, they were still working in the office and they decided as a group that they would have kind of a lunch and learn once a week, where they’d pull a video from a big conference or from Microsoft Learning or Pluralsight and they’d watch it together.

Kevin: Then after a while, it turned into something fun because it got from where it was just informative to it got to be a little bit fun, where one of the team members decided to buy a giant two buckets of Kentucky Fried Chicken for everybody, and then it became, “Hey, hey, next Tuesday is the chicken train,” and they’d send messages around. They were just having fun with this idea of, “Hey, that chicken was great.” Then it was like, “Oh, now I’m looking forward to this time together with the team.” What developed there was this kind of crosstalk, where one of the team members was especially good at SQL and another couple of them were really good at the different coding languages they were using in the IDE. So sometimes, they’d lean across the table as they watched a video with finger licking goodness all over their fingers and one of them would say, “Hey, you know, I don’t think that would work in our environment because we do XYZ,” and the other person might say, “Oh, well counterpoint, if we changed how we’re recording this in Microsoft Team Foundation Server, we might be able to get away with it.”

Kevin: So what happened was the manager couldn’t tell if the person was BSing them on some highly technical point of Angular or Node.js, but the way the team worked together, everybody was kind of cooperatively coming to conclusions and coming to consensus that way. So if the manager said, “Can we get this done by next Tuesday for this emergency request?” The team together would say, “Well, no, I don’t think so,” and then somebody else on the other side of the table would say, “Well, yeah we could, but we would have to not do any work on this and such part that we’ve been tasked with.” The team leader could say, “Oh you know what? I can make that happen. Let me talk to the boss further up the chain.”

Leon: Right. I mean like all relationships, a good marriage, et cetera, et cetera, communication is the key. I would also say that what you’re talking about is a level of trust. People talk about trust all the time. You have to trust them, you have to trust them, but there comes a point where functionally what does that mean, do I trust them? Do I trust them to hold me dangling over a building? I mean that’s not really part of my job description typically. So what does that mean, and you hit upon something which was that within a team, there is trust not only that people’s strengths will be acknowledged, but also that their gaps, if you want to call them weaknesses, will be respected. Meaning that if you can trust that you’re able to say, “I’m actually not good at that.” And that it will be respected if somebody says, “I’m not good at that and I don’t want to get good at that. That’s not something that I’m interested in.” Versus somebody who says, “I’m really not good at that. I’d like to improve and I’d love all the resources you can give me to get better at it.”

Leon: Either one of those statements should be perfectly okay for somebody on the team to say is that, “I really don’t want to ever touch the financial systems. It’s just not my thing. I know it’s a database, I know it is. But there’s enough particularities about it that I’m not good,” and trust within the team is that people can bring their whole selves and won’t be dinged for it in some way along the way, that somebody will say, “I’m not going to keep assigning these to you because I know you hate them,” or whatever it is. I think that that’s important, and having an environment where those things occur means that a manager can trust that the team isn’t going to BS them to cover their own butt or to cover their own perceived inadequacies.

Kevin: That is a fantastic insight and I am just really impressed with that insight to be honest with you because that is in fact what I’ve seen great managers do. I’ve done that in my own management positions where I had to manage teams and I say had to because I don’t love it, I am good at it, I can succeed as a manager, but that’s not my favorite thing to do. But I’ll tell you what, you really encapsulated that really well, Leon. The recognition that some people are great at one thing and not so good at another and I as your teammate want to see you succeed. So if I’m good at something and you’re not, I’ll step up and I know that you would do the same for me. So that is a really winning combination.

Leon: Right, exactly, and it lets us move on to another point I think which is how do you keep a team from filtering out the bad news? I’ve heard managers ask this is I want to foster an environment of openness so that the team will tell me. I’d rather know the bad news upfront first, but I think as a manager, there’s certain things you got to do to … Sorry to use a cliché, but to walk the walk. Because you’re asking a lot from a team to constantly bring you the I just screwed up or this isn’t going to work or we’re not going to make our deadline. There’s some foundation that you have to put down before experienced IT professionals who’ve been through the wringer a couple of times are just going to skip lightly into your room and say, “I am just letting you know boss that we’re going to fail.” Like that’s not happening, so what’s your take on that?

Kevin: Yeah. So true, and I find that the farther up the organization you go, the more prevalent this behavior is, and having worked in some very large enterprises that had thousands of employees working on administration and dev kinds of things, usually in my experience, it’s not at the small team level that you find people not willing to share the bad news. So let’s say it’s you and me and a few developers sitting down to discuss status. It wouldn’t be I think uncommon or uncomfortable for some of our teammates to say, “You know, they told us we got to get this done by the end of the month. That’s only a few days away, and we are not going to hit that deadline.” You and I would look at each other and say, “Yeah, this is significant. Thanks for sharing,” and then with the team, we would talk through what can we do to minimize that impact as much as we can.

Kevin: Where it really starts to break down is when that team leader or small group leader then goes and talks to their manager who has a position title of director or something kind of fancy. People get … I don’t know what it is about the transition from the people in the trenches to further up into the executive ranks, but it often breaks down at that point, and sometimes, it’s because the manager who manages a small team will soften the language because they actually want to protect maybe that one person on the team who’s responsible, because if you can identify one person, you can fire one person, right? So they do it from a position of intending good, but what happens is that manager doesn’t really give the full impact of the bad news to the director or the senior manager or what have you. So the senior manager gets the message that we might miss the deadline at the end of the month, instead of no chance we’re going to hit this deadline. Then the senior manager takes that up a level and says, “We’ve got a couple parts of the project that are struggling a little bit but we think we can catch up, and you know what? If we just cut out two weeks of testing on this project plan, then we can get it done in time.”

Kevin: When in fact of course we all know that when you test, you will probably find something that is wrong that needs to be fixed. So cutting out two weeks of testing is based on the ridiculously optimistic, foolishly optimistic idea that there’s nothing we got to repair. So it’s kind of like a snowball rolling downhill, except it’s going the other direction, up to the most senior executives, where one small little change in the emphasis at the lowest level becomes a medium-sized change, becomes a complete untruth by the time you move two or three levels up.

Leon: Right.

Kevin: So that’s why I always try to give bad news not in terms of the project or the people in the project, I try to give bad news in terms of the business outcomes.

Leon: Mm-hmm. Right.

Kevin: And this is something I know you know a lot about as well, Leon. So we could probably have a whole separate conversation about that.

Leon: Yep. We really could. I want to emphasize though that part of the reason, as an individual contributor, that I resist giving bad news, certainly in a skip level or higher when I’m talking to somebody, I actually looked at one of the C-level executives once when they said, “How’s it going? It seems like there’s a problem but I can’t tell what it is.” He was looking for me to give him insight and I said, “I cannot tell you because you are a cannon, not a scalpel.”

Kevin: Right.

Leon: He’s like, “What do you mean?” I say, “From the level where you’re at, you will want to fix something, but the level of granularity you have is to blow an entire wall out of the building. That is as precise as you can be. You cannot be surgical about anything because you walk into a room and the decisions that you make are grand and sweeping and all that. So I can’t tell you because it won’t fix it, it will make it worse. But I really want to fix this. I understand. I will deal with it. Not because I’m a lone wolf and not because I’m just going to tough it out or whatever, but because telling you actually will make the problem worse,” and I think that a lot of individual contributors have experienced that and so they avoid saying anything that they think is going to again, snowball out of control and end up making the situation worse.

Leon: But to your point, putting things in business terms, it’s not assigning blame, it’s this is the business impact of the current situation. Well I don’t want that, and if you continue to drive back to the business and say, “With another $300,000.00, with another two weeks, with another five people, we would be able to avoid this outcome,” or whatever it is. You avoid them trying to implement really weird fixes or whip the troops up or what if I … It’s the old joke about the student and the kung fu master, and the kung fu master says, “It will take you a year to master this form,” and the student says, “What if I work twice as hard?” And the master says, “It will take you two years.” “Well what if I work three times as hard?” “It will take you five years.” Sometimes that’s not the right lever to pull, so … Love it.

Kevin: Yeah, very true, very true.

Leon: So another thing I’ve heard managers complain about is picking between options or opinions. We know that IT people come in all shapes and sizes and their personalities can vary wildly and even vary in context. You can have somebody who is very soft-spoken, almost what you would call on the introverted scale in most cases, but you bring up their pet technology or their pet area or the one that they revile the most and all of a sudden out comes this raging tiger and you think, “Wow. Sarah is usually so quiet and whatever and she must feel really strongly about this, so I’m going to do whatever she says about this.”

Kevin: Right.

Leon: Loud does not mean correct. Loud doesn’t necessarily mean that it’s the thing you want to do, and yet you have managers who have no way of navigating these really interesting IT personalities and passions that come out. So what’s a manager to do when again they’re not technical, they can’t pick an argument, they’ve got two differing opinions. The joke, you have one data center, two technicians and four opinions. So they’ve got these massive number of opinions that they have to pick through, no technical basis to pick it on merit. What’s a manager supposed to do?

Kevin: Again, that’s a great insight. This is a big issue and if you’ve been trained at all or if you naturally have a kinder heart, then you want your team to come to consensus. We know that managers who function as coaches more than as dictators, they’re a much more effective manager, and so that means you are soliciting opinions from the professionals on your team. They are professionals after all, and IT people love being respected and acknowledged for their intelligence and their thought processes and things like that.

Kevin: So when you ask for opinions, you actually are building kind of a positive feedback loop that makes the people that you’re asking an opinion feel good, and the fact that you’re asking for consensus also makes the team feel good and valued, so that’s great. One of the things though that happens in that situation is like you said, usually the loudest and most outspoken person wins, and that’s because a lot of people who are more introverted and there are more introverts in IT than other personality types, they’re more likely to say, “It’s not that big of a deal and what they propose works okay. Mine is a better idea, but it’s not worth fighting for.”

Kevin: Well, there’s some things you can do to really help make this more balanced. There’s some books recently that have come out that have kind of gone into this. There are also some of the different kinds of psychometric tests, like Myers-Briggs, and I’m not saying predicate your entire approach to life based on those kinds of things, but seek advice where it seems to be applicable, and so one of the things that you can do to really improve that consensus building process is that typically, introverted people don’t like to make spot decisions. So if we had a regular standing meeting on Thursdays and Leon walked in and said, “Okay people, we’ve been bouncing around the idea of maybe we can find a second network provider so that we meet this corporate goal of having redundant network trunks into the data center. We thought about it, we talked about it. Let’s decide today which one we’re going to do.” You basically have ensured that most of your introverts are not really going to contribute to that conversation. So for example, when I was the president of the Professional Association for SQL Server, I did that a couple times with the board of directors, I said, “Let’s decide today. We only meet in-person once a quarter. Let’s figure this out.” Well what happened was two or three people ran the conversation and it was done, and I was not happy with that outcome.

Kevin: So we regrouped and what we decided to do going forward was that anything that has to be decided on is never going to be a spot decision. “Tomorrow, we’re going to make the decision, and today, I’m telling you, we’re going to make that decision tomorrow, here’s all the information we’ve collected about it. Now digest that, think about it, maybe write out what you would like to see happen in this situation, and everybody has to say something. If you say nothing, that assumes not agreement, that assumes that you disagree, and so if you have nothing really to say, then that’s what you’re going to say.” So when it comes around to Sheila and Sheila is now asked to … Which of these different plans do you support, she needs to either say I support one or the other or my own proposal or I don’t really have an opinion on this. But it’s not okay to just sit back and not say anything, and when people who are introverted and not prone to speaking up know that they have to at least say a few words some time in the near future, tomorrow, the day after tomorrow, they’re going to participate in that conversation.

Kevin: So there was an old commercial from back in my younger days, and I think yours too, Leon. The commercial was when such and such talks –

Leon: EF Hutton.

Kevin: EF Hutton, that’s the one.

Leon: Yep. EF Hutton. When EF Hutton talks, people listen. So a couple things I want to say. First of all, whether you are a Myers-Briggs person or a StrengthsFinder person or some other as you said psychometric test, the first thing is to understand that as a manager, you need to have a way of understanding people and personalities that are different from your experiences. That they exist and they process information and they interact with the world differently. And the second thing on the other side is to recognize that any of those tests or assessments are frameworks. They’re not solutions, they don’t give you the answer. They simply give you a context in which to process the experiences you’re having differently, to recognize that to your point Sheila isn’t quiet or shy, she takes a different approach to processing information and then expressing an opinion on it and that it is valid, but it’s not the same as Tony, who’s right out there, out and loud, and in your face about things, which also isn’t wrong.

Leon: I want to recommend for those people who happen to be on the more outgoing side of the world, that would be me, is the book Quiet by Susan Cain is a great way to understand not just why or how people are again, we’ll say introverted, but also the benefits, the strengths that folks who tend to think a little slower and a little bit more carefully about things bring to the organization. So that’s important also.

Leon: Kevin, we can go on about this I think literally all day. I think we have another six hours of material here to talk about and I believe that we will get to that in a future episode at some point, but for right now I think this is enough to be getting along with. So first of all, I just want to thank you for joining me for a really incredible conversation.

Kevin: Great fun. Really enjoyed it Leon, thank you.

Leon: And thank you to everyone else who was listening.

Kevin: Folks, we know that you have a choice of podcasts. There’s so much content out there to consume, so we want to make a special note that we appreciate that you’re spending the time with us, listening, and hopefully contributing in the future. I know that I always love to hear comments from those of you who are listening and on behalf of Leon, I would say we would both love to have a conversation with you on Twitter, LinkedIn, other kinds of social media or even via email if you like. So thank you for taking some time out of your day to listen to ours.

Leon: Absolutely, and you can find Kevin on the socials, you can find him on Twitter at @kekline, that’s K-L-I-N-E, and also on LinkedIn, and of course on the SolarWinds community THWACK.com, and you can find all that in the show notes. I would like to remind everyone if you like this episode, we’d love it if you would follow, rate, review the podcast, smash the like button, whatever you’re going to do. For SolarWinds TechPod, I’m Leon Adato.

Kevin: And I’m Kevin Kline. Until next time. Thank you everyone for listening.