In this episode, Head Geeks explain the ARRR Pillars (Availability, Reliability, Resiliency, and Recoverability) and how applying them in practice can help keep your company safe. We’ll cover patch reporting and analytics, cloud logging best practices, and disaster recovery using backup best practices.
Welcome to SolarWinds Lab. I’m Kong Yang.
I’m Patrick Hubbard.
And I’m Chris Groot.
Welcome, Chris, and on today’s episode, we’re going to talk about how to keep your company safe with the ARRR Pillars, ARRR as in availability, reliability, resiliency, and recoverability.
And it is great to have you with us, Chris. I’ve been trying to get you on Lab for a long time. I’m convinced that we can get you on THWACKcamp this year. Yeah, stay tuned for that but, Chris, you’re an expert in backup and recovery, you’ve been doing this for a really long time.
I have, yeah. So, my role here is product management with SolarWinds Backup, and it’s great that we’re talking about this subject because backup is the last stop on the train when all else goes wrong, so happy to participate today.
Perfect, it is a key piece of keeping your company safe. But, Patrick, we were discussing keeping your company safe and all the connotations with that.
Yeah, well, I think we tend to get preoccupied with the perimeter, right? With the firewall, or maybe, we’re doing a regulatory compliance, or something else where we think about safe in terms of what the legal team would care about, “Are we safe?” But if you think about what we’re really trying to do, what all of us in IT are trying to do, is actually keep our businesses safe, right? Beyond that, it’s like, “Can we continue to transact business in the event of a failure?” Not just something goes down, but let’s say we’ve lost a critical server, can we restore that? So, it’s that balance between sort of, the careful, safety-assurance processes that we put in place for IT at the beginning versus the ones about like, recoverability SLAs. Like, “How long will it take us to actually do a restore? Are we testing restores? Are we keeping track of all of our logging? Do we have what we need to be able to fix problems and keep the business running, and safe?” Not just, “Are we, do we have a safe security posture?”
And that’s perfect because there’s so many definitions out there for availability, reliability, resiliency, and recoverability. So, let’s go ahead and define some of those. So, availability, Chris.
It really comes down to uptime. Making sure that the transactions of the business can continue to flow, because there’s risk if those transactions stop. It means that you are either not making money or you’re spending a lot more money than you need to. And we’ve seen many examples like that. Like at the airport, when an airline goes down. If people are lined up and they have to start doing paper transactions, that’s where a lot more money is being spent.
Although, you could argue that that debate that we get into all the time about the difference between sort of, availability and resiliency, right? You can definitely say, “Our TPF facility that’s getting people on and off planes is down, and we’ve gone to using paper.” But if the business is still getting people on planes, and it’s a horrible example, and I certainly never want to be there in the NOC when this actually happened.
But yeah, but if I can at least continue to limp forward and keep the business running, and then create enough air cover to go back and actually remediate the problem, and then restore my systems to actually get automation working again. That’s still something that we should be thinking about. And that is safety, that is the difference between running in a degraded state, but keeping the business moving forward, versus actually being hard-down.
Exactly, you know, you mentioned uptime, you mentioned resiliency in there, and also business needs. Availability, most of you know that as nines of availability, right? The uptime of your system, of your application, and so forth. Resiliency, how well do you handle those threats, mitigate those failures, and still stay up. And then reliability, how consistently you deliver on that.
Staying resilient while you scale to the resources you need and still meet those SLAs. But as you walk that path of availability, reliability, resiliency, you get into a space where, if a disaster or a failure comes, that recoverability piece is also key: being able to get back to a good, known state.
Absolutely, and understanding what steps are in place to get you there, what the business can withstand while it’s down, in terms of how long you can be down for, and how much you can afford to lose.
Well, and that’s an interesting point, the word, lose, there, when it comes to budget, right? I think we’ve been trained over decades to focus on availability, how many nines do we have? But nines are expensive, I mean, every time you add another one, it’s almost exponentially more expensive, you’re getting into a lot of complex clustering, and horizontal scale, and load balance, and then, versus being able to assure that your business is resilient. Which, in a lot of cases, is less expensive. If it’s the difference between, you know that, we’ve given them that metric, right? That says, “I want you to set my bonus based on an SLA of nines availability,” versus what the business actually cares about, which is, “Can you assure me that we’re not going to lose transactional data?” Right? “Can you promise me that in the event of a significant failure, we can be back online in two minutes?” Because then they can assess the real business cost of that failure and, a lot of times, that’s something that’s way more valuable to the business. And for us, as IT professionals, in a lot of cases, it’s less expensive to deliver a resilient infrastructure than it is to focus on lots and lots of nines of availability.
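As a back-of-the-envelope illustration of what those nines actually buy you, here’s a quick sketch, not from the episode, that converts an availability percentage into allowed downtime per year:

```python
def downtime_minutes_per_year(availability_pct):
    """Allowed downtime, in minutes per year, for a given availability percentage."""
    minutes_per_year = 365 * 24 * 60  # 525,600 minutes in a non-leap year
    return minutes_per_year * (1 - availability_pct / 100)

# Each added nine cuts allowed downtime by a factor of ten,
# while the cost of delivering it tends to climb much faster.
for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct}% availability -> {downtime_minutes_per_year(pct):,.1f} min/year of downtime")
```

Three nines allows roughly 526 minutes of downtime a year; five nines allows about five, which is why the conversation shifts from buying more nines to buying resiliency and recoverability instead.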
And that’s perfect, that’s perfect in keeping with our theme of keeping your company safe, or keeping your business safe with leveraging the ARRR pillars. And to do so, we have, like you mentioned, a jam-packed session in there. So, the first segment, we’re going to have fellow Head Geek, Leon Adato, and Kevin Sparenberg, KMSigma, join us, and they’re going to show you how to handle change management via reports and analytics in Patch Manager.
You’re bringing the super group?
Then, you and I are going to turn something that’s ephemeral into something that’s more persistent?
Yeah, I think we’re going to have a debate about opinion versus truth.
Truth or dare in the cloud is what I heard. [Laughter] And lastly, we’re going to recover with best practices, disaster recovery, and backup.
Yeah, so we’ll be coming through with myself and Carrie Reber, and I believe, Patrick, you’ll be joining us.
Yeah, well, you know, I just love talking about backup. But, actually, one of the things that’s interesting about this episode, and Kong is famous for this, we, I’m sure you guys have figured out by now that we actually sort of take turns writing episodes. Kong’s episodes, in particular, come from works on blogs, and eBooks, and other content that he does. And yours tends to be applicable regardless of the tools and technologies that you use. So one of the things that we’re hoping that you get out of this episode is, I mean, obviously, many of you are SolarWinds customers. But these approaches should apply regardless of the technology that you’re using. And it’s going to be really interesting.
Absolutely, what we aim to do is have you continue to develop and hone your foundational skills like monitoring with discipline, or as my fellow Head Geek, Leon says, “Monitoring as a discipline.” The DART and SOAR frameworks, right? So you can take these lessons that are going to be shared, leverage the ARRR Pillars across any tech construct, any project in IT, or DevOps, that you have coming up this year and be successful.
You know what, we ought to include a link in the details below so that you guys can check out the DART-SOAR framework. Alright, so we’re going to start with Leon and Mr. Sparenberg, right, alright, ready, watch this.
Hi, everyone, we have back on the SolarWinds set, Kevin Sparenberg, he is one of our product marketing managers for the network monitoring pillar tools, welcome back.
Thanks, I actually like being here, it’s a lot of fun.
It’s always fun to have you. So, in this segment, we’re going to talk about patching but what struck me when we first talked about this topic was that it strikes me as basic blocking and tackling. Right? I mean like, doesn’t everyone have a patching solution in place already?
Well, yes and no. Most operating systems, specifically Windows, ship with one. They’ve got, you know, the Windows system updates that are there, you just go, “Yeah, check.”
“Install.” And people think that’s enough. And if you’re running a pure Microsoft shop, no third party, there’s an argument to be made. But I still don’t think it’s going to cover everything.
No, I think that, first of all, we’re going to ignore the desktop environment for just a second.
And we’re just going to talk about the server environment, that’s what we’re focusing on but even within the server environment, relying on your server teams to approve, accept, understand those updates without any sort of centralized review is, I think, a dangerous proposition.
Oh, it’s extremely dangerous. Not to mention if you have to worry about falling outside of change windows, you have to make sure that systems are down on a scheduled basis, that this kind of patching takes place that’s actually in line with your organization.
Right, so, yeah, uncontrolled, unapproved, untested change.
Probably not the best idea but Microsoft solved that.
Yes, so Windows Server Update Services, WSUS, is a free tool that’s available as a Windows feature. All you do, on anything from Windows Server 2008 and later, is turn the feature on, and it says, “Okay, we’re ready to pull down and cache these things for you, you’ll have control of sending them out, and then you can send them out through your local Windows servers.”
Right, and it’s, the price is right.
Free is good.
Uh, it’s scalable.
Very, very scalable.
For free. Although, I will say, if you have a larger environment, you may need a SQL Server back end just to manage the data.
But, otherwise, you’re pretty free. So, again, why are we here, what are we talking about?
Well, we mentioned earlier that it’s kind of, for the Microsoft-only house, if you’re literally running Windows, and you have only Exchange on it, and you have only SQL Server, and you have only these things and you run nothing else outside the Microsoft kind of, umbrella.
Oh, you mean, those imaginary companies.
Yes, yes, the imaginary companies. So, but a lot of servers these days, if they’re running Java, they’re probably running a slightly dated version of Java, as a part of the wrapper for an application that’s installed in there. If they’re running any type of utility.
Acrobat, like any of those, you know, tools that are–
And if it’s not Microsoft, WSUS is apparently not going to help out?
Well, again, yes and no. Here’s the thing. Microsoft doesn’t actually publish that catalog. They’re not going to keep a catalog of third-party tools. I mean, that’s not their shtick, they’re Microsoft, they’re worried about Windows and their operating systems. However, we can leverage that WSUS infrastructure because it’s already built and works exceedingly well. So you can be out there with a machine where you publish a patch, and you say, “This is now approved for distribution,” and you can have that scale way out globally.
So you can actually have that level of control. Now, all we do is we take that and we kind of say, “Hey, how about this other patch that’s not a Microsoft one?”
Right, and there are a few other things that WSUS doesn’t quite fill the gap for.
You don’t get a solid inventory system, like a complete catalog of systems, and you get some status but not enough, so there’s really a gap to be filled. Which takes us to Patch Manager.
So Patch Manager gives us what we just talked about. Inventory, a sense of what patches have been pushed out to a machine, and whether they were successful or not, right? What else does Patch Manager offer?
It has a ton of reporting, it has scheduling, it has the third-party package list. Which is the big one for me, so if you actually work with, Adobe and Java seem to be two big ones that come on top. So if you actually work with them, and they need to be pushed out, whether it’s to servers because you’ve got some type of VDI infrastructure, or you’re working with something else, but they need to go out to those, we’ll actually capture that, wrap it, and send it out. So the regular Windows update “process” can do it, but we’ll actually control the scheduling.
Right, the phrase to keep in mind is, “Pre-built and pre-tested.”
So that’s part of what you’re getting with the Patch Manager system. But more than that, I think there’s also a reality about life in corporate IT, which is, I mean, we’ve been talking so far about how do we get all the latest patches out. But that’s not always the case. There are situations where you don’t want the latest patch to go out, you’ve got a system that needs to be held back. So, now you have this weird issue where to maintain audit compliance, you have to patch it, but you have a business-based reason not to.
Not to patch it, yup.
And that puts you in sort of a, you know, jeopardy situation, which we’ll talk about in a little bit. So, more than that, I think that Patch Manager also offers you the idea that if the patch went bad, do you have the information about what went wrong, can you re-try it.
You know, you have those kinds of things. You’ve got a persistent diary of your systems and where they are at any given point in time. And when we talk about the ARRR framework, you want to have that record of where my environment is at. The other piece that Patch Manager gives you is that, rather than keeping a list of excepted systems, the ones you’re not going to patch, and wondering what’s going on with that list, what you’ve got is rules. If it’s running Windows 2012 core, and it has .NET 3.5, and it’s got a directory called, you know, abc123companyinventoryprogram, then we’re not going to push the .NET 3.5.1 patch.
Service patch, yup.
Right, now you’re not managing a list of systems, you’re managing rules, and rules can then enumerate to which systems currently fit that rule set, so you can address them on an ongoing basis. I think you were saying, like, compliance is not everything being up to date.
No, compliance is all about making sure the things that can be up to date are up to date. And for the things that can’t, because they’re going to impact your business need, it’s about having really good documentation about the why, when we’re going to try to get them up to date, and what our blockers are.
Right, so compliance is basically compulsive note-taking.
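The rules-instead-of-lists idea from a moment ago can be sketched as a simple predicate. This is purely illustrative, not how Patch Manager stores rules; the field names and the fleet data are made up:

```python
def hold_back_dotnet_patch(system):
    """Hypothetical rule: don't push the .NET 3.5.1 patch to Windows 2012 core
    servers that run .NET 3.5 and have the legacy inventory app's directory."""
    return (
        system["os"] == "Windows 2012 core"
        and "3.5" in system["dotnet_versions"]
        and "abc123companyinventoryprogram" in system["directories"]
    )

fleet = [
    {"name": "app01", "os": "Windows 2012 core",
     "dotnet_versions": ["3.5"], "directories": ["abc123companyinventoryprogram"]},
    {"name": "web01", "os": "Windows 2016",
     "dotnet_versions": ["4.7"], "directories": []},
]

# The rule enumerates, at any moment, exactly which systems it applies to,
# so there is no hand-maintained exception list to drift out of date.
held_back = [s["name"] for s in fleet if hold_back_dotnet_patch(s)]
print(held_back)
```

The point is that the rule, not a static list, is the source of truth: when a server is rebuilt without the legacy directory, it falls out of the held-back set automatically.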
So, that’s, I think enough talking. I really want to get our hands into the system. So we’re going to take a look at Patch Manager. Full disclosure, this is not a deep dive into Patch Manager. We have other Lab sessions to work with that. However, we want to just do a quick overview. So, over here in our system, first of all, we have some machines, you can see that we’ve got a few different machines, not a lot, it’s a Lab.
It’s a Lab, yeah.
You know, but, you can see that we have the machines. We have a package we’re going to push to. In this case, I think we’re going to use Notepad++. Again, this might be more of your, you know, desktop environment, but it doesn’t matter. And the point I want to make here is that you can use your patching system not just for updates, or security fixes, or whatever, but also to push out whole software. Say, any one of the 37 free tools that SolarWinds has.
Shameless plug. So in this case, we’re going to use Notepad++, so talk us through this. How do I get this out onto a system?
Okay, so, there are a couple of steps here. The first thing is to identify the specific packages you want. Now, these packages synchronize in, and like you said, they’re pre-tested, ready with all the rules wrapping them. So all you really have to do is select which one you want to use; in this case, we’ll use the most recent. So we’ll go down to the bottom. Which is not going to work, so we’ll re-sort it, and then we’ll go to the bottom. So here’s 754, so we’ve got 754 here. And all we have to do is say, “Okay, I want to either upgrade systems that already have it on there.” There’s a specific rule in here, as part of the prereq rules, that determines whether or not it’s already installed. If it is installed, then we’ll upgrade it. If it’s not installed, this is not going to push it; that’s what this one is for. Or, let’s say we need it everywhere, and these are the Windows servers I’m worried about sending it to, so I like this setting. So, I’m going to go ahead and publish the contents.
Okay, so that one’s been pushed and published. What’s next?
Okay, so that’s out there on the server. But we don’t actually have any approval set up. So it’s basically just going to sit there and collect dust. So what we’ll do is we’ll come into third-party updates, and then we will actually find it. Here it is already, we can say, “Oh, I want to approve this.” In my case, I want to approve for install everywhere. So all I have to do is select all computers, approve it for install, which this one’s already done. You can see here, it’s already been approved. I did this a little earlier to kind of help.
And then, approve it.
Okay, so we’ve pushed out a package. Again, not a deep dive into the mechanics of it.
But the real power, especially as it relates to the ARRR framework, is knowing what happened and what didn’t happen. So, really, where I want to go next is, how can I know the health of my environment? The first place I want to go is back to all the computers. You might’ve missed it, but here I can see some really interesting information. I can see not just the operating system, but the disposition of the machine, whether it’s considered a workstation or a server. I can also see that on this machine, one update is installed, six are not installed, and three are actually in a pending status. So you can see, for any given system, what’s waiting and what’s going on with it.
I think that’s important. But there are other reports, also. So, over here on the side, we have really extensive, interesting reporting. One of the ones I want to highlight is the analytics. If I run this Approved Updates report, just right-click and run the report, then what I was getting from clicking on each machine individually, I can now see for the entire environment, right? I can see how many updates have been approved but not installed, how many installed unsuccessfully, you know, downloaded and failed, etc. So I’ve got those kinds of things for the entire environment.
Another report that I want to take a look at is under, no, not Task History, a different category: Published Updates. So once again, here I can see which patches were pushed out to which machines, whether they failed or succeeded, and when you’re talking about the reliability and accounting of your environment, that’s what you need on an enterprise basis.
That’s what it brings, that’s what it brings to the table. So I think that from a concept of ARRR, patching is that basic blocking and tackling, but how you get there is up to you.
And, I believe that the level of rigor, and the level of reliability that you have, in terms of knowing where your environment is at, depends on having sort of, you know, robust tools that will get you there.
And that’s the case for everyone. So, just make sure that you’ve got separate groups for your servers, you’ve got them for your workstations, I mean, mostly we’re kind of, gearing towards servers. But one thing I like about our solution for this, and having this real nice, kind of, reporting engine is, “What happens if you’re running Windows core?” Windows core has no UI, it’s kind of a little difficult to tell whether or not your updates are going cleanly. We’ll actually be able to report that stuff back.
Right, alright, well, as always, thank you for joining us.
Oh, no, it’s a pleasure being here.
Okay, fantastic, thank you, everyone.
ARRR, there be logs, matey. C’mon, I’m just trying to keep, you know, with the theme of today’s episode. No, but seriously, how many times have you, and I know you guys certainly have, had a problem you were trying to troubleshoot where, if you don’t have the detailed logs, the actual transactions that happened, it’s impossible to troubleshoot. I mean–
All the time.
Right, it’s really bad, there are some things that make it a little bit easier for us. I think, certainly if you’re using, like, virtualization, you have, the platform offers some of them.
That’s true, though the platforms like VMware vSphere, Microsoft Hyper-V, they aggregate all those logs for you.
Right, but especially now with like, Kubernetes and Docker, and even maybe a custom application that your Biz Dev team created, right? That’s doing like, API access against Salesforce. Those things are all creating logs all over the place. And if you don’t grab them, if you let them be ephemeral and disappear, you’re not going to be able to go back and fix things, especially when things are hard down.
So you’re talking about the hybrid IT case, right? Are we talking Loggly?
Well, we’re going to, I’m going to use Loggly and PaperTrail to sort of show that roll-up once it’s all aggregated. But the thing about this, and you guys tell us all the time, is that the real challenge seems to be how do you get past that first mile? Like, how do you actually configure Docker, or a Linux server, or CloudWatch, to send that data and get it where you can do something with it, regardless of the product that you’re using? So I want to show them that.
Yeah, or how do you get developers to start thinking about logging before they develop their application?
Putting logging in at the beginning of the application, Kong, you devil, you. Yes, next thing you know, you’re going to be talking about putting in security at the beginning.
You know that ain’t happening, brother.
Alright, so let’s do this, let’s do Docker. We’ll do Linux, journald, I’ll show you how to set those up pretty quickly. And then, we’ll do the AWS CloudWatch as an example. And we’ll just do like, two or three minutes on this. Get you guys started and start thinking about how to do this, yeah?
Alright, let’s look. Okay, so we’re looking at Linux here. This is Docker running on a Raspberry Pi back at my house. But pay no attention to that. And I want to actually, you know, look at logs, like the native Docker logs. We’ll start with that, and then how to get it out.
Right, okay. So, the first step is, how do you normally get logs? You’re going to do docker ps -a, which gives us a list of the running Docker instances. And then, I would typically do docker logs with the container ID, and that’s going to spit out the logs for that instance. But what I really want to do is something a little bit better and aggregate those logs. And you’ll notice, I just did docker logs right here, and it says that the configured logging driver does not support reading. Why does it not support reading?
Why doesn’t it support reading?
Because I’m already redirecting those into journald, using the journald log driver. So just by default, I’m doing it that way. And the way that we’re going to do that is take a look at the daemon configuration for Docker itself.
Now, normally we’d have the default log driver, or a lot of times, this file won’t even be there. So the first thing I did was create this JSON file and say the log driver is journald. So basically, anything that would normally go to STDOUT or STDERR is now piped through journald and goes into the journal. Okay, so now I’ve got centralized logging for Docker.
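For reference, the daemon configuration being described is Docker’s /etc/docker/daemon.json; a minimal version that routes container STDOUT/STDERR into journald looks like this (restart the Docker daemon after changing it):

```json
{
  "log-driver": "journald"
}
```

Note that once this driver is set, docker logs no longer reads for those containers; the journal becomes the place to look.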
Next, how do I get my centralized logging for my Linux instance out? So for that, we’re going to take a look at the journald config.
And, if we scroll down here, you will see ForwardToSyslog=yes. Normally that’s already set to yes, but it’s commented out, so I’m just going to remove the comment and then restart my journald service. So that is now forwarding the aggregated logs from Docker, via journald, to our syslog. So we’re going to take a look at our syslog config. And if we scroll through here, you’ll notice that this one is set up pretty much by default, right? It’s telling me the types of logs I’m willing to receive, it’s a bit of a catch-all, and where those are cached, and how they’re being sent out. So then, how am I getting it out? In this case, I’m using Loggly as my aggregation service, so how do I finally get that out? Pretty easy, we’re going to go over here to /etc/rsyslog.d/. And if we take a look at the files in here, you’ll see, “Oh, look at that. There’s a PaperTrail and a Loggly config.” So, we’re going to cat our Loggly config, and this is the config that’s normally set up by the Loggly script. I did it by hand, but you can use the script to set it up instead. And this thing tells me a couple of things. The first is where that spool is and the amount of file space I want it to have. But the main thing that I want you to look at here is DefaultNetstreamDriverCAFile. This is really, really important. A lot of times when we do logging, especially over the internet, and with aggregation, it’s really easy to just do UDP syslog out, no problem.
But especially as you aggregate logs and you’re getting more and more information into smaller, and smaller conduits, the chances of somebody being able to sit there in the middle and sniff that off, and accidentally, there’s a password or some other critical data that’s a part of one of those log messages, being exposed, gets higher, so make sure that you are using TLS as a part of that. So that basically defines the TLS. And the next thing down here is the template that defines the format that Loggly expects, it’s doing a little bit of remapping for PROCID, and APP-NAME, and that sort of thing. And then down here, the last declaration is the action to just forward it, using TLS and that certificate out to Loggly. And it’s sending it out on a particular port, and to a destination.
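A sketch of what that /etc/rsyslog.d/ config might look like; the CA file path, token, endpoint, and port here are placeholders patterned on Loggly’s documented rsyslog-over-TLS setup, not copied from the episode:

```conf
# /etc/rsyslog.d/22-loggly.conf (sketch; substitute your own token and paths)

# CA certificate used to validate the TLS connection to the aggregator
$DefaultNetstreamDriverCAFile /etc/rsyslog.d/keys/ca.d/loggly_ca.crt

# Template remapping APP-NAME/PROCID into the format the aggregator expects
template(name="LogglyFormat" type="string"
  string="<%pri%>%protocol-version% %timestamp:::date-rfc3339% %HOSTNAME% %app-name% %procid% %msgid% [YOUR-TOKEN@41058] %msg%\n")

# Forward everything over TCP with TLS rather than plain UDP
action(type="omfwd" target="logs-01.loggly.com" port="6514" protocol="tcp"
  StreamDriver="gtls" StreamDriverMode="1" StreamDriverAuthMode="x509/name"
  template="LogglyFormat")
```

The action block is the piece the episode emphasizes: the StreamDriver settings are what keep aggregated log traffic, which can easily contain passwords or other sensitive data, off the wire in cleartext.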
Patrick, you’ve shown us how to modify the config files, a lot of CLI commands, to me this spells, you can automate it, you can script it.
Absolutely, you would typically do this with a script. That is exactly right, and especially like, you might want to embed it into a Docker container itself. Right, so you don’t have to worry about this. Is it running in Kubernetes? Is it running in Docker Swarm? Is it part of a container service running in AWS? You don’t care because the configuration is actually baked in so no matter where it runs, you’re going to get that data. So let’s take a look at what that looks like. So here I’m, we’ll do it in PaperTrail first because it comes up a little bit more immediately in PaperTrail because it’s really designed for trailing. Right, that’s what it does. So, I’m going to come back over here. And we’ll just, for grins, say, logger, “hey there from pi.” Now come back over here, and look at that. “Hey there from pi,” but there’s also some other things that are going on here. You can see that there’s some messages here that are talking about not being able to update a registry for a particular host that’s running in Docker, right?
That’s something else that’s going on. Now, I could certainly query through this here. But the main thing is, remember, I’m aggregating all of those Docker messages. So I’m going to come back over here to Loggly and look at the same data that we were looking at before. Now, these are all the same events, they’re just parsed out, sort of raw. But what I’ve done is gone in and done a couple of little pieces of analysis, right? So the first one is, if I look at this from the syslog perspective, I might want to break this down by host. Right, so I’ve got a couple of them. I’ve got the Pi itself, and a couple of other ones that are sending it. But you’ll notice this one right here, right, it’s parsing this out for me, and, I’m going to break this out here so you can see it a little bit better. So this one is telling me that it can’t authorize to a Docker Hub host, right? Okay, that’s helpful, but how many times is that failing? And in this case, it’s watchtower running, which can automatically pull down new versions of an instance and add it. Well, I think what’s better is actually to look at this more from a dashboard perspective, right? Because if you’re aggregating all those logs, go ahead and pull those together on a dashboard. So what I’ve done here is just taken what were charts coming from the events, including the data that was in them, and set up a couple of little rules, like this container restart failure. And that’s telling me that on a particular interval, I’m failing that restart. That is something I didn’t know.
Until I had all the logs in one place.
In signal processing, that’s filtering the noise out.
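That noise-filtering step, counting failure events per host once everything lands in one place, is conceptually just an aggregation. A toy sketch with made-up log lines:

```python
from collections import Counter

# Hypothetical aggregated syslog lines, "host program: message"
events = [
    "pi watchtower: container restart failed",
    "pi watchtower: container restart failed",
    "pi dockerd: unable to update registry for host",
    "nas sshd: new session opened for user admin",
]

# Count only the failure messages, keyed by originating host
failures_by_host = Counter(
    line.split()[0] for line in events if "failed" in line
)
print(failures_by_host)
```

A log aggregation service does the same thing at scale with saved searches and dashboard widgets, but the signal it surfaces, two restart failures on one host, is only visible because the logs were collected in one place first.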
That’s exactly it, and doing it, or like, down here, for example.
Here, I’m looking at sync analysis, apparently, it says sunk analogies. [Laughter] But these are sync analogies, right? So this is telling me where, I’m either getting a whole lot more events than I expect, or less. And then I’m going to be able to drill in and figure out what those are. So, for example, here’s a list of the messages by application, so this is telling me like, sort of, application and density, and again, that was just something I did as a chart. And I said, “Hey, let’s save that over a dashboard.” I can look at my total number of events. Here’s how many times we sudo, this is when I just did the quick–
Sudo over so that I could change directories and pull that Docker config. Here’s my new sessions by host, down here as well. And again, if I didn’t have all this data in one place, I wouldn’t be able to, sort of, figure this out after the fact, right? I’ve got to capture it then.
Yup, you are observing your cluster because you want to maintain control of it. Observability and controllability for the win.
That’s right, okay, so one more quick thing. Over here, you’ll see this panel in my dashboard, which is Inbound Calls. Now, I get a lot of spam calls, I know you guys do at home, right? So one of the things I was thinking about was, what if I could analyze some data that comes back as part of the JSON payload coming out of, in this case, Twilio, that’s running in AWS. [Laughter] I can actually chart bogus calls against legitimate calls coming in on the interface. Now, in this example, it’s phone calls, right? But normally, this would be a business value that you would care about, right? Where it’s embedded in a log, it’s JSON, it’s sort of long and forgotten, but you’d like to be able to pull it up. Well, the nice thing is being able to start that with a search, do the exploration, and save that. And the way I did that was, there’s a piece of data I’m looking for here. There’s a JSON message that comes back. So I’m looking for a Twilio request, I want the direction to be inbound, and I want the call status to be ringing, right? So I’m going to search for that. There haven’t been any in the last 10 minutes, so we’re going to go back here and say last two days. Ha-ha, so then these are the parsed-out messages. I’m going to have to blur this out because I just gave away my home phone number. But even here, this Nomorobo spam score is the thing that’s coming back and telling me whether it’s a bogus call or not, right? Now, what does this have to do with resiliency?
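The kind of query being described, filtering JSON call events down to inbound ringing calls and splitting bogus from legitimate by spam score, looks roughly like this. The event shape and field names are invented for illustration; they are not Twilio’s actual payload:

```python
import json

# Hypothetical per-call JSON log events pulled from the aggregator
raw_events = [
    '{"direction": "inbound", "call_status": "ringing", "spam_score": 0.92}',
    '{"direction": "outbound", "call_status": "completed", "spam_score": 0.0}',
    '{"direction": "inbound", "call_status": "ringing", "spam_score": 0.05}',
]

events = [json.loads(line) for line in raw_events]

# Keep only inbound calls that are ringing, then count the likely spam
inbound_ringing = [
    e for e in events
    if e["direction"] == "inbound" and e["call_status"] == "ringing"
]
bogus = sum(1 for e in inbound_ringing if e["spam_score"] > 0.5)
print(f"{len(inbound_ringing)} inbound ringing calls, {bogus} likely spam")
```

In the episode, the same filter is expressed as a saved search in the logging tool and pinned to a dashboard, so the business metric buried in the JSON stays visible over time.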
Yeah, especially for business continuity that we were talking about.
Mm-hmm. Do any of them, do you think, have IVR systems? Inbound systems to their businesses with customers that actually care about whether their calls get through?
Right. So how does, so when there’s a problem with the IVR, there’s a log somewhere.
It’s going to be a configuration issue most likely, or it may be a service-provider problem. But this is not bandwidth, this is not an interface. This is something at the application layer. And being able to capture that in a remote place, and then surface it so that you can go find it, is the way that you get back online quickly. So if you think about it, one of the things you’re actually doing by providing really rich logging is buying almost insurance against not being able to solve something quickly, right? It makes you feel better because you know, you know we get accused of hoarding logs all the time. Right?
But if you know you’re collecting it, you can go to management and say, “Not only do I have something to analyze my logs,” and if you want to use Elasticsearch, you can do that. But the main thing is, I took the time to export those, to actually make sure that those are all aggregated in one place. You can sleep better because there’s nothing that you won’t know, and it’s the truth. It’s not just an opinion based on polling an interface. It is the truth of what’s actually happening. A single bad call, a single angry customer, there’s going to be a record of it somewhere.
I think I’m starting to separate some of the truth from the opinions, because you’re taking something that’s cloudy, ephemeral, and you’re making it persistent.
I’m persisting it.
And, not only that, but you’re surfacing the single point of truth because now you know exactly what happened, why it happened. And you’re now more resilient because now you can defend against that.
That’s right, I can get metrics, and I can troubleshoot if I just take the time to set it up.
ARRR Pillars at work, availability, reliability, resiliency, and recoverability.
Alright, so we’re going to take a minute and talk about backup, and I know I make fun of backup a lot, but it is really a critical component of ARRR, to make sure that you can recover critical assets for your business. And joining us is Carrie Reber.
VMCarrie on Twitter.
That’s right, and you’re?
Chris Groot, @ChrisGroot on Twitter.
Yeah, so be sure to hit him up online. But Carrie, you’re joining us also. You’re part of the product team for backup.
Yes, I’ve worked for a few backup companies over the years and I’m happy to be part of the product team here at SolarWinds.
Awesome, so what are we going to talk about today?
Well, the first thing we’re going to talk about is some best practices and principles that you should be aware of for your backup policies.
Like the 3-2-1 principle?
That’s the one, so 3 copies of your data, 2 different media, and 1 off-site.
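The 3-2-1 rule is concrete enough to check mechanically. A minimal sketch, assuming a hypothetical inventory of backup copies with a media type and an off-site flag:

```python
def satisfies_3_2_1(copies):
    """Check the 3-2-1 rule: at least 3 copies of the data,
    on at least 2 different media, with at least 1 copy off-site."""
    media = {c["media"] for c in copies}
    offsite = any(c["offsite"] for c in copies)
    return len(copies) >= 3 and len(media) >= 2 and offsite

# Hypothetical inventory for one data set.
copies = [
    {"media": "disk",  "offsite": False},  # production data
    {"media": "disk",  "offsite": False},  # local backup appliance
    {"media": "cloud", "offsite": True},   # off-site cloud copy
]
```

Dropping the cloud copy from the list makes the check fail, which is exactly the gap the rule is meant to catch.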
When you say off-site, a lot of times people think about the cloud, of course, but that can be good and bad.
There’s a lot of concerns around data-locality.
Where’s your data?
Yeah, where is your data?
Do you have–
Who’s got it?
Who has control?
Can anyone see my data?
So that’s something to definitely ask vendors about where they’re storing off-site.
Yeah, I’m understanding that the data needs to be encrypted before it leaves production devices, when it’s in transit, or going over the internet, where it’s located, that it’s encrypted there. And that you control the keys.
And then the second piece is around compliance, so things like data sovereignty are super important to companies these days. Knowing that if they’re located in the U.K., or even have different offices for one particular company that may be working globally, knowing that, let’s say, they run in the U.K., or that they’ve got an office in the U.K., the data stays in the U.K.
So details at that level are really important to be asking about with that 1 in the 3-2-1.
We also hear a lot of talk about RPO and RTO. Those acronyms.
Do you know–
Can you talk us through those?
What those stand for, Patrick?
Why don’t you refresh my memory? [Laughter]
RPO is Recovery Point Objective, and that’s how much data you can afford to lose. And RTO is Recovery Time Objective: once you do have an outage or an interruption, how long you’re going to be down for.
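Those two objectives translate directly into simple time comparisons. A sketch, with made-up timestamps and thresholds for illustration:

```python
from datetime import datetime, timedelta

def meets_rpo(last_backup, now, rpo):
    """RPO check: the newest backup must be no older than the amount
    of data (measured in time) you can afford to lose."""
    return (now - last_backup) <= rpo

def meets_rto(restore_duration, rto):
    """RTO check: the measured (or rehearsed) restore must finish
    within the downtime the business can afford."""
    return restore_duration <= rto

now = datetime(2019, 3, 1, 12, 0)
last_backup = datetime(2019, 3, 1, 9, 0)  # backup finished 3 hours ago
```

With a 4-hour RPO the 3-hour-old backup is fine; with a 2-hour RPO it is not, and that is the kind of gap the business conversation below is really about.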
But wouldn’t you say those are more SLAs that are actually, that the business cares about? Rather than a traditional sort of availability or timeout, or some other reachability number that we sort of tend to internally talk about?
Absolutely, yeah, that’s, it’s the business conversation that you can discuss with people like, “What can we truly afford in terms of downtime?”
Because you’re talking about risk.
Not just some arbitrary number.
And it’s important too to remember that not all disasters are big disasters. People often think, “Oh, a meteor hits the data-center. You’ve lost everything.” And that’s what they may think they need to be prepared for. But far more common are those small disasters. Where perhaps an executive accidentally deleted a file that they need for the board meeting.
That suddenly becomes a really high priority.
Well, it is a high priority, and it is important to the business. Don’t let me say that being prepared for a board meeting is not. But from that perspective, we end up spending so much time being driven by little things. I know it’s critical from that one perspective, but that’s just one file being recovered. Or maybe it’s something like a single file on a single workstation, or a config file for a server. It’s important, but for you to be able to stay productive managing all the other things that you have to do in IT, it can be an enormous distraction.
So you have to keep it simple.
Yup, keep it simple, and there’s a couple of elements to that. The first is standardizing your rollout, so understanding that you’ve got policies across different devices, and how your servers and your workstations are being protected from that perspective.
So you’re talking about being able to assign profiles to groups of devices.
Yeah, exactly, you have different workloads with different needs. Some devices need to be backed up once a day. Some devices need to be backed up once an hour. So being able to specify, for every single device, exactly how it should be handled, how often the schedule runs, how often it needs to be protected, how often a cycle needs to run. And then another aspect of having good protection is from a recovery perspective. It’s great to have that copy off-site, but in a lot of cases, IT admins need a faster RTO. And one of the ways you can do that is to also keep a copy local.
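That idea of assigning profiles to groups of devices can be sketched as a small policy table. The profile names, intervals, and device records here are hypothetical, not the product’s actual schema:

```python
from datetime import datetime, timedelta

# Hypothetical backup profiles: the interval between backup cycles per group.
PROFILES = {
    "servers":      timedelta(hours=1),  # hourly protection
    "workstations": timedelta(days=1),   # daily protection
}

def devices_due(devices, now):
    """Return the devices whose last backup is older than
    the interval their profile allows."""
    due = []
    for d in devices:
        interval = PROFILES[d["profile"]]
        if now - d["last_backup"] > interval:
            due.append(d["name"])
    return due

now = datetime(2019, 3, 1, 12, 0)
devices = [
    {"name": "sql01", "profile": "servers",      "last_backup": now - timedelta(hours=2)},
    {"name": "ws042", "profile": "workstations", "last_backup": now - timedelta(hours=6)},
]
```

The hourly server is overdue while the daily workstation is fine, which is the whole point of per-workload profiles.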
Or just sleep better knowing that you’ve got a physical copy of your own data.
That’s true too, that’s true too. And so in terms of that, being able to have a local copy and to be able to make it very easy to store that, and keep it in sync with what’s also being backed up to the cloud.
So the recommendation is, get started quickly by making a backup off-site, and then the follow-on is to go ahead and create a local backup copy as well.
Yeah, it’s time to value.
Yeah, and if you use this service, it’s really fast. Because traditional backup software that’s been around forever, you download a piece of software, you install it on a local server, then you have to identify the storage that’s going to be your backup targets. You have to provision that storage, you have to configure it and set it all up. If you want an off-site copy, in some cases, you have to separately get an arrangement with a cloud provider.
And configure all of that, it can take a long time. But if you have a cloud-first service, you get a lot of quick time to value.
So that covers the 2 in 3-2-1: it gets you an off-site copy and a local copy, without doing sneakernet.
And then, it really is that time to value that is important. We had a funny story here just last week where I was discussing this with internal IT, and they were going to wait, I think, 3 weeks to get local storage for this local copy.
It kills a lot of backup projects. You get hung up waiting for that local copy backup.
And I said, “Guys, like, let’s do it today.” And literally the next day, they were rolled out and everything was protected as planned. Now, when that hardware does come in, they’re going to connect it up, and it’ll sync itself back from the cloud, so we’ll get that protection. But it’s still better to have one copy than none. The only thing better than one backup is having two backups, right? And that’s really where we’ll be in a couple of weeks.
Well, but the other thing about using a profile is not every piece of data has the same criticality. Maybe your time for backup and recovery could be different, and so, it also lets you, based on the profile, decide the speed of that storage for that backup. So you can actually have tiered backup performance based on the types of data that you’re backing up, yeah.
Yeah, you can even have granularity in terms of file types. If you decide MP3 files are not something we want to protect, you can make that choice.
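That file-type granularity amounts to an extension filter in the backup policy. A sketch, with a made-up exclusion list and sample paths:

```python
from pathlib import PurePath

# Hypothetical exclusion list: extensions the backup policy should skip.
EXCLUDED_EXTENSIONS = {".mp3", ".tmp"}

def files_to_protect(paths):
    """Drop files whose extension is excluded by policy (case-insensitive)."""
    return [p for p in paths
            if PurePath(p).suffix.lower() not in EXCLUDED_EXTENSIONS]

sample = ["C:/docs/board-deck.pptx", "C:/music/song.MP3", "C:/tmp/cache.tmp"]
```

Only the presentation survives the filter; the music and temp files are deliberately left unprotected, saving storage and backup-window time.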
So, once you’ve got your profiles written up, the second piece of keeping it simple is managing by exception. As long as it’s green, green is good. You don’t have to spend an ounce of time on it.
’Cause people are busy. I don’t know too many people, I mean, in a large organization, maybe they have a dedicated backup person. But at most companies, I think, you’re wearing lots of hats, and you don’t have time.
Right, you’re basically talking about two different views, right? One of them is the breadth of my backup, so one exception is which systems are not being backed up. And the other exception is backups that are not successful.
Or they’re slow, or they’re not meeting their SLAs.
Right, so, to manage those exceptions, one of the things you really want to look for is the ability to filter down to find where there might be an issue.
So we can filter from many devices down to a few. And you can see, by clicking on a particular device, exactly where that issue was. So if it’s only one file out of 309,000, that’s probably just fine, because it backed up the day before, or you can back it up next hour. Or, if you really want, you can actually open up that particular agent, on that particular device, and just run a backup in real time.
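Managing by exception is, in essence, one filter over the job results. A minimal sketch, assuming a hypothetical result record per device with a status and a failed-file count:

```python
# Hypothetical backup-job results, one record per device.
jobs = [
    {"device": "sql01",  "status": "success", "files_failed": 0},
    {"device": "web02",  "status": "failed",  "files_failed": 0},
    {"device": "file03", "status": "success", "files_failed": 1},  # 1 of 309,000 files
]

def exceptions(jobs, max_failed_files=0):
    """Manage by exception: surface only jobs that failed outright,
    or that skipped more files than the allowed threshold."""
    return [j["device"] for j in jobs
            if j["status"] != "success" or j["files_failed"] > max_failed_files]
```

With the default threshold both `web02` and `file03` surface; raising the threshold to 1 tolerates that single missed file and leaves only the genuinely failed job to chase.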
Okay, now, you just connected to the agent. So is this something where I install a server and then distribute agents, or how does this work?
So the management layer is hosted.
Okay, so it’s a, so–
All your reporting.
So all the reporting, central dashboards, everything else, that’s a SaaS, that’s a web-based tool. And then the only thing you’re deploying here is agents?
Are you pushing the agents or it’s part of a group policy, or how are you getting the agents out?
Well, you can choose how you want to do it. So you can do like, the manual installation, one by one.
No, we don’t want to do that.
No, we don’t. Or, the preferred would be to do automated deployments. So you can do it through Active Directory group policy. You can do it through, you know, it’s scripted, so.
Whatever your favorite software deployment method is. You can bring that up, and be provided with that script. And so, it’s very, very easy to deploy, you know, via PowerShell, or…
So, I didn’t mean to interrupt you, but you were showing, just a second ago, the view from the agent. For that view, you’re basically logged in to the console, and when you need to get to a specific machine, it’s actually going to proxy that through for you.
So if you’re on that local machine, you can pull it up locally. But if you’re operating from the console, and working with someone who’s not sitting, or if you’re not sitting right beside that machine–
You can be on the local machine so that if, for example, you had a local copy, and the network was down, you could still do a recovery?
What do you know?
Oh my gosh.
And if it’s after hours, and you’re at your kids’ soccer game, and you need to do a recovery.
Could you even do it from there?
Yeah. You can do it–
I love it!
From any type of device. So the restore is very simple. You just go back to a particular date, drill down into any particular detail, and run that restore. It’s got to be that easy, so when that phone call comes, it’s a 30-second task, and you can go back. So managing exceptions is part one: you know exactly where the problems are, delineate them, and make sure that they’re corrected. The second thing is dealing with recovery: knowing where, and how quickly, you can get files back.
And the third piece and last piece that we’ll talk about today is testing.
Often overlooked, you’re exactly right.
And so, understanding that you can actually take devices and test that recovery in a non-intrusive, non-interruptive way. To actually create a standby copy of an entire physical or virtual machine in another virtual environment, or even in Azure, just for testing purposes. Having that available, and being able to verify that those servers will actually boot up, is an important piece of the equation as well.
And it’s interesting because I like to sort of make the joke that this is continuous recovery delivery, right? Because in a CI/CD environment.
Recovery of the swiftest.
One of the things that you would be doing is post-deployment tests, right? So you would automatically make sure that that service came up and that it ran the way it was expected to. You’d be testing it at build, you’d be testing it at pre-deploy, you’d go live with it, and test it again. And I make the joke over and over again that if you are not testing your recoveries, you do not have backup. So being able to automatically run a post-recovery test takes a lot of that risk away, and it also lets you walk away from the system, let it do what it does, and focus on something else.
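The post-recovery test described above is just an automated health check against the restored system. A minimal sketch, assuming a hypothetical restored service that can be probed over HTTP; in a real pipeline this would run after the standby VM boots from the recovered image:

```python
import urllib.request

def service_healthy(url, timeout=5):
    """Return True if the restored service answers with HTTP 200,
    False on any connection error or non-200 response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, timeouts, refused connections
        return False

def verify_recovery(checks):
    """Run every post-recovery health check; the restore only
    'passes' if all of them succeed."""
    return all(check() for check in checks)
```

A recovery run might then be verified with something like `verify_recovery([lambda: service_healthy("http://standby-host/health")])`, where the URL is whatever endpoint your restored service actually exposes.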
Absolutely, so knowing that you’ve got devices where that’s required and other devices where it’s not, you can apply the appropriate policy, know that the backups are successful, and know that you’re ready when it comes time, when that phone call comes, which ultimately, it will come someday.
Or it’ll be an exception report that pops up.
Knowing that you’re ready and that you’re going to deliver, you know, cash in the insurance at that time, is really how you tie this whole thing together.
Do you find that it also helps people get past that first step of backup, where they’ll say, “Yeah, I’m getting a backup, but I haven’t tested recovery because I literally cannot set a maintenance window to test this on the actual hardware.” Do you find that people will do a recovery test on another machine, or maybe in a VM, to make sure–
That it’s working before they do the actual full-up production test?
Absolutely, you can get it 90% of the way there, which de-risks most of what you’re dealing with at that point, and then you can save that last 10% for a very special time when you’ve got the weekend to work.
So, with that, I think we’ve really covered the best practices. Let’s wrap it up.
We talked about 3-2-1, we talked about setting expectations with RPO and RTO, keeping it simple with policies, managing by exception, and then finally, recovery: making sure you can do that quickly in the case of the small interruptions, file loss, database losses, where the operating system is still running, and being able to test full system recovery for when that really bad thing ultimately happens to you.
So you can sleep at night with less cost, less complexity.
Hey, they’re back!
Here we are!
And I know, I do tend to take liberties talking about backup–
Yes, you do.
But really, recovery is really, really important because that’s what allows me to do creative things with data and not worry about blowing it up.
But it’s cloudy and we persisted, we dared to persist.
We did talk about logging a little bit. I think we’re still going to debate the difference between opinion and truth, but yeah, I mean, that’s one of those things that I think we tend to forget about, sort of that first mile of how do I make sure that I’m persisting my log data so that I can look at it later, or get metrics off of it, when so much of it is increasingly generated little bits here, little bits there, not just in the cloud but actually even on-premises with distributed applications.
Right, and when you’re talking about that first mile, that first mile is the foundation: knowing that the systems that you’ve got are stable and reliable. And that takes us back to the patching piece. Not just the emergency patching, you know, Spectre and Meltdown and things like that, but knowing that your entire environment is in compliance with what the organization feels is necessary, whether it’s internal applications or the regular external operating system patches, so you have the trust that everything is what you think it is, and you can find those outliers and deal with them.
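Finding those patch-compliance outliers is a set difference per machine: what the organization requires minus what the machine reports as installed. A sketch with hypothetical KB numbers and machine records, not any product’s real inventory format:

```python
# Hypothetical required-patch baseline per operating system.
REQUIRED = {"windows": {"KB4480116", "KB4480979"}}

def outliers(machines):
    """Return machines missing any required patch for their OS,
    mapped to the sorted list of missing patch IDs."""
    missing = {}
    for m in machines:
        gap = REQUIRED[m["os"]] - m["installed"]
        if gap:
            missing[m["name"]] = sorted(gap)
    return missing

fleet = [
    {"name": "dc01", "os": "windows", "installed": {"KB4480116", "KB4480979"}},
    {"name": "ws07", "os": "windows", "installed": {"KB4480116"}},
]
```

The report surfaces only `ws07`, the one machine out of compliance, which is the same manage-by-exception idea applied to patching.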
Improving reliability by applying a little bit of forethought, ahead of time, to assure it in the first place.
And that’s how you keep your company safe, leveraging the ARRR Pillars, right? Availability, reliability, resiliency, and recoverability. And we covered it well in this episode.
I think so, and of course, you’re famous for adding a million fantastic links below this episode. So you’re going to include some more, specifically coming out of AWS, right?
Yes, absolutely, we’ll have a link to AWS’ Well-Architected white paper, as well as Microsoft Azure’s reference architectures, and the NIST frameworks on risk management and cybersecurity.
Awesome, I think we got it?
Absolutely, for SolarWinds Lab, I’m Kong Yang.
I’m Patrick Hubbard.
I’m Chris Groot.
I’m Carrie Reber.
And I’m Leon Adato, thanks for watching.