SolarWinds Lab Episode 63: Pingdom, Loggly, and Digital Experience Monitoring

Once, web page load time was a good enough metric to ensure that you were delivering decent web performance and few trouble tickets. But now, several factors have collided, requiring metrics that not only pinpoint which component is performing poorly, but also identify global availability issues, platform and architecture limitations, and end-user problems. In this episode, SolarWinds Cloud guru Michael Yang joins Head Geek Patrick Hubbard to discuss how monitoring a mix of APIs, services, and websites requires a shift in attention to include all elements of an application.

Episode Transcript


Hello, welcome to SolarWinds Lab. I'm Michael Yang, and Patrick, wow! It's so much better to be back in the studio. I know, right? I mean, I really enjoyed our episode that we did at AWS re:Invent, but it was really loud, and I was afraid we were going to get run over by a forklift. Hey, you didn't have to wear that huge headset mic, but the feedback was great, and you really seem to enjoy the interest and the challenges of monitoring modern distributed application infrastructure, and application tracing. Yeah, and that's true, I mean, I've been spending most of my time now on cloud deployments and especially helping our customers that are sort of doing both, that are managing both enterprise and cloud. But what's interesting, based on the chat that we had last time, is how many of you, even if you're still mostly on premises, have if not outright distributed architectures, at least multi-element applications that rely really heavily on APIs. And tracing them really does turn out to be the only way to monitor them. That's right, and the classic application stack is beginning to become less and less monolithic. M-hmm [affirmative] And that's what we wanted to talk about today, digital experience monitoring. So, you want me to just go ahead and ask the question that I had the first time that I heard the acronym DEM? Sure, you're going to ask anyway. Okay, well, isn't digital experience monitoring just the same as web performance monitoring, or RUM, real user monitoring, something that's just hammering on a URL of a web server? Well, no. [Laughs] Let's break down the monitoring into internal and external observability, right? Internal observability is using metrics, logs, and traces for application and infrastructure monitoring. Right. And we have fantastic cloud products like AppOptics, Papertrail, and now Loggly for it. Right, it's detail-component dense. Absolutely, it's about optimizing your cloud application infrastructure using the right set of tools.
However, what's also really important is the external observability. What does the experience look like from an external perspective? The user perspective? Ahh, so not just if you're DevOps, but actually if you're regular operations, you need to know whether your websites are up or down or performing poorly before end customers do or, worse, before a business stakeholder actually picks up the phone and calls you. Little bit more than just up, down, right? Is it performing better or worse than yesterday? Responding to changes that you may have made in the infrastructure, or is it responding better from the US than it is from Asia or from Europe? Yeah, that's right. A DevOps or ops person needs to monitor both external and internal observability, and you also often feed external visibility and this type of information into internal observability for correlations. Right, and it's what people used to call end-user experience monitoring. So, how is that different, really, than digital experience monitoring? Well, there are two really key differences. First, it's not about just human or end-user experience of the website. Right. It's also about machine-to-machine or API-to-API integration. This is the point where you're trying to bait me with the acronym API, to say, is it part of the API economy? Ding, ding, ding, correct, and you just did it. Well, I was successful at it. Let me give you an example of one of the largest real estate management companies in the world. They use Salesforce's SaaS-based marketing solution, and they built an internal application that links to Salesforce Marketing Solutions via API-to-API integration. Sure. They need to know whether the Salesforce API is up and running properly and whether API performance is acceptable. Guess what solution they use to monitor Salesforce API integration and its performance? Pingdom. You're a genius, Patrick. That is absolutely correct.
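The two questions raised here about a third-party API (is it up, and is its performance acceptable) can be sketched as a small summary over periodic check results. This is only an illustration: the `Check` record, the `summarize` helper, the sample values, and the 500 ms limit are all assumptions and do not reflect how Pingdom stores or reports data.

```python
from dataclasses import dataclass


@dataclass
class Check:
    """One hypothetical probe of an external API endpoint."""
    up: bool            # did the API answer successfully?
    response_ms: float  # observed response time in milliseconds


def summarize(checks, max_ms):
    """Return (availability %, % of successful checks answered within max_ms)."""
    if not checks:
        raise ValueError("no checks to summarize")
    successful = [c for c in checks if c.up]
    availability = 100.0 * len(successful) / len(checks)
    if not successful:
        return availability, 0.0
    fast = sum(1 for c in successful if c.response_ms <= max_ms)
    return availability, 100.0 * fast / len(successful)


# Made-up sample: four probes, one failure, one slow response.
checks = [Check(True, 180), Check(True, 950), Check(False, 0), Check(True, 210)]
availability, within_limit = summarize(checks, max_ms=500)
print(f"availability {availability:.1f}%, within limit {within_limit:.1f}%")
```

Keeping availability and responsiveness as separate numbers mirrors the distinction in the conversation between "up or down" and "performing acceptably".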
They use Pingdom to monitor, and they also use Pingdom reports for SLA and yearly contract discussions: money, money. They like Salesforce, but they need an independent and leading solution that can provide unbiased data. Right, trust but verify, I get it. And there's something else about that too. It's not just when we're negotiating the contracts do we get a better deal? Or do we get a refund for poor performance? It's just ensuring that we get the SLA that we expect and that we're paying for. We don't want a discount, we want great performance from that vendor. Salesforce, we trust you, but we like to verify. The second item that's really important for digital experience monitoring is that website performance is more than just a tech issue. Business stakeholders like chief marketing officers really care about what website performance is like. Yeah, they really do, and they care about it maybe in a slightly different way, you know? Because if you're an administrator and you're focused on delivering that service, you're really thinking about feeds and speeds. And so the metrics that you're really thinking about, like website performance, capability, memory utilization, all of the elements of the infrastructure, are a little bit different, right? But if you're talking about the business owners, they're actually looking at things like conversion or engagement or all of that fluffy marketing stuff. Fluffy, fluffy. Right, because it's tied to their bonus. For example, our CMO, Darren, right? He cares if SolarWinds goes down, but that's not really likely, solarwinds.com is pretty reliable, right? But what is more likely is a performance issue, and that has a direct impact on Darren's bottom line, which is customer acquisition. Patrick, that is an awesome example. CMOs or business stakeholders that do online business, which nowadays is everyone, right? They really care about website performance insight.
There are a number of studies that directly correlate website performance with site visitors and customer acquisition. Customers, personally, myself, you know, you don't want to stick around a site that's too slow. I would rather go to some competitor's site that would give me fast performance. Right, and you'll see things like changes in upsell numbers based on how long they spend dwelling on the site. Or how many different additional items you might be able to present in a shopping cart in a short period of time while they're actually in that mood to be loading up the cart. Definitely. So the goal then is to correlate the metrics the business cares about, like site sessions and bounce rate, to the metrics that we typically keep an eye on, like responsiveness, page views, and page load time. You know, that's correct. And we'll use Pingdom as an example to give you some recommendations on how to use your tools to combine everything into a single view. Awesome, so this is going to be Pingdom Visitor Insights combined with DEM metrics. So let's do it this way. Let's talk a little bit more about DEM, what it is and what it isn't, and how customers like you are actually asking for new features and what your specific challenges around monitoring are. Especially monitoring dependencies, not just performance at the browser, and then we're going to get into demo. As long as I get to demo on the touchscreen, I'm good, Patrick, and hey, one other thing. We've had lots of questions about Loggly. What is it, and how is it different from Papertrail? Okay, so for Loggly, do I get to do that one? Because I really want to compare them side by side, because I'm a bit of a logging nerd. Logging nerd? You just called me a nerd. Nerd. So Patrick, let me show you. So what you're seeing is Pingdom Visitor Insights, and the first thing that you'll notice is how easy it is to see some of the key metrics that we talked about. Right. Around the user experience and business metrics.
Because these dashboards, you want to try to make sure that your dashboards are designed to be easy to read, not just for you in operations, because we can dig through huge dashboards all day long, we're great at that, but if you actually have a CMO who's looking at some of these metrics, they need to be easy for them to understand too. That's perfect, Patrick, yeah, that's exactly the point. So one other thing you'll notice is that, for example, some of the fluffy metrics we talked about, you see right in front of you, right? Active sessions, so on this particular website, which is Dat Host, which is a game hosting website. It's all about Dat Host. Dat Host. So you see on this website, you have the number of active sessions, right? M-hmm [affirmative] And that is 100 active sessions. And this is, we had 121 one hour ago, one day ago, one week ago, it's really simple and right at your fingertips. So for gaming, that would be a measure of engagement. Exactly, the number of people that are coming to the gaming website. Along with that, some of the information you see is what's the load time aggregate across the Dat Host website? Which is, right now, 6.32 seconds. Obviously, this is going to change throughout the day, and this is what the average looked like one hour ago, which is 3.42, 4.8 one day ago, and one week ago, right? And we're going to get into the details of that, but at this level, you might be monitoring, let's say, the top level page that's going to contain all of the JavaScript logic, images, everything else, so that would be like the full load time to get that initial experience. That is absolutely correct. Not only the front-end time but the network time and the back-end time as well. Right. So when it really takes three seconds, seven seconds, what is that really related to, right? Right. And some other metrics you'll see are things like bounce rate. So, obviously, if your website is really slow, people are going to come in and say, "You know what? 
This website's too slow, I'm going to bounce right out. I'm going to go to your competitor's website," right? So you can start to correlate some of the load time, active sessions, page views, and Apdex score with some of the business metrics that are associated with website performance. Okay. And one other thing you'll notice is that, again, as a person, as a business, you may have multiple websites, right? So you can add Dat Host, or you can add things like Pingdom.com or Pingdom too, so different websites associated with that. Okay, so then these, in this case, you're monitoring a bunch of live servers. Exactly, live servers and we're absolutely monitoring live servers, right? Yup. So this kind of tells you, this front dashboard tells you, do I have a problem with my website? And those problems, are they related, and how can I correlate those problems and performance metrics with some of the business metrics that we talked about. Okay. And see it in an easy-to-consume manner. Right. So then that conversation is now, you and the CMO or maybe you and someone else who's a manager over part of the business, so then the next question you're going to ask is, okay, now I need to figure out what is causing, let's say, a performance issue, then it kind of comes back to us in operations and detail. Patrick, that's perfect. So this tells you, first of all, do I have an issue on my website? M-hmm [affirmative] Or across multiple sites, and the next thing you want to know is where, who within my audience is having that issue. Right. And third is, what is my root cause behind that issue? Got it. So what we're about to see is, let's go to the next step, which is, who within my website audiences or visitors is having this issue? So let me take an example, I'll just click through this Dat Host website. Mm-hmm [affirmative] And what you see here is, again, some of the key information, like active sessions, load time, Apdex score, and bounce rate. Hey, look at this, world map.
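The Apdex score mentioned alongside load time and bounce rate has a standard definition that is easy to sketch. Assuming a target load time `t`, samples at or below `t` count as satisfied, samples up to `4 * t` count as tolerating (at half weight), and everything slower counts as frustrated. The sample values and the 3-second target below are made up for illustration; the transcript does not say which target the dashboard uses.

```python
def apdex(load_times, t):
    """Standard Apdex: (satisfied + tolerating / 2) / total samples."""
    if not load_times:
        raise ValueError("need at least one sample")
    satisfied = sum(1 for x in load_times if x <= t)
    tolerating = sum(1 for x in load_times if t < x <= 4 * t)
    return (satisfied + tolerating / 2) / len(load_times)


# Hypothetical page load times in seconds, scored against a 3-second target.
samples = [1.2, 2.8, 3.4, 6.3, 14.0]
print(apdex(samples, t=3.0))
```

Because the score is a ratio between 0 and 1, it can sit next to bounce rate on a dashboard without the reader needing to know what a "good" load time is in seconds.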
And that looks like light-speed time to me, right there. Yes, so you see here, our users in Sweden and Nordic countries and UK, they're green, but maybe the people in the US is not having a great experience for Dat Host website, and we want to really drill into that, right? So what we can do at this point is obviously, you can filter by the country, so you can go and what the active sessions, the load time looks like across the US, Brazil, Germany. You can also look at the platform information, like what is the experience looks like on desktop, phone, and tablet? Obviously, if someone's coming from and checking out Dat Host website from a bus that you're on, on a 3G network, that's going to be really slow, right? Right. So we want to filter those out there as well, or be able to see some of those correlations associated with that, not just country, but where, what devices, and so on and so forth. Well, especially if you're trying to make decisions about investment in the primary interface for an application, right? Yeah. So we've sort of hit that point, according to a lot of surveys, where we're now 50/50 or maybe even leading now mobile over desktop, so you're deciding as a part of your application design, what you're promoting is whether people are having that mobile experience that application or they're going from their regular desktop, so you're making decisions about investment based on responsiveness or the experience that users are having by platform, right? Absolutely. It's not just, hey, this thing responds in so many seconds, so that's all I get. Exactly, so that information, like where, like who's having that audience and who's having that problem, right? Right. This feature, I really like, check this out, this is really cool. So what I can do at this point is, as a user I can go in and edit and say, you know what? If the page load time is anywhere between zero to three seconds, we'll say green, but let's say my threshold is a little higher. 
So let's say if the page load time is zero to five seconds we'll say green, right? So what you're going to see here is once I save this, the graphics around this are going to change. So what you would see here is that, for example, we had a US region that was yellow, and because we made our threshold a little higher, where we say the definition of green is anywhere from zero to five seconds, now you see certain regions becoming green again. And you're making that as an informed decision, based on satisfaction for users as indicated by other metrics, like in the case of engagement, for example, you can say, "You know, it is performing a little more slowly, just because of latency across the undersea cables, right?" But if your other business metrics are telling you that you're getting great engagement, you can make that decision and say, "You know what? That's green. That's okay. That's acceptable status." Yeah, maybe you're hosting in many data centers in the US, so obviously, people coming in from the US will be a little faster than the people that are coming from Europe or vice versa, right? M-hmm [affirmative] And what you can see here is that, throughout, you can see that information through top countries by session, top platforms, so for those users that are coming in, are they mostly coming through desktop, phone, or tablet? Obviously here, we're in a mobile world, so you're seeing a lot more phones and tablets. M-hmm [affirmative] Coming to the Dat Host website, and you can see different sessions, the active sessions that we have and where the users are coming from. So obviously, with the desktop, you'll tend to have longer sessions associated with that, so you see information like desktop has higher engagement at this point. Right. Right? Different browsers, I don't know if anybody uses browsers outside Chrome, but obviously, in this data, people do. Opera, when was the last time you saw Opera browser, huh?
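The threshold edit described above, where green is widened from zero to three seconds to zero to five seconds, can be sketched as a small classification function. The green boundaries come from the conversation; the yellow upper bound of 8 seconds, the function name, and the 4.2-second US load time are assumptions made up for the example.

```python
def status_color(load_time_s, green_max=3.0, yellow_max=8.0):
    """Map a region's average load time (seconds) to a map color.

    green_max mirrors the editable threshold discussed in the demo;
    yellow_max is a hypothetical second boundary.
    """
    if load_time_s <= green_max:
        return "green"
    if load_time_s <= yellow_max:
        return "yellow"
    return "red"


us_load_time = 4.2  # hypothetical average for one region

print(status_color(us_load_time))                 # default 0-3 s green band
print(status_color(us_load_time, green_max=5.0))  # widened 0-5 s green band
```

The same measurement changes color only because the definition of acceptable changed, which is why the conversation stresses making that change as an informed decision backed by engagement data.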
[Both laugh] And you see, obviously, Chrome, Firefox, and Apdex, and you see different load time distribution by page views. And this feature, I really like. So when you start to look at page load time for Dat Host or Pingdom.com or your website, you see that those load times are aggregated across all the pages on your website, right? Right. But what this shows you is that you can break those down by each individual page, right? And what this provides is, obviously, your homepage, dathost.com, which shows the most page views, has a certain average load time associated with that, and what we do is we combine these numbers and say, "Hey, total load time for this default homepage is one day, two hours, 45 minutes," so you combine the page view numbers with the average load time associated with that. So that's total number of views, if it was one person, how long would they be sitting to load all of those? Exactly. Okay. Exactly right. So total page views, so what you want to do here is that you want to optimize your particular website and the pages associated with your website, but obviously, you want to focus your effort, for example, on your homepage, which doesn't surprise you. This is where the most users are coming, and if the average load time is really high, you want to optimize that. Even if it's faster compared to your other pages, this is a page you want to really optimize because this is where a lot of your users are coming in. Right, many of our viewers, actually, probably have multiple platforms or technologies that are actually composite components of the overall website, right? So you're trying to break out the feels about a domain because you think, primary, top-level domain, blah, blah, blah, and then the name, but really, there's, like in the case of SolarWinds, there's SolarWinds Cloud, there's Portal, there's... Product pages. Product pages, all the rest of it, support pages.
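The "total load time" figure for a page reads as page views combined with average load time: if one person sat through every view, how long would they wait? A minimal sketch of that arithmetic, and of using it to rank pages for optimization, follows. The paths, view counts, and load times are invented for illustration, and multiplying views by the average is an interpretation of the description above rather than a documented formula.

```python
from datetime import timedelta

# Hypothetical per-page stats: path -> (page views, average load time in seconds)
pages = {
    "/": (12000, 4.1),
    "/pricing": (900, 6.5),
    "/blog": (2500, 2.0),
}


def total_load_seconds(views, avg_load_s):
    """Cumulative time all visitors spent waiting for this page to load."""
    return views * avg_load_s


# Rank pages by cumulative waiting time, largest first.
ranked = sorted(pages, key=lambda p: total_load_seconds(*pages[p]), reverse=True)

for path in ranked:
    total = total_load_seconds(*pages[path])
    print(path, timedelta(seconds=round(total)))
```

In this made-up data the homepage ranks first even though `/pricing` is slower per view, which matches the point that the busiest page is usually where optimization pays off most.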
And so this is to untangle the nebulous, how do people feel. Like if someone says, "How do people feel "about our website?" Well, this is to say, I can tell you, based on the subcomponents of that website, what the metrics are telling us are actually great performing areas of our website or ones where we need to invest, as opposed to the way it normally works, which is, eh, people say the website's slow. Well, that's almost like, the network's down. Yeah. No. Where? Where? I can't do anything. What do I need to focus on optimizing? Right? And another great example is you can add and you can group pages, as an example, and aggregate those information here. So I'll give you an example. Let's say, I'm sure many of you guys are running or managing eCommerce site. So as an eCommerce site, maybe you want to group all the product pages together. Right. And look at the number of page views that's coming into you, like all the product pages, as well as average load time associated with that. Okay. So you can slice and dice very easily, any which way you want to do it, right? Grouped as product pages, or if you want to group certain pages that's very important for you. Maybe you want to group blogs, right? All the blog pages. Right. So you can, again, slice and dice. Or for some businesses, it might actually be by business unit. That's a great example. So then you'd recommend doing custom dashboards by group for those users? Yup. That you could actually have different executives in different areas of the business with their own dashboard? Yeah, exactly, and the different ways that you can prioritize it and their different subcomponents of your website that different personas want to focus on, you could provide those customizations for you for Pingdom. Okay. Then, obviously, scrolling further down, you have the load time, again, load time associated with different countries and different platforms, very easy-to-use metrics and visualization that you can use as well. 
And obviously you have different page views and the active sessions, again, the thing that I wanted to highlight here is the different ways to slice and dice information. And one other thing, you know, we've done a lot of user engagement and customer interviews to come up with this great digital experience monitoring. One of the features that I like a lot is the fact that, see that green line here, in terms of pages views associated with it, and this gray line here is what that was looks like exactly 24 hours ago. So you can start to compare in terms of what this page view looks like now, versus 24 hours ago, so you can get a different perspective. Now, can you just do it against 24 hours ago? No, you can do things like you want to see the page views across, let's say, seven days, right? And once you start to see this information seven days ago is what did that look like exactly a week ago, right? Right. So what the blue line sees is your timeline for the last seven days, and what your gray line shows is what that information looked like seven days ago, right? Got it. So it shows you different correlation and you can do that if you go back to 30 days, so 30 days now versus 30 days ago, right? So it's a great way to correlate in terms of the page views, the active sessions, and your performance information on your website, against what you're getting now versus what you were getting before. It also makes it easier to answer, or make curiosity self-serve, right? So if you have someone especially, maybe they're part of the marketing team and they want to know, "How are we doing? What's the trend?" It lets them go in and actually interactively explore that on their own without saying, "Hey, would you generate me yet another report?" Yeah, it's sort of like, you know, "Website seems really slow today." Right? You're like, what is that perspective of? What did the website performance look like 24 hours ago? Right. Or even seven days ago? 
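The "now versus 24 hours ago" or "now versus seven days ago" comparison boils down to a percent change between a current value and the value from an earlier period. A minimal sketch, using the active-session and load-time figures quoted earlier in the demo (100 sessions now versus 121 an hour ago, 6.32 seconds now versus 3.42 an hour ago); the function name is an assumption.

```python
def percent_change(current, previous):
    """Percent change of the current value relative to the earlier period."""
    if previous == 0:
        raise ValueError("previous value must be non-zero")
    return 100.0 * (current - previous) / previous


# Figures quoted in the demo: current value vs. one hour ago.
sessions_change = percent_change(100, 121)
load_time_change = percent_change(6.32, 3.42)

print(f"active sessions: {sessions_change:+.1f}%")
print(f"load time:       {load_time_change:+.1f}%")
```

Seeing both changes side by side is the kind of self-service answer the conversation describes: sessions dipped while load time rose, which is a prompt to look further before anyone opens a ticket.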
And then they go answer that question, but they never open a ticket, they don't ever send you an email. It just seems slow, right? But they can actually go and say, "I look back seven days ago, "I look back last month, it wasn't spiking like this. This is just so strange," they say out loud. And the person next to them turns and says, "Well, remember, we did that promotion "that drove all the traffic?" "Oh, yes, actually, historically, "this has been performing great." But it lets them figure that out on their own, without going to ops, without going to operations and saying, "We want to do some reporting." Do your own reporting. Yeah. You don't need me to do that. Come here, I'll give you the link, it's easy, anybody can do it, even my mother can do it. [Both laugh] I need to deliver these services. I don't need to get in the business of writing their own reports, it makes it self-service for them. Okay, great, so what you're seeing here again, just to, kind of, walk through the process again, one, do I have an issue on my website? Who's having this issue? This is exactly where we're seeing it, and what's the root cause? So let's look at the root cause. So what I do here is I go to the top, I click on the performance tab, and this basically tells you the basic root cause. So when people say, "Well, the load time is..." Page load time on average is 2.11 seconds on my website, right? 'Kay. The thing that's important is that what component of the page load time is related to the front end or the back end? Sure. The front end meaning if someone's looking at the website from a bus on a 3G mobile device, yeah, it's going to be slow versus someone that's on Google's gigabit network, right? And they're using desktop, right? Right. 
So this shows you, yeah, page load time is 2.09 seconds, but the back end component of it, which is the time to first byte, that's the server and application, the back end presenting that information to the front end, only took 0.39 seconds. The majority of the time, 2.90 seconds, was spent on the client front end. Right. So it's really the browser processing different JavaScripts and the different CSS loads associated with that, so you, as DevOps or ops, can say, "Yeah, you know, the page loads," in this example, the website is loading really fast, but let's say it was eight seconds. Right. Website seems really slow. I'm looking at this metric and you can go back and say, "Hey, on the server end, it really took less than a second." It's really the client processing it that's taking a long time. That could be related to different components, again, it could be someone that's riding a bus on a rural road, and they're on a 3G mobile device. Well, and also, it lets you look at overall experience by complexity of the application, right? Because I can come into these slices and immediately break this out, right? So time to first byte, that's basically TCP connect time, right? Think time, that's how much time I'm actually doing processing on the back end. Load time, how long does it take to get it, but client processing time? If I'm super-heavy client-side JavaScript, I've got a really rich interface and it takes a long time for that mobile device to process it.
If the experience, if their association or their assessment is, "This thing is slow," well, that might tell me I would rather trade more back end processing to simplify the data that's actually being sent out, or I need to slim down my application or consider maybe a different framework, because that is a tax that's being placed on the user experience that's really not anybody's fault, it's just the devices that are distributed in the field for whatever reason are spending a lot of time thinking about what I'm sending them. Exactly, and this is why end user or digital experience monitoring is really important. Like starting to break that down and say, "Okay, the user's facing a really slow website, and people are saying that our website is slow," right? Break that down, like who within my audience is having that issue? Right. And if you know who is typically having that issue, what is it related to? Is it the client's front end, is it the back end, or is it the network? Right. So it gives you quick information associated with that. It's experience, not time. Yeah, just an experience, right? And not only that, you can go further down. So you can go and say, "Hey, let me break this down further." And it's like, is it, again, the front end side, was it the network side, or is it the back end side, right? Nowadays, we see discussions around net neutrality, maybe people start to throttle things down on the network end, so maybe-- That's brand new, that's never been going on. [Michael laughs] So maybe this is related to the network, right? But obviously, a lot of the time, for this particular example, most of the processing time within the page load time is associated with the client front end. And we can break this down by rendering time, DOM, and you can get different breakdowns, right? Even on the back end, what's the back end time related to, sending or receiving requests? Right. And you can even break that down as well.
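The root-cause split discussed above, how much of a page load is back end versus everything after the first byte, can be sketched with two numbers: total page load time and time to first byte. The sketch below uses the 2.09-second load time and 0.39-second time to first byte quoted in the demo. It makes a simplifying assumption that everything after the first byte counts as front end (download, parsing, scripts, rendering); the real dashboard separates network time as well, and the function name and dictionary keys are hypothetical.

```python
def split_load_time(total_s, ttfb_s):
    """Split a page load into back-end time (TTFB) and the remainder.

    Assumption: all time after the first byte is attributed to the
    client front end, which lumps network transfer in with rendering.
    """
    if not 0 <= ttfb_s <= total_s or total_s == 0:
        raise ValueError("expected 0 <= ttfb_s <= total_s and total_s > 0")
    front_end_s = total_s - ttfb_s
    return {
        "back_end_s": ttfb_s,
        "front_end_s": front_end_s,
        "front_end_share": front_end_s / total_s,
    }


breakdown = split_load_time(total_s=2.09, ttfb_s=0.39)
print(f"back end:  {breakdown['back_end_s']:.2f} s")
print(f"front end: {breakdown['front_end_s']:.2f} s "
      f"({breakdown['front_end_share']:.0%} of load time)")
```

A high front-end share points toward client-side weight (scripts, CSS, device capability) rather than server capacity, which is the decision the conversation is describing.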
So it gives you information from high level, right? To the page load time into the detailed breakdown of the root cause behind your website. All right, so this is, I don't want to say this is RUM-plus, but there are elements of RUM here in terms of being able to actually have metrics that go all the way out to the browser and back, but it also includes the business metrics. Now, I think, for a lot of us when we have been able to get those metrics, what we have to do is customize applications or do code injection or something else, so how is this different? What do I have to do to my applications to get them instrumented like this? That's a great question. Again, going back to what you said, this is visitor insight or digital experience monitoring, is all about providing that real user experience monitoring, but what's really unique and different about it is the fact that we can add the business metrics associated with that. Right. Things like bounce rate, things like session. So if I'm having a problem with my website and try to correlate whether or not that website performance has real direct impact on the business side, we can actually show that, right? And the thing is that we can provide that information in really simple, intuitive view while, like I said, anybody can come in, even the business stakeholder can then see that. Right. Now, how do we make this really happen in terms of the installation process? Great question, Patrick, it's really simple. It's a one-line JavaScript injection, and I can show you what that looks like. That'd be great, because you know I'm always going to go to code. Yeah. So where you add that is, again, go to the very first page, right? And we showed, in terms of different websites that we're monitoring. M-hmm [affirmative] And what you do here is add website, and you type in your URL, so let's say maybe we want to monitor appoptics.com. There you go, monitor the monitor. Yeah, monitor the monitor, right? 
And what you do is here, all you need to do is inject this JavaScript injection, right? Send a code and instruction to my login email and that's all we have to do. Okay, what I really like is being able to, if I want to cut and paste this, this is great, but a lot of times, you're going to send this to a developer, or maybe it's someone who is a developer and they are monitoring a particular application, so you would've given them a login anyway, so they're logged in as themselves. By checking that box, it's going to send the snippet, but also all the instructions that they need to them as an email, so if they need to send it to somebody else, if they want to throw it up in Slack, if they want to put it somewhere where someone else on the team can get it, it makes it easy to distribute that. Absolutely, it's an easy way to send that information, and if you need to send it to your admins or your servers that's not you, it's really easy. Right, well, I'm always going to want to make that part of my build process, I'm also going to want to test that, right? Because you're a developer at heart. At heart, but the metrics that are available here aren't just on the dashboard, right? I mean, you've got an API that you can get at all these as well, right? Yeah, if you want to have a real easy way to use the API component of the Pingdom, yeah, we have that available, we have public APIs to do exactly that. Because, in my mind, the way that you do this is you're going to add this, the development team, obviously, is going to be involved, if not leading, adding the RUM and code into the application, right? Exactly. So if, as a part, if I have as a part of my test suite that I've built for that app, I have an automated test that actually goes to look based on the content for this monitor. It actually goes to look, am I getting data back off of my dev side, for example. 
Or you're doing blue-green deployment or limited functionality, where some user groups are getting it and some of them wouldn't, you'd be able to programmatically test to see that you were getting metrics back as you would expect for those users as well. So not just, am I getting metrics? But as the developer who's responsible for injecting this code to make sure that I've got that rich metrics dev, I'm getting the business-level metrics, that they're always going to be in there every time I deploy, and so then if I'm the ops person, I don't have people run down the hall and say, "Hey, I need you to go in and modify a source." No, no, no, you gave me the deployment package. I don't want to modify that, that's up to you. You check in the change and then automatically see it be deployed. No, exactly. And this is what's great about Pingdom Visitor Insight is the fact that with that one line JavaScript injection, you can send it to anybody or you can do it yourself, and immediately thereafter that, you could start to see some of the key performance insight that you were looking for, and you can start to share with someone like CMOs, right? And say, "Hey, is our website--does it seem slow? Is it anybody within a particular region or country, or counties, that's having a problem?" And how do I correlate those business metrics like active sessions or bounce rate with the performance insights, like page load time, right? And slice and dice that information across different pages. Right. And different categories within your website, right? Okay, you know me, this is great, and I love the idea of being able to put up dashboards for the CMO, and then I'm going to be able to use them to actually make a difference, but I like to actually experiment first to decide whether or not it's helpful and sort of how it works. So for our audience, you guys are telling us increasingly that this is a pretty big challenge, right? 
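The idea of making the monitoring snippet part of the build and testing for it on every deploy can be sketched without touching any vendor API: parse the rendered page and assert that a script tag pointing at the monitoring script is present. Everything specific below is hypothetical, including the `rum.example.com` host, the script name, and the sample HTML; the real snippet and its URL come from the setup screen described above.

```python
from html.parser import HTMLParser


class ScriptSrcCollector(HTMLParser):
    """Collect the src attribute of every <script> tag in a page."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def has_monitoring_snippet(html, marker):
    """Return True if any script src in the page contains the marker string."""
    collector = ScriptSrcCollector()
    collector.feed(html)
    return any(marker in src for src in collector.sources)


# Hypothetical rendered page from a dev or blue/green environment.
page = """
<html><head>
  <script src="/static/app.js"></script>
  <script src="//rum.example.com/monitor.js" async></script>
</head><body>Hello</body></html>
"""

print(has_monitoring_snippet(page, marker="rum.example.com/monitor.js"))
```

A check like this only proves the snippet shipped; confirming that metrics actually arrive would need a query against the monitoring service's own API, which is not sketched here.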
You might be an all-in web application shop that spends most of your time in operations deploying and delivering web applications, or maybe just a little bit of what you do, but I want them to have a chance to experiment with this more than just a couple of weeks. So how can they try this out and see if it's helpful and also learn about how digital experience monitoring works? Yeah, actually it's simple. So we created a special URL for the audiences, and what you can do is go to www.pingdom.com/solarwindslab. That's right, so pingdom.com, there is actually a SolarWinds Lab URL at the end of that. So go by there, it'll have all the details on how to do it, but the cool thing is, they're actually going to get 90 days. 90 days. To experiment with it. And we're doing that especially for this episode, because we really want to know how this works for you, because you're telling us, certainly at re:Invent and the chat that we had on the last episode that we did, that this is increasingly a problem, even if you think that you're otherwise entirely on prem, this is actually a challenge that the most classic enterprise-y enterprise customers are beginning to have, so I would love to get your feedback on this and to talk a little bit more about the challenges that you're having to continue that dialog. Yeah, and I just wanted to highlight that it's not a subset or a limited functionality, it's a full functionality for 90 days, and at the end of the day, we want you to be successful and try it. Awesome. Try it like what we talked about. So that is digital experience monitoring. Well, you know what I'm going to want to talk about next right? Logging. Exactly. 
Logging is maybe not a central preoccupation of most of our audience, but it doesn't matter whether they have monolithic infrastructures that are running on a small set of servers in their data center, or they are all out on the cloud or something in between, or they are citizens of the API ecosystem and the API economy, there's a ton of monitoring that just has to be done. You know, you brush your teeth every day, you need to be monitoring all of your systems in a way that you can actually capture events that will let you do troubleshooting, especially after the fact. It is increasingly complicated, and I think it's a question that you guys have been asking a lot, which is why would we have two different logging products that are part of our cloud portfolio? Is that just for cloud, does it work on premises? What's the point? And so I wanted to take a chance and take a little bit of time and walk through that, so what is the difference between Papertrail and Loggly? And how does that fit in with the other challenges of grabbing events in a distributed fashion? That's a great question, Patrick, and I will go a step further, so let's take another step back, all right? When we say logging solutions, SolarWinds has a fantastic logging solution that's based on-prem, so this is where I would start. So if you have a logging solution where you want to store those logs on premises, we have a fantastic set of products that does that. Now, if you want to have a logging solution that's SaaS-based, where it's okay for you to aggregate those logs-- M-hmm [affirmative] And send those logs and have them in a simple, easy-to-monitor SaaS environment-- Right. Then we have solutions for that as well. So for those solutions, we have Papertrail and Loggly. Now, what's the difference between Loggly and Papertrail? Papertrail is a SaaS-based log aggregation solution. 
M-hmm [affirmative] Meaning, if you have multiple places where you are generating logs, whether in servers or applications, and you want to see it in a single place-- M-hmm [affirmative] Get it up and running in minutes, and be able to do live tailing or troubleshooting use cases, right? A lot of times, how do people use logs? People use logs if something goes wrong, right? And if there is a troubleshooting use case, you don't want to log into three, four, eight different places to look at logs, right? You want to see all those logs in a single place where it's easy to look at the logs associated with whatever troubleshooting you want to do, and that's what Papertrail is great at. Well, I don't want to say that it makes me lazy, but it makes me lazy, because the challenge with log aggregation has typically been you know that you need to capture events that are coming out of your systems, you just do, and then you've got to figure out where you're going to send that, so then that gets turned into an infrastructure question and I'm going to have to build something that is my aggregator and now that's effectively big data if I start sending enough to it that it's actually doing something useful, and I've got to maintain that and I've built one more thing, and then sooner or later, you just get to where you say, "You know what? I'll come back to those logs and I'll get them later." That spark of that moment when you said "I should log this" is lost because you're thinking about the weight, the tax that's going to be incurred to actually go build a system that's going to let you do that. 
And so what I've come to discover is that SaaS-based logging lets me have that moment where I have the spark and I say, "I need to log this," especially if it's a Docker container or something else that's going to be horizontally scaled or it's going to come up and down based on an orchestrator and I'm not going to have time to dig into that thing in flight and figure out where to apply logging. I go ahead and set my configuration for that to send that off to Papertrail and it just magically appears and then I can go back to it and I can search and I can basically tail through that big aggregated log, I can search for individual systems or sorts of events and figure it out later, but it means that I start capturing that data immediately, when I was not only in the mood but really had the opportunity to invest-- That you really had to do it. Mm-hmm [affirmative] Sort of takes all the pain away. Exactly, exactly, and where Loggly comes into place is when you need something more than just live tailing for a troubleshooting use case, you know, something went wrong in a single case, and you need something more, meaning you need to look at those logs and do some analytics associated with that. Let me look at the logs over a six-month or one-year period, and let me look at Apache or Nginx logs and the different error codes that are associated with that. So you're talking about actually extracting data from the log data themselves, not just log frequency, but actually the JSON that's part of a metric that's coming back as a part of a logged event, actually being able to chart on that. Absolutely, I'm glad that you brought up JSON, right? So let's look at it, people are looking at logs, and a lot of times it traditionally has been syslogs, right? 
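As a rough illustration of "set my configuration to send that off" to a hosted aggregator, here is a minimal sketch using Python's standard `logging.handlers.SysLogHandler`. The host and port are placeholders: a hosted service such as Papertrail gives you its own destination, and this sketch assumes plain UDP syslog, which may not match the transport your account is configured for.

```python
import logging
import logging.handlers


def build_remote_logger(name, host, port):
    """Return a logger that forwards records to a remote syslog endpoint.

    host and port are placeholders supplied by the caller; nothing here
    is specific to any one log aggregation service.
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # SysLogHandler sends over UDP by default when given a (host, port) tuple.
    handler = logging.handlers.SysLogHandler(address=(host, port))
    handler.setFormatter(
        logging.Formatter("%(name)s: %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    # 127.0.0.1:514 stands in for the destination your provider assigns.
    log = build_remote_logger("checkout-service", "127.0.0.1", 514)
    log.info("container started")
```

Because the handler is attached in code, the same configuration travels with a container image, which is the point made above about short-lived, orchestrated workloads.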
M-hmm [affirmative] It's a simple way to look at logs, but a lot of people are moving towards more structured logs, like using the JSON format, using that format to do much easier filtering, slicing and dicing of data. Of course, in order to start doing that, you need to be able to parse the logs as they come in. Right. And once you parse that, you can see it in a nice graphical view, in terms of different logs, and start doing analytics associated with that, and this is what Loggly is really fantastic at. Okay, so let's do this real quick. I want to just briefly show Papertrail, and then let's go to Loggly and kind of compare the difference between that live tail aggregation, put it all in one place, and then taking that next step up to be able to generate metrics and dashboards off of the data that's contained in those logs. Yeah, let's do it. Okay. So most of the time, when I'm in Papertrail, I'm using the event view, which is what we're looking at right here, so this is a live tail of all of my aggregated logs, right? Now, when you're in production, of course, this just streams by and you can't really see that much, because if you have millions of events that are coming in per day, that's going to go by pretty quickly, so it makes it easy to go in and I'll just slice it by maybe a component of an application or maybe drill it down to a particular device or an IP address or, a lot of times what I'll end up doing is saving a query that will pull just the data that I need, like maybe out of my AWS infrastructure, for example, so that I'm only looking at those logs. So it effectively lets me tail something that's running somewhere and I don't need to worry about where it is, it's going to be coming up here and then I can search it out after the fact. Absolutely. 
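To make the structured-logging point concrete, here is a small sketch of emitting JSON log lines with Python's standard `logging` module and then filtering them by field. The field names (`status`, `region`) and the `fields` attribute are illustrative assumptions, not a schema that Loggly or any other product requires.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Extra structured fields are passed via logging's `extra=` argument
        # under the (hypothetical) key "fields".
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def filter_events(lines, **criteria):
    """Return the parsed events whose fields match every given criterion."""
    events = (json.loads(line) for line in lines)
    return [
        event for event in events
        if all(event.get(key) == value for key, value in criteria.items())
    ]


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    log = logging.getLogger("shop")
    log.addHandler(handler)
    log.setLevel(logging.INFO)
    log.error("payment failed", extra={"fields": {"status": 502, "region": "eu"}})
```

Once every event is a JSON object, "slicing and dicing" becomes a key lookup, for example `filter_events(lines, status=502)`, instead of a regular expression per log format.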
But what I really like about it, as I mentioned before, is what got me to this was the first time I went and added a system, and I recommend everybody, there's a free tier here you should definitely check out, but adding a system is really straightforward. You're going to click Add Systems, so for example, here this is just regular OS logs off of Linux, all I'm going to do is just cut and paste this script directly into my environment, or more likely, you're going to have Chef or Puppet or some other automation tool actually apply that. In the case of an application log file, actually in this case, it's a remote syslog. It's going to give you the setup for that as well. Yup. So I don't need to figure it out, and in this case, if I want to do custom config files, there's some other options that I could do as well, so it makes it really easy to get from, "Hey, I wish I had data that was coming from that environment" to a dashboard where I can actually see all of my events. Exactly, and it's a SaaS solution, right? So you don't have to set anything up, install anything, all you have to do is sign up, go to your configuration and start pointing syslog and there you go, and this is what you're going to get. Right, so I've been playing with this for a long time and actually, I showed, at THWACKcamp this year, you guys saw that I was monitoring Orion with this, doing alerts on it, so I was basically exporting a heartbeat and if that heartbeat stopped, then it sent the alert, if it said I hadn't seen this heartbeat in five minutes, so that I could know a remote instance of the Orion platform server was running out in AWS, so it ends up solving a lot of edge cases, so it's a little bit of a Swiss Army knife for me, but Loggly's a little bit different, right? Yeah, let's take a look at Loggly, right? While Papertrail was about looking at the syslog and live tailing and seeing all the logs immediately, right? 
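The heartbeat alert described above, firing when no heartbeat has been seen for five minutes, boils down to a staleness check. The sketch below shows that logic on its own, assuming events arrive as `(timestamp, message)` pairs and that heartbeat messages contain the word `heartbeat`; both are assumptions for illustration, and a hosted log service would normally evaluate this kind of rule for you.

```python
from datetime import datetime, timedelta


def latest_heartbeat(events, marker="heartbeat"):
    """Return the newest timestamp among events whose message has the marker.

    events is an iterable of (timestamp, message) pairs; returns None when
    no heartbeat event is present at all.
    """
    stamps = [stamp for stamp, message in events if marker in message]
    return max(stamps) if stamps else None


def heartbeat_is_stale(last_seen, now, max_silence=timedelta(minutes=5)):
    """True when the heartbeat is missing or older than max_silence."""
    if last_seen is None:
        return True
    return now - last_seen > max_silence


if __name__ == "__main__":
    now = datetime(2018, 3, 1, 12, 0, 0)
    events = [
        (now - timedelta(minutes=9), "orion heartbeat ok"),
        (now - timedelta(minutes=7), "orion heartbeat ok"),
        (now - timedelta(minutes=1), "disk usage 71%"),
    ]
    if heartbeat_is_stale(latest_heartbeat(events), now):
        print("ALERT: no heartbeat in the last five minutes")
```

Alerting on the absence of an expected event, rather than the presence of an error, is what lets this catch a remote instance that has stopped entirely.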
So Loggly is a bit different, so let's say you're sending a log type of Nginx, and the great thing about Loggly is it can detect that you're actually sending this Nginx log type and it starts to parse that information. As the logs get ingested, we do auto parsing on it, and what you can do is you can start filtering information automatically and you can start to get these dashboards out of the box, so if you're sending an Nginx log type, once that's detected, it will build this out-of-the-box dashboard for you. So you may be looking at Nginx by different request method. Is it a GET or POST or HEAD type of log for Nginx? Or is it by Nginx statuses? Is it 200, 301, 302, 404? The great thing about it is you don't need to do anything. It's smart enough to know you're sending Nginx log types, and we're providing a dashboard and information for you. And if you want to look at this information, like I said, across 24 hours, across a month, six months, a year? You can get all that information through Loggly. I don't have to export it, pull it into a CSV file then process it? Yeah, exactly. Okay, but don't underestimate the relative complexity of trying to untangle the messages that are coming in for logs, right? Yeah. Because I made it look really simple before in the Papertrail example, where I basically said, "Hey, we're just going to go get our remote syslog," right? Yeah. Or maybe if I'm using systemd and systemctl and I'm using journalctl to get my logs, I'll just go ahead and apply the exporter for that, but trying to figure out what's coming out of an aggregated log is a trick, because I don't want to have to go tell every single one of these feeds, or in the dashboard that we saw a minute ago, that that's actually Nginx, right? Yup. So when you look at some of these dashboards, and I'll pick one here that's a little bit different. Okay, this one is actually looking at a lot of things. 
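The breakdown by request method and status code described above can be approximated by hand to show what automatic parsing saves you. This sketch assumes access-log lines in the common "combined" layout, where the quoted request is followed by the status code; it is an illustration of the idea, not how Loggly parses logs internally, and the sample lines are made up.

```python
import re
from collections import Counter

# Matches the quoted request and the status code that follows it, e.g.
#   "GET /index.html HTTP/1.1" 200
REQUEST_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) '
)


def summarize(lines):
    """Count requests per HTTP method and per status code."""
    methods, statuses = Counter(), Counter()
    for line in lines:
        match = REQUEST_RE.search(line)
        if match is None:
            continue  # skip lines that are not in the expected layout
        methods[match["method"]] += 1
        statuses[match["status"]] += 1
    return methods, statuses


if __name__ == "__main__":
    sample = [
        '203.0.113.7 - - [10/Oct/2018:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 612 "-" "curl/7.58.0"',
        '203.0.113.8 - - [10/Oct/2018:13:55:37 +0000] "POST /login HTTP/1.1" 404 153 "-" "curl/7.58.0"',
        '203.0.113.9 - - [10/Oct/2018:13:55:38 +0000] "HEAD / HTTP/1.1" 301 0 "-" "curl/7.58.0"',
    ]
    methods, statuses = summarize(sample)
    print(dict(methods))
    print(dict(statuses))
```

Every additional log format needs its own pattern like `REQUEST_RE`, which is the untangling work the speakers go on to discuss.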
So we're looking at things like back-enders that are coming off the systems, whether it's clock drift or bad time stamps or we're looking at anomaly detection, big, huge help in going through and finding literally those needles in the haystack-- Huge. A lot of log use cases. Right, traffic volume or I want to break out my production errors that are running in AWS. I've actually built a dashboard that represents a large portion of my operation, or something that I actually care about, so the question then is how did I tell it what these data elements were? And so, to your point, what Loggly is doing here that's a little bit different than what Papertrail is doing is it actually is aware, so when you look at the integrations that I can add, they are much more diverse. So not only multiple services that may be part of AWS, but also stacks and components, whether it's Docker or it's whole stacks or it's languages. Or JSON, right? Structured logs. Or it's JSON, it makes it really easy for me to add those. It also makes it a little bit easier to go beyond basic search, where I would normally go in and search for a list of individual elements in that tail that I'd be looking for, and actually go in and look for data coming out of that log that's broken out into those charts, where I can go and look at the individual components that are part of that and then see a chart for activity. Exactly, the key here is being able to parse these logs that are coming in, especially with structured logs, and be able to get to this type of dashboard out of the box, whether it's log types like Nginx or other log types, as well as having a custom dashboard that you can set up and see some visualization associated with anomalies, right? M-hmm [affirmative] Sort of the analytics associated with a longer period of time, that's what Loggly's great for. 
Yeah, anomalies from my Golang app that's running in a Dockerized container on Kubernetes, I can actually see where those issues are coming from and it's aware of those for me. And again, SaaS-based solutions, right? It gets you up and running very quickly. It does, and then the other thing is, it's sort of like the grand, not the granddaddy, it's the big brother, I think, in a lot of ways, to Papertrail, and the one thing that you guys have been asking is does that mean I can't do live tails anymore, and you absolutely can do live tails as well, so that's a part of the technology that's included in Loggly too. Yeah, and you can do live tailing, but as you can tell, some of the things I think Papertrail does really well are particularly the use cases around log aggregation and live tailing, so again, if you're a customer that's looking to aggregate logs and want to see the live tailing and be able to handle quick troubleshooting use cases with it and be able to see the log quickly and see that information and be able to search and filter, Papertrail's great. If you need to work with structured logs, do analytics on them, and have log types detected automatically as they come in, Loggly's great for that. And automated dashboards based on the data that's coming back in those logs. Yup. Okay, well, Loggly has a huge following, lots and lots of customers love it. I'm going to be out in San Francisco shortly and spend a little bit of time with that team. You like to make a joke that you're going to be able to get me to move out there, but... Still trying. Still trying, but maybe. And I'm really looking forward to working with that team and also spending a lot of time talking to you because we've discovered that a lot of our existing customers for some of our other SolarWinds products are also Loggly customers and have been for a long time. Yup. 
So we talked about digital experience monitoring, we showed how to do it in Pingdom, but at a larger level to think beyond real user monitoring and synthetic transactions, and to really think about integrating business data as well, and one, how to easily aggregate logs in general, but two, to start thinking about getting metrics that are actually usable from the data that you're aggregating with all of your log events. Exactly. Well, thank you, Michael, for walking us again through what the difference is between DEM and traditional web-performance monitoring. It is also always great to have you on the show, especially when we have a chance to answer the questions that have come from previous episodes, like some of the other ones that we've done on cloud. Patrick, you bet, and hopefully you learned a little bit more about how digital experience monitoring is different than infrastructure, application stack, or user-experience monitoring. Yeah, I think so, and especially if your chat is anything like the last time we did a show together, you've made it clear that over the last 18 months it doesn't really matter where your applications are running. They might be in the cloud or just on premises with a lot of hybrid, or maybe multiple clouds or a lot of connected services and APIs, but simply watching URL response time really just isn't enough anymore, because when applications aren't performing well, the very first question that you're going to ask in order to resolve that issue is, "What's the performance of its constituent elements?" That's right, and please keep your questions coming. This is an evolving area of IT and DevOps, and we're really interested to know more about digital experience monitoring challenges in your data center. And if you don't see a live chat window to your right, that's because you're not live with us. Swing by our homepage at lab.solarwinds.com and check the schedule for our next live show. 
And Patrick, you know you're always welcome to come out and work from the San Francisco office again, I promise, we won't make you move from Austin. Well, I already have an EV, well, it's a Chevy Bolt, not a Tesla, so maybe that doesn't count, but anyway, I'm Patrick Hubbard. I'm Michael Yang. And thanks for watching SolarWinds Lab.
