SolarWinds Lab Episode 60: SolarWinds Cloud, Go! Learn Modern Monitoring for Apps, Cloud, and DevOps

Applications are rapidly evolving from hulking, monolithic stacks to distributed components and services, flying in loose formation to deliver resilient services users like. But the way these applications are built, delivered, and monitored is new, with additional complexities introduced by Amazon Web Services (AWS) and Microsoft Azure service primitives, Docker and Kubernetes, auto-scaling, code pipelines, and a huge increase in deployment velocity. In this episode of SolarWinds Lab™, we go hands-on, not just with the challenges of modern application monitoring techniques, but with their enormous promise to reliably deliver services that delight users rather than merely satisfy old-world SLAs. SolarWinds Cloud® experts untangle application performance monitoring (APM) vs. infrastructure monitoring, distributed tracing strategies and concerns, tailoring dashboards to unique DevOps processes, best practices for wrangling telemetry agents, language instrumentation, code troubleshooting, and more.

Episode Transcript

Hello, and welcome to a special episode of SolarWinds Lab live from AWS re:Invent in Las Vegas. And joining me is Michael Yang, Director of Product Management for SolarWinds Cloud. Michael, how are you doing? Doing great. Thanks for having me here, Patrick. You bet! Well, thanks for being here. And this is a really special episode because I think one, we've wanted to do Lab live from a show floor for a very long time, and usually the logistics don't work out. But the other thing is that in so many of the conversations we've been having with you, both at events and at our SolarWinds user groups, you have been telling us that you are beginning to do real cloud monitoring. So that would be actual APM, right? Distributed infrastructure, applications, a lot of cloud-native technologies that maybe aren't the traditional applications or tools that you've been using to do monitoring. But then a lot of you now are coming to talk to us who maybe never used any of our enterprise products at all. And so this is a different time for SolarWinds Cloud, right? We're actually launching a new product here. We're launching AppOptics, which is a SaaS-based, cloud-native APM and infrastructure monitoring solution, and what better place to do it than the AWS re:Invent show floor? Right, but driving it is a little bit different. I think we think of monitoring traditionally as a bottom-up thing. You have this long, narrow application. It's a stack. Web layer on the top and then a reasonable number of components down. There's not a lot of change to it. We almost try to keep it static. But when you look at cloud-native applications, you're deconstructing applications, whether into services or microservices, or you're using containers, and these are apps that aren't monitored at all the way you would monitor a traditional application. This isn't your traditional application monitoring or infrastructure monitoring.
And what we're finding with a lot of customers, and a lot of you are on the same journey, is that you have traditionally built what are called monolithic applications. Right. And you're breaking them down into different components, different microservices. Right. Just like a lot of the cloud-native companies, like Amazon or Google or other companies that have gone through it, you're going through that journey as well. And when you go through that journey, you need a set of solutions that are SaaS-based so they can do this distributed application and infrastructure monitoring. Right. And that's exactly what AppOptics is for. And you can do it maybe one host at a time, per month, or the way that you're using it is more around the host or the application, not some number of elements, for example, that are typically licensed. Exactly. Fundamentally, it's a different way of developing applications, or changing applications, but also, where you are deploying. It's not just that you're deploying in your traditional, your own data center. You may be deploying on AWS, Azure, Cloud Compute, or your own private data center, wherever it may be. But you need a set of solutions that can provide you the real-time visibility that you're looking for. And it seems almost sacrilege to say this, but for so long in IT, you monitor the green things and you fix the red things. In other words, you build a giant pile of metrics, especially for infrastructure monitoring, right? Yeah. And then you try to extinguish all the red statuses in your environment. But that's not really true with cloud applications, where you're thinking of resiliency instead of availability. It's: "What is the end-user's experience regardless of the status of individual elements?" Obviously, if everything in the infrastructure is red, it's not going to work.
But if you're dynamically allocating resources, you're elastically scaling, you may have at any given time, or as part of your delivery pipeline, as part of your continuous deployment, parts of that infrastructure that are not participating in the application delivery. But as long as you are monitoring what users are actually experiencing, that becomes your KPI of whether or not you're delivering services. Exactly. At the end of the day, you're monitoring your modern distributed applications. You could be deploying at different places, like different clouds. And also, you want to get ROI very quickly. You can't spend days trying to go install this and get ROI. You want to be able to quickly install and start observing in minutes. Right. Now I have seen a couple of the demonstrations you've done, which looked amazing up on the giant screen with a big audience, but what we're going to do is a little bit different. I want to get Michael to actually explain how this works. So, we're going to talk about specifics of protocol. We're going to go through how to do an install. We will walk through end-to-end how this is typically used, and I want to start off with one quick stumper of a question, and then we're going to get straight into the how-to, and that is, why build something different? Why not just take existing technologies and make them subscription-based, and just push them up into the cloud? That's a great question. Fundamentally, the way you're building the applications is different these days. Like I said, you're going from monolithic applications to distributed applications. And the second thing is you want to be able to monitor that data, those workloads, and applications using a SaaS-based, cloud-native solution that you can get up and running quickly. And get value and be able to monitor these modern application infrastructures.
I'm not saying that I'm a little bit jealous of the dev office out in the Bay Area, but the views out there are really pretty amazing, so that alone is enough to get me pretty interested in what you guys are doing. Okay, so what we'll do here is let's walk through a basic story of, you've got a sample application, right? Absolutely. That we're going to look at. Now, you could be monitoring something like Amazon.com, right? Absolutely, we can monitor that. We have customers that have applications and workloads in AWS, and we're monitoring those today. But really large retail sites. But in this case, you've actually got one that you've put together that's a little bit smaller so we can actually walk through the layers of the application. Absolutely, absolutely. All right, well, talk to us about this application. So let's dive into it. This is AppOptics, our brand new application and infrastructure monitoring product, and it's a convergence of our two existing SaaS products, Librato and TraceView. And this is like a homepage, where you see your different services. So, it's a converged application and infrastructure monitoring solution, and this is the home dashboard. Now, before we dive into details, I want to provide some context in terms of what this system is actually monitoring. So, let's go over here. So, this is an example hotel reservation site. Like with any hotel reservation site you see online, you have different listings of hotels, different pricing, different ratings, and different things you can do with them. And the key thing here is that nowadays when you're building these applications, you're building what are called distributed applications. As an example, the client front end is written in Ruby. So, let me give you an example. We're at Amazon, AWS re:Invent. So, Amazon back in 2000 went through a process where they had a lot of traffic and volume that was coming into Amazon.com, a shopping site, and it wasn't able to scale.
So they famously made a rule and said, "Hey, we need to break Amazon.com from a monolithic application into more distributed applications." Loosely coupled services. Exactly, loosely coupled services. So you have a service that made the shopping cart, you have a service that made you a comparison, and you have each one of these teams that's responsible for building it as well as running it. You build it, you run it, right? So, same thing. Let's say you're building a hotel reservation site, and you may have a team that's responsible for the front-end client, right? And you may have a team that's responsible for building booking services. So, the client front-end service you saw is written in Ruby. You have a separate team that may be building this, and you have another team that's building a booking service, and they may choose to build their services using Java, as an example. Right. Or maybe a legacy service that's being provided by a third party. And you're just interfacing with that. Exactly, so maybe you're getting pricing services from other third-party services, right? Nowadays, you have different API calls to extract whichever information you need for your site. Hospitality, never legacy services. So, this is an example of a distributed application. It's written in different languages, and its components are deployed on AWS, running in container services across these different services. So when you have these distributed services on the cloud... Now, let's say you run into a performance problem on the website. Now, where exactly is the performance bottleneck? It's not a single application. It's written by multiple teams in different languages and is deployed on the cloud using different container services, as an example. And if you're Operations, responsible for keeping an eye on this... Developers are always easy to get engaged in the first six months after something is deployed.
And once it's been around for two years... Well, you know, we've gone on to do something else. So keeping them engaged while this evolves over time. If you're Operations, development is and should be continuing to transform services, continuing to deconstruct monolithic components of that infrastructure. So, things are changing, so that's a big part of it too, is that how do you keep it up while everything is actually... That's a perfect example. So, let's say you have a team. You saw the booking services. They're doing continuous delivery, so they may ship updates, and after they've deployed it, all of a sudden you see a performance bottleneck on hotel reservations. Now if you are a DevOps or Ops person, you need to pinpoint that. It could be due to the applications, right? And you want to be able to point out exactly where within the hotel reservation site that bottleneck is happening, and you want to take that to the developer. And what AppOptics does is take it one step further, and we'll help you and your developers pinpoint exactly where the performance bottleneck is within your code. So it could be the infrastructure, could be the application itself, or maybe as a part of Operations, I need to do more horizontal scaling. Exactly. Okay. Perfect. So, you have this example. Let's go back to AppOptics; we're monitoring that hotel reservation site. The first place you may start is our dashboard. So, the dashboard you see here has different metrics, right? So it provides you different business as well as technical metrics, as an example. So, as an example: for the last 24 hours, it shows you how many users were logged in, what's the average spend per user. So these are business metrics, right? Okay, but they're not just business metrics because when you're talking about distributed applications, that average bookings rate, that's not just a business value.
Because if my page abandons start to go up because my service is not returning reservation options quickly enough, that's actually an operational metric. Absolutely, and if you see the average spend go down, it could definitely be related to application performance or infrastructure performance issues, right? So what this provides you is a time-series database interface and visualization, and it shows you the average number of users logged in, different supports that get created. But you definitely see, within the last 24 hours, a certain anomaly in application response time, as an example. And these are aggregates of the different applications that you have, going back to distributed applications as an example. And you also see some anomaly associated with the system overload. So, it could be related to the CPU, it could be memory. So, it's a single place to look at all your metrics, whether business, operational, or technical, and you can start to drill into some of the diagnosis and root cause behind that. So at that point, you're a DevOps or Ops person and you say, "Hey, let's go into this particular set of applications because there's definitely something going on here." So at this point, we're navigating, as a DevOps or Ops person, to the APM component of the product. So you see here, as I mentioned earlier, we talked about different distributed applications, and we see here that client front-end for... We'll set this to 24 hours. You definitely see an anomaly here; it is by far taking the longest out of all the other services and applications you have. And it's the spikiest. Absolutely, and with client services, like I said, the team decided to develop them in Ruby. You have booking services that are written in Java, you have a pricing service, as an example, that's written in PHP.
So, we support seven different languages, and that primarily covers all the major programming languages for APM purposes. So, let's drill into this. So, client front-end services, right? So, I'll show you later on, but this really took three steps. Less than two minutes to get all this data and these metrics. And this is what I will call step one of APM. The first thing you want from APM is three key metrics: average response time associated with the application services; requests per second, which is throughput; and any type of error rates you may have. And associated with that, you have different charts and visualizations for slicing and dicing that data, and a lot of companies traditionally built their whole APM business using just these three types of metrics, right? But nowadays, a distributed application is much more complex, and you need to dive into one additional step of detail in order to truly get the application performance observability that you require. So at this point, you're a DevOps or Ops person and you say, "Hey, you know what? I want to look into what this looks like from code-level visibility from the transaction perspective." So, what we have here is what we call a heat map. So, you see here that this is a heat map of all the transactions that occurred for client front-end services within the last 24 hours, right? And obviously, there's a spike that happened for these sets of transactions. So, if you were looking at periodic bad habits, where there's a certain periodicity to an error or performance issue, you're going to see it in the heat map view, where normally that would be a really long histogram. And you would somehow try to zoom out far enough, but you're not going to be able to see it across all the layers of that stack. Exactly, and what you can do is look at the services or application from a transaction perspective, and you are looking at the outlier ones.
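As a rough illustration of the three key metrics and the heat map described above, here is a minimal Python sketch. It is not AppOptics code: the `Request` record, the function names, and the bucket sizes are all assumptions made up for this example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical record of one handled request (not an AppOptics type)."""
    timestamp: float    # seconds since the start of the observation window
    duration_ms: float  # how long the request took
    is_error: bool      # whether the request failed


def summarize(requests, window_seconds):
    """Return average response time, requests per second, and error rate."""
    count = len(requests)
    if count == 0:
        return {"avg_response_ms": 0.0, "requests_per_sec": 0.0, "error_rate": 0.0}
    total_ms = sum(r.duration_ms for r in requests)
    errors = sum(1 for r in requests if r.is_error)
    return {
        "avg_response_ms": total_ms / count,
        "requests_per_sec": count / window_seconds,
        "error_rate": errors / count,
    }


def heat_map(requests, time_bucket_s, latency_bucket_ms):
    """Count requests per (time bucket, latency bucket) cell."""
    cells = Counter()
    for r in requests:
        cell = (int(r.timestamp // time_bucket_s),
                int(r.duration_ms // latency_bucket_ms))
        cells[cell] += 1
    return cells


sample = [
    Request(1.0, 100.0, False),
    Request(2.0, 120.0, False),
    Request(12.0, 900.0, True),   # slow outlier that also failed
    Request(13.0, 80.0, False),
]
stats = summarize(sample, window_seconds=20)
cells = heat_map(sample, time_bucket_s=10, latency_bucket_ms=500)
print(stats)
print(sorted(cells.items()))
```

In the heat map, the single slow request lands in its own high-latency cell, so an outlier stays visible even though the average response time smooths it out.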
And what's so great about this is, you can start to look at what's called tracing. This is distributed tracing. So for this client service, you have different components, and these components are made up of Ruby and the different app services associated with it. You have different Java services, which are calling a booking service. And if you go down, out of all these different services it's calling from client services, you have different span layers or components that are taking the longest. So, what you see here is that MongoDB is taking, for this transaction, 90.94%. Right, and if you look at it up here, the bars almost look like an old vinyl record, right? So just right away, I can tell that there's a huge number of calls for this one request, where the total span is a single transaction, but that's how much we're hammering MongoDB. Exactly, so if you see here, the span count is 1,803. So there are 1,803 calls into... For one booking look-up. MongoDB, right? So at this point, if you're a DevOps or Ops person, you say, "Hey, I really think that the application performance issues associated with the hotel reservation site are really related to client front-end services." At this point, they would bring in the developers or the team that wrote the services. Okay, well, maybe. There's nothing I like more than to get a developer out of bed at 2:00 in the morning. That's not true; I was a developer for a long time. I would hate that. But if you are doing DevOps, if you are operations for cloud, part of the skillset you are developing is the ability to go a level or two deeper than that. To really understand applications, and being able to visualize this is how you're going to learn these technologies. If you've been doing relational databases forever, and Dev hands you Mongo and is like, "Don't worry about it, it's the same." It is not at all the same.
And what normally you would do in a lot of applications is, you would have to set up some sort of logging from five different layers and be able to do per-transaction logging to figure out where there are errors. But you just said, "Here you can see," just a second ago, and what you opened up here is, for this transaction, for that time, the data that's a part of that trace data. So don't underestimate the value, as an operations professional, of being able to not have to pick up the phone and call a developer. I want to fix it myself. And having this type of visibility is awesome. And as you can tell, as a DevOps or Ops person, you know that MongoDB here is very chatty; the application is making a lot of database calls. And this is a place that you might want to optimize. Now, how do you want to optimize that? If you scroll down further, these are, out of all the MongoDB calls, the queries that were the most popular and were invoked the most. For example, this particular query was invoked 897 times, and this query was invoked 894 times. You're not suggesting that I might want to talk to Dev about doing a little bit of optimization here. You may want to do that in the morning, but not at night. No, but you could actually do reporting on most commonly executed queries. To say, "Look, across our operations, out of 2.5 million look-ups we had over this period, 80%, 90% of my time is tied up with this one query." So instead of just saying, "Hey, you need to improve the performance of the application in the way that it's consuming the database," you can be much more specific than that and actually help them be more productive. That's awesome, and one of the very key points I want to bring up is the fact that when you're building cloud applications, not only do you make calls to your existing database, but you can be making external API calls.
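The trace breakdown discussed above (which layer takes most of the time, and which queries are invoked most often) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the span tuples, layer names, and query descriptions are invented, each duration is treated as time spent in that span alone, and none of this reflects the actual AppOptics data model.

```python
from collections import Counter, defaultdict

# Hypothetical spans from a single trace: (layer, operation, duration in ms).
# Durations are assumed to be exclusive (time spent in that span only).
spans = [
    ("ruby", "GET /bookings", 4.0),
    ("java", "booking-service lookup", 6.0),
    ("mongodb", "find rooms by hotel", 30.0),
    ("mongodb", "find rooms by hotel", 30.0),
    ("mongodb", "find rates by room", 30.0),
]


def time_share_by_layer(spans):
    """Percentage of total trace time spent in each layer."""
    totals = defaultdict(float)
    for layer, _operation, ms in spans:
        totals[layer] += ms
    grand_total = sum(totals.values())
    return {layer: 100.0 * ms / grand_total for layer, ms in totals.items()}


def top_operations(spans, layer, n):
    """The n most frequently invoked operations within one layer."""
    counts = Counter(op for span_layer, op, _ms in spans if span_layer == layer)
    return counts.most_common(n)


shares = time_share_by_layer(spans)
top = top_operations(spans, "mongodb", 2)
print(shares)
print(top)
```

A report like `top` is the kind of specific evidence mentioned in the conversation: rather than asking developers to "improve database performance," operations can point at the one query that dominates the trace.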
And we can point out exactly where within your code you're invoking external API calls, as well as show you how long it's taking to invoke that call and get the response back. So not only can you get visibility into your own applications, but also into invocation and response times for external services. All right Michael, now so far we've been focused mostly on the application, on the distributed components of this application. That's right. But this application is not running in a vacuum. It's actually running on a set of services. So, where do I find those in the dashboard? That's a great question, Patrick. Let me show you. So I want to go to the dashboard component, and I want to show you two dashboards. One dashboard is what we call Host Services, Host Agent Services. So these are your basic infrastructure metrics: CPU utilization, load averages, memory. Everything I'm paying for. Exactly, and you can filter through different hosts you may have across different clouds, for example, AWS ECS services. And you can filter it, slice and dice it in a time-series database solution, right? Another dashboard that I wanted to show you is Docker. So remember, our applications are running on containers. So, we also have the ability to monitor our Docker container services. So, it shows you the different Docker images you have, how many running container instances you have, CPU by image, so you have different ways to look at it. Again, this is an infrastructure component that you can also correlate to your application component. So it's a converged application and infrastructure monitoring solution. And so in this case, especially because... And I think you guys have been telling us so much that you are now... It started about a year ago. They were telling us they were being forced to move to containers. And now, once they are beginning to actually Dockerize a lot of their services, they're kind of really enjoying that.
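To make the Docker dashboard idea concrete (running containers per image, CPU by image), here is a small Python sketch. The sample tuples and image names are hypothetical, and in a real setup the host agent would collect these values; the sketch shows only the aggregation step.

```python
from collections import defaultdict

# Hypothetical container samples: (image, container_id, cpu_percent).
samples = [
    ("client-frontend", "c1", 35.0),
    ("client-frontend", "c2", 25.0),
    ("booking-service", "c3", 10.0),
]


def summarize_by_image(samples):
    """Group container samples by image: running count and total CPU."""
    containers = defaultdict(set)
    cpu = defaultdict(float)
    for image, container_id, cpu_percent in samples:
        containers[image].add(container_id)
        cpu[image] += cpu_percent
    return {
        image: {"running": len(containers[image]), "cpu_percent": cpu[image]}
        for image in containers
    }


by_image = summarize_by_image(samples)
print(by_image)
```

Keying the summary by image rather than by container ID is what lets the view survive containers being created and destroyed as the service scales.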
So this could be Swarm, Kubernetes, or Mesos, right? Yeah, so it's great. We have plug-in support for Kubernetes and Mesos. We also have plug-in support for Nginx, Apache, MySQL, Cassandra. So, we have over 150 integrations that we support. And we also leverage the communities, the open communities, so you can come in and build your own plug-ins, or metrics or monitoring solutions if you'd like. And those are plug-ins for Snap, right? Exactly. So by choosing Snap, you're inheriting the telemetry framework, plus expertise, plus a lot of you are already using Snap plug-ins. So, in a lot of cases, this is going to be really familiar. It's an open-source solution, so anybody can come in, and if you want to build a plug-in for different solutions that you have or monitoring needs, you can do that as well. And you guys built one that we're going to see here in a bit that's actually pretty cool. So, we're talking about telemetry here, right? This is not just sort of polling. These are coming in from each one of the agents, right? So walk us through how these are actually installed. I feel like I've been teasing you guys long enough. This is the part that I actually want to get to. Exactly, so let me show you an example. So after you sign up, this is what you see. Basically, you have an option. You can go and instrument an APM agent, or you can go and instrument a host agent. So let's say you take the route of APM, right? And let's say you have a set of Java applications, like we saw in booking services, right? So you select Java, you select the environment. It's Windows or Linux. Let's say it's Linux, and you name your service, let's say Services app one. And this is it. It just gives you three steps. First step is curl. Download this particular installation, run the script. And the service key is going to automatically link it back to my account. Exactly right. And you configure the JVM, restart the JVM, and you're good to go.
And down here on the bottom, you can see there's a spinner while it's waiting for data so that you can tell that you're getting data. Exactly, and you could also go back and go through this process with different languages and different programs for your APM. Or you could also do infrastructure, right? You could go and install the host infrastructure agent on a different environment, let's say Ubuntu, with just a one-liner. And you've got Amazon's AMI right there on the top. Exactly. You can even go with AWS CloudWatch as well. So after you do this, let's say you went and did a... So you have the AppOptics Demo Systems, and the client services you saw there. In less than two minutes, you'll be able to get this data out. Plus, the trace details. Okay, so the thing that's interesting to me about this, and I started as a developer years ago, is that if I wanted to do this kind of composite monitoring, I had to figure out what the context was. I would have to come up with a logging context, for example, and then grep through a whole lot of logs and hope that somehow I could tie it together with an IP address or transaction ID. And I think what's really confusing, especially with composite applications and then the infrastructure that's actually supporting that, is how you're connecting those things, right? Because you don't always know how they're associated. So the thing here that it took me a long time to get my head around is that right here, the set-up bar is here. So, we have the booking service, the application, the remote calls that are part of it. In the view that we looked at before that, and let's go back to this application here for the trace view, we're looking across the web transaction, database layer, all the way down into frameworks, all the way down through languages. I don't have to create associations between each one of these elements that are a part of this infrastructure.
That common ID, both for the code injection, for infrastructure, for APM, for the framework, is automatically being connected by the SaaS back end, by the telemetry agent, so I don't need to go figure this out. This is as much, to me, a discovery tool as anything else. Once it's instrumented, it's going to figure out for me how my applications are talking to each other. That's the beauty of it. This is what we call auto-instrumentation. So basically, we've done all the hard work for you. So after you go through those three steps that you saw, for a Java or Ruby or .NET application, we can go figure out the internal workings of server applications and distributed applications across different infrastructures, and we can go do all that for you. And that's the beauty of auto-instrumentation, that it just does it for you magically. But is it really magic? I think one of the things that's interesting about this approach, and I will say, and I have said many times to you in the past, is that SolarWinds Cloud is not a hobby. These products, when we look at Librato and TraceView, these are mature products. In some cases, almost six or seven years old. Absolutely. And so what's happened here is you've built a common platform to be able to manage the telemetry, the common views and the rest of it, but it's really not magic, it's an awful lot of experience and a lot of working with customers like you that have been part of these cloud products and using them for a really long time. And you were telling me that the back end of this now, for ingesting metrics, is how many? Two million metric streams per second. Two million streams per second. And that's today, right? And there are other commonalities, like you look at the new lightning search feature. Papertrail is actually built on the same set of technologies, so these are not 1.0 versions of these products.
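The "common ID" idea above, where every layer tags its telemetry with the same identifier so the back end can stitch a transaction together, can be sketched as follows. This is a toy model under stated assumptions: the header name `X-Example-Trace-Id`, the in-memory `collected_spans` list standing in for the SaaS back end, and the three service functions are all hypothetical, and real agents inject this context automatically rather than by hand.

```python
import uuid

TRACE_HEADER = "X-Example-Trace-Id"  # hypothetical header name for this sketch

collected_spans = []  # stands in for the back end that receives telemetry


def record_span(trace_id, service, operation):
    """Report one unit of work, tagged with the shared trace ID."""
    collected_spans.append(
        {"trace_id": trace_id, "service": service, "operation": operation}
    )


def pricing_service(headers):
    trace_id = headers[TRACE_HEADER]  # reuse the caller's ID
    record_span(trace_id, "pricing", "get_price")
    return 129.0


def booking_service(headers):
    trace_id = headers[TRACE_HEADER]
    record_span(trace_id, "booking", "find_rooms")
    # Pass the same ID along on the downstream call.
    price = pricing_service({TRACE_HEADER: trace_id})
    return {"room": "standard", "price": price}


def client_frontend():
    trace_id = uuid.uuid4().hex  # a new ID is created at the edge
    record_span(trace_id, "client-frontend", "GET /hotels")
    return trace_id, booking_service({TRACE_HEADER: trace_id})


def services_in_trace(trace_id):
    """Reassemble the call path for one transaction from the shared ID."""
    return [s["service"] for s in collected_spans if s["trace_id"] == trace_id]


trace_id, result = client_frontend()
path = services_in_trace(trace_id)
print(path)
```

Because every span carries the same ID, the call path can be rebuilt without anyone declaring in advance which services talk to each other, which is why the conversation describes the tool as being as much about discovery as monitoring.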
And so I know it seems like magic, and when I first started playing with it, it did feel like magic to me, and then I started thinking about it like, "No, this is just best practice that's been accumulated over a really long time." How can I do this without having to go back to university again to come up with a whole new set of technologies that I've never seen before? Exactly, we took all of the best practices, all of the capabilities that we've built over the years, built it into the product, called it AppOptics. And then we made it more powerful, simple, and really affordable. So, I'm hoping that you all realize that this is something that if you have ever been curious about APM, that there's an opportunity to really use this to learn about it as much as anything else. Or maybe that you've been doing cloud-native application monitoring for a really long time, so as usual, it's pretty easy to get started and just to check this out. So where do they go? It's really simple. Just go to AppOptics.com. You have 14-day free trial, and you can get started monitoring on-premises or cloud application infrastructure. And get value out of it in less than two minutes and configure it and ready to go. Awesome, and what's really interesting for, I think, regular viewers of SolarWinds Lab— This is something that's a little bit different, right? When you think about composite monitoring, you think about distributed application monitoring— at first blush, it might seem completely different. You might say, "Oh, this is cloud-native technology, this is something very, very different." And if you're not a usual viewer of SolarWinds Lab, you might be thinking, "Why are they spending all this time? I totally understand cloud-native, everything you just showed me makes perfect sense. I can make this work. I can put these into my containers and throw them into Kubernetes cluster and be running very, very quickly." 
But the thing that I think is really important is that this year, 2017, was the first year that so many of you were telling us both in surveys, out on user community, and at events like SWUGs, that more than half of you are using cloud. And not just more than half of you are using cloud, but more than half of you are multi-cloud. So you're AWS and Azure and Bluemix— actually, a lot of them are on Bluemix and a lot of Google Cloud. Especially for those of you who are kind of focused on Kubernetes. So, this is an opportunity to really learn the technology. If you're sort of in between, if you realize that as an operations professional, you really are going to need to dig into these to figure out how to demystify that mongoDB query that's maybe taking a long time. Which is very, very chatty. Right, it's very, very chatty, but when you look at the data when you can actually see the transaction details, you start to say, "Well, this is a lot like monitoring the databases that I'm used to." That's the next level of visibility that AppOptics can provide to you. It's not just your father's APM solution, which is just a metrics, associated response time, throughput error, but it gives you one level deeper. Like code-level visibility, like how many times did database [mumbles] happen for this transaction. Right, and that's really the point of these common dashboards, is that there is that blurred line of the special kids were getting to play with all the new technology who were focusing just on cloud, is that the promise of DevOps is that operations gets to take the best practices. Like in this case, when we think about the challenge of monitoring as a discipline, if it's not just, "Hey, we'll get to go find some budget and applying monitoring later." It's, "We're going to bake monitoring in." 
We're going to have the discipline to say, "We're going to bake our monitoring protocols into our applications, so that regardless of where they run, we're instrumented, and we have telemetry on them." And it gets away, in a lot of ways, from the traditional limitation of, "Well, we'll get to it after it's installed." Or, "When something breaks, we'll find budget to do something about it." And instead, we can provide a lot of value to the business. We're bringing all the things that you love about SolarWinds, which is powerful, simple capability with affordable pricing, so that you can bring more of the users into the cloud with a SaaS solution. And a little bit of magic too. A little bit. So you definitely want to check this out. Again, AppOptics.com. Go ahead and start experimenting with this. The rest of the Head Geeks and I have been working on this technology now for almost a year. Me, a little bit longer, and I think I've twisted their arms enough, and it's great to watch, not just them, but so many of you who are reaching out to us and saying, "Hey, I didn't know that I could take a look at a LAMP stack or that I could monitor Ruby," or that you could do any of the other things that are being requested by your Dev team, who are eager to hand applications off so you can run them like any other part of your operations. So, definitely check it out, and I think we need to get out of the way. They're bringing in a bunch more of the forklifts, and they're going to want to get started. We're going to be opening at 11:00 tomorrow, and I think everybody's going to be really busy. So thanks again for being part of this special Lab episode, and we'll see you again soon.