Application Performance Monitoring (APM) is taking the cloud by storm. In 2019, Datadog and Dynatrace held very successful IPOs with multibillion-dollar valuations, and Splunk acquired SignalFx for $1 billion. Longtime market leaders like New Relic and AppDynamics also continue to be strong players.
Your company is probably using at least one of these tools, if not several, to monitor everything from CPU utilization to API calls to container events. APM platforms are widely extensible and plug into integrations from hundreds of commercial and open source components and services. Whatever applications and services you’ve deployed, there’s at least some monitoring you can do with an APM tool.
APM tools are fantastic. I’ve tried living without them, and I wouldn’t want to do it again. However, if APM is all you’ve got, then you’re missing out on key information and troubleshooting capabilities that could help you prevent outages and downtime.
Standards and Special Cases
Comprehensive application and infrastructure monitoring of a production environment is not a trivial investment. I’ve worked on business cases for large APM investments to replace outdated homegrown monitoring solutions. The numbers were a bit scary, but DIY monitoring had lots of hidden costs, and the ROI of implementing proper tools was real. Once you’ve committed to a significant APM implementation, there’s a lot of pressure to maximize your use of the APM tools. Along with that may come resistance to other kinds of monitoring. In addition to cost consolidation, standardization is a logical path to achieving a “single pane of glass” view of your systems and sharing information across teams.
However, if your business relies on mission-critical, data-centric applications to serve your customers, database performance monitoring (DPM) is the most important area where you really need more than what you can get with APM tools. Third-party market research shows that at least 70% of application performance problems can be traced to issues at the data layer. This isn’t too surprising when you consider that workloads that manage persistent state information are among the toughest to scale, particularly for applications that were not designed with distributed computing in mind.
Let’s look at how application performance monitoring and database performance monitoring complement each other in the context of incident management. For simplicity, we’ll use the DevOps incident management lifecycle from the folks at VictorOps, but this can be applied to ITIL and other more elaborate incident management processes too.
Incident Management Lifecycle
Detection – Determining That an Issue Has Occurred
- Early detection is key to mitigating the impact of an incident on your end users. Properly deployed APM tools are great for detecting problems as they start to develop instead of waiting until things have gotten really bad. If you’ve defined good Service Level Objectives (SLOs) for your applications, then you can sound the alarm as soon as performance starts to fall below the SLO thresholds.
- Database performance monitoring also plays an important role in incident detection. There are a lot of early warning signs within the data layer (job failures, disk space filling up, database blocking, etc.) that a DPM tool can detect for you with minimal configuration. Our Subway case study shows how a team of DBAs used SentryOne SQL Sentry to diagnose database performance problems 12 times faster.
- Having both APM and DPM capabilities deployed gives you the fullest picture of situations that need your attention before they escalate into service disruptions.
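To make the SLO idea concrete, here is a minimal sketch in Python of checking one window of request latencies against a latency SLO. The `LatencySlo` class, the 300 ms threshold, the 95% target, and the sample data are all hypothetical and are not taken from any particular APM product.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class LatencySlo:
    """Hypothetical SLO: at least `target_ratio` of requests finish within `threshold_ms`."""
    threshold_ms: float
    target_ratio: float


def slo_compliance(latencies_ms: Sequence[float], slo: LatencySlo) -> float:
    """Return the fraction of requests in the window that met the latency threshold."""
    if not latencies_ms:
        return 1.0  # no traffic in the window, so nothing violated the SLO
    good = sum(1 for ms in latencies_ms if ms <= slo.threshold_ms)
    return good / len(latencies_ms)


def should_alert(latencies_ms: Sequence[float], slo: LatencySlo) -> bool:
    """Alert as soon as compliance drops below the SLO target."""
    return slo_compliance(latencies_ms, slo) < slo.target_ratio


# Made-up sample windows of 20 requests each.
slo = LatencySlo(threshold_ms=300.0, target_ratio=0.95)
healthy_window = [120.0] * 19 + [450.0]       # 19 of 20 requests within threshold
degraded_window = [120.0] * 17 + [450.0] * 3  # 17 of 20 requests within threshold

print(should_alert(healthy_window, slo))
print(should_alert(degraded_window, slo))
```

A real deployment would evaluate this over rolling windows and would usually require several bad windows in a row before paging anyone, but the core comparison is the same.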
Response – Routing an Incident to the Right Place
- APM tools give you a comprehensive view of layered application stacks.
- Database folks have joked for years that DBA stands for “Default Blame Acceptor.” If every incident is routed to the DBAs by default and the root cause is not at the data layer, you waste valuable time. Customers tell me that they’ve used SQL Sentry to prove to colleagues on other teams that the root cause of the issue is not at the database layer. However, as noted above, most application performance problems really do trace back to the data layer, which brings us to…
Remediation – Fixing the Problem
- When it comes to resolving issues that reside in the database layer, DPM solutions are truly irreplaceable.
- The biggest advantage of database performance monitoring is query-level analytics. APM tools focus on instance-level metrics that quantify the overall performance of a SQL Server instance. In a typical database incident, the root cause is often one or more queries that need to be optimized. Let’s say that you’re struggling with issues caused by database blocking. An APM tool can tell you how many processes are blocked. SQL Sentry can show you exactly which processes and queries are blocking the others so that you can understand exactly what needs to be remediated.
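The difference between a blocked-process count and a blocking chain can be illustrated with a small Python sketch. It assumes a snapshot shaped like pairs of session ID and blocking session ID, similar to the `session_id` and `blocking_session_id` columns that SQL Server exposes in `sys.dm_exec_requests`, where 0 means the session is not blocked. The function name and the sample snapshot are hypothetical, and this is not a description of how SQL Sentry works internally.

```python
from collections import defaultdict
from typing import Dict, List


def find_head_blockers(blocked_by: Dict[int, int]) -> Dict[int, List[int]]:
    """Map each head blocker to every session waiting behind it, directly or indirectly.

    `blocked_by` maps session ID -> blocking session ID, with 0 meaning "not blocked".
    """
    def head_of(session: int) -> int:
        # Walk up the chain until we reach a session that is not blocked itself.
        # The `seen` set guards against cycles so a deadlock snapshot cannot loop forever.
        seen = set()
        while blocked_by.get(session, 0) != 0 and session not in seen:
            seen.add(session)
            session = blocked_by[session]
        return session

    chains: Dict[int, List[int]] = defaultdict(list)
    for session, blocker in blocked_by.items():
        if blocker != 0:
            chains[head_of(session)].append(session)
    return {head: sorted(waiters) for head, waiters in chains.items()}


# Made-up snapshot: 53 waits on 52, which waits on 51; 55 waits on 54; 56 is idle.
snapshot = {51: 0, 52: 51, 53: 52, 54: 0, 55: 54, 56: 0}
blocked_count = sum(1 for blocker in snapshot.values() if blocker != 0)
head_blockers = find_head_blockers(snapshot)

print(blocked_count)   # the instance-level number: how many sessions are blocked
print(head_blockers)   # the query-level view: who is at the head of each chain
```

The count alone says three sessions are stuck. The chain view says that sessions 51 and 54 are the ones to investigate, which is the information you need to decide what to fix.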
Analysis – Documenting What You Learned
- The historical data and reporting capabilities of DPM solutions enhance the root cause analysis process and make it easy for DBAs to help stakeholders visualize and understand the issue. Writing up the details of an incident after it has been resolved takes time and effort, but it’s essential if you don’t want to keep running into the same problems.
- When you’re writing up an incident report, a picture is worth a thousand words. Graphs and visualizations are especially helpful to team members who are not as hands-on with your systems.
Readiness – Putting Analysis Into Action
- Once you’ve learned from the incident, the next step is to implement improvements that mitigate the risk of repeat incidents. DPM tools give you the ability to automate your response to some problems. In addition to generating alerts, SQL Sentry also lets you configure a variety of automated actions. You can automate your remediation via T-SQL, PowerShell, SQL Agent jobs, or any external command or script.
- The final step beyond automated response is to implement optimizations that reduce the risk of future incidents. The trending and baselining capabilities in SQL Sentry make it easy to understand where you have the most risk. For example, if the incidents that you’re seeing are symptoms of an undersized system, baseline data can help you understand how to upsize accurately.
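As a rough illustration of how baseline data can feed a sizing decision, the Python sketch below takes a set of CPU utilization samples, computes a 95th-percentile baseline, and estimates how much additional capacity would bring that baseline down to a target utilization. The function names, the 70% target, and the sample values are hypothetical, and the estimate assumes utilization scales inversely with capacity, which is a simplification that real workloads may not follow.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def capacity_multiplier(cpu_samples: Sequence[float], target_pct: float, pct: float = 95.0) -> float:
    """Estimate the capacity factor needed so the baseline percentile lands at target_pct.

    Assumes utilization is inversely proportional to capacity (a simplification).
    """
    return percentile(cpu_samples, pct) / target_pct


# Made-up CPU utilization samples (percent) collected over a baseline period.
cpu_samples = [55.0] * 10 + [70.0] * 5 + [84.0] * 4 + [99.0]

baseline_p95 = percentile(cpu_samples, 95.0)
multiplier = capacity_multiplier(cpu_samples, target_pct=70.0)

print(f"p95 CPU: {baseline_p95:.1f}%")
print(f"capacity needed: {multiplier:.2f}x current")
```

Sizing from a high percentile rather than the average or the single worst sample keeps the estimate from being skewed by quiet periods or by one-off spikes.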
Where to Go From Here
If you’re using APM but still need the key information and troubleshooting capabilities that could help you prevent outages and downtime, SentryOne can help you build a DPM business case. Check out Calculating the ROI of SQL Server Monitoring and our online ROI Calculator to get started.