The Marvel of Observability
“[You’ve] been fighting with one arm behind your back. What happens when [you’re] finally set free?” – Paraphrasing Carol Danvers, a.k.a. Captain Marvel
BOOK REVIEW: How to Architect and Build Highly Observable Systems by Baron Schwartz
Observability is a property of an application or system, not the actual act of analysis. The system is observable, practically and mathematically, if you can understand its inner workings and state by measuring its external behaviors. That means the system exposes telemetry, which is the data emitted from instrumentation that expresses those external behaviors—a feature ideally baked into your code upfront. Monitoring is the act of analyzing the telemetry to see whether the system is functioning correctly. And diagnostics is the process of determining what’s wrong with a system.
These definitions are starting points for the 2019 edition of How to Architect and Build Highly Observable Systems by Baron Schwartz, which offers practical guidance for getting the visibility you need to optimize your systems and achieve heroic performance gains.
“Just Because Something Works Doesn’t Mean It Can’t Be Improved” – Shuri of Wakanda in Black Panther
Schwartz’s guide makes clear that the telemetry exposed by a well-architected, well-operating system is specific. It comprises seven “golden signals” (CELT and USE) that can clarify external quality of service and internal sufficiency of resources. With this telemetry, developers can measure whether users get correct, fast answers to their requests, and why or why not.
In a well-regarded essay, distributed systems engineer Cindy Sridharan reinforces this idea when she notes that observability…
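To make the service-side signals concrete, here is a minimal Python sketch that summarizes a window of completed requests. It assumes CELT is expanded, as it commonly is, to concurrency, error rate, latency, and throughput; the `Request` record and the `celt` function are hypothetical names for illustration, not something taken from the eBook. Concurrency is approximated as total busy time divided by the window length, which assumes every request falls entirely inside the window.

```python
from dataclasses import dataclass


@dataclass
class Request:
    start: float     # seconds since the start of the window (illustrative field)
    duration: float  # seconds spent serving the request
    ok: bool         # False if the request ended in an error


def celt(requests, window_seconds):
    """Summarize completed requests into four CELT-style numbers."""
    n = len(requests)
    if n == 0 or window_seconds <= 0:
        return {"concurrency": 0.0, "error_rate": 0.0,
                "latency": 0.0, "throughput": 0.0}
    busy_time = sum(r.duration for r in requests)
    errors = sum(1 for r in requests if not r.ok)
    return {
        # Average number of requests in flight over the window.
        "concurrency": busy_time / window_seconds,
        # Fraction of requests that failed.
        "error_rate": errors / n,
        # Mean time per request, in seconds.
        "latency": busy_time / n,
        # Requests completed per second.
        "throughput": n / window_seconds,
    }
```

A real system would more likely track latency percentiles than a mean, but the mean keeps the relationship between the four numbers easy to see: concurrency equals throughput multiplied by mean latency.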
“… aims to provide highly granular insights into the behavior of systems along with rich context, perfect for debugging purposes. Since it’s still not possible to predict every single failure mode a system could potentially run into or predict every possible way in which a system could misbehave, it becomes important that we build systems that can be debugged armed with evidence and not conjecture.”
CELT and USE with query-level visibility enable that kind of granularity and context. By building a more observable system, teams have a fighting chance to swiftly combat service degradation — and any Thanos-level disruption that a critical outage might cause. Schwartz’s detailed insights around pitfalls to avoid, best practices to target, and how to craft an application’s database workload are useful to software developers and tech leaders seeking to improve systems performance and free their organizations from the costs of ongoing technical debt.
“You’re Embarrassing Me In Front of the Wizards” – Tony Stark to Bruce Banner
It would indeed take a superhero to build perfectly coded instrumentation for observability. For mere mortals, the pitfalls are tricky, and the consequences of not avoiding them can go well beyond professional embarrassment. Logging, for example, can slow problem-solving when log levels are confusing. Schwartz effectively questions the overlap within log taxonomy and notes that the messages that actually aid developers are those useful for writing and debugging code or for operating it. Plenty of commentary over the past several years also highlights the related importance of structured logging, distributed tracing, and trace portions for observability.
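As one illustration of structured logging, the sketch below uses Python’s standard `logging` module to emit each message as a single JSON object. The `JsonFormatter` class and the `context` field are assumptions made for this example rather than anything the eBook prescribes.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # "context" is a hypothetical attribute supplied by the caller via
        # logger.warning(..., extra={"context": {...}}).
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload, sort_keys=True)
```

Attached to a handler with `handler.setFormatter(JsonFormatter())`, a call such as `logger.warning("slow query", extra={"context": {"request_id": "abc"}})` produces a line that a log pipeline can filter by field instead of by pattern-matching free text.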
Other pitfalls are common, like system status output that doesn’t distinguish between status variables (actual state) and configuration variables (desired state), backward compatibility that breaks due to renamed variables, and services—especially common database services—that don’t offer any way to inspect key telemetry items (some among CELT and USE).
Monitoring tools themselves can cause pain, too, if they don’t have effective flap mitigation, alert consolidation, alert cancellation, and alert suppression for scheduled maintenance. “Choose tools wisely” is Schwartz’s implied mantra.
“Jarvis, Sometimes You Gotta Run Before You Can Walk” – Tony Stark
To stretch Tony Stark’s sentiment just a bit: acting on production needs from the very start, rather than waiting to address them later, is crucial in Schwartz’s observability guidance. He notes:
“If you build your application to be developer friendly, but ignore how it runs in production, you’ll likely end up with an app that is harder to deploy, operate, and observe. Ease of development and operability aren’t mutually exclusive goals—but if operability is an afterthought, you will probably make many decisions that do limit your options and freedom later.”
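For readers unfamiliar with the terms, here is a minimal Python sketch of one simple approach to flap mitigation: the alert state changes only after several consecutive checks agree. It also shows cancellation (a “RESOLVED” notice) and suppression during maintenance. The `FlapDamper` class, its threshold, and the notification strings are all hypothetical; real monitoring tools implement these features in their own ways.

```python
class FlapDamper:
    """Change alert state only after `threshold` consecutive agreeing checks."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.state = "ok"        # current confirmed state: "ok" or "alerting"
        self._candidate = None   # state we are tentatively moving toward
        self._streak = 0         # consecutive checks supporting the candidate

    def observe(self, healthy, in_maintenance=False):
        """Feed one check result; return "ALERT", "RESOLVED", or None."""
        observed = "ok" if healthy else "alerting"
        if observed == self.state:
            # The check agrees with the confirmed state, so any pending
            # transition is abandoned. This is what absorbs flapping.
            self._candidate, self._streak = None, 0
            return None
        if observed != self._candidate:
            self._candidate, self._streak = observed, 0
        self._streak += 1
        if self._streak < self.threshold:
            return None
        # Enough consecutive evidence: confirm the transition.
        self.state = observed
        self._candidate, self._streak = None, 0
        if in_maintenance:
            return None  # state is tracked, but nobody is notified
        return "ALERT" if observed == "alerting" else "RESOLVED"
```

With a threshold of three, a single failed check followed by a healthy one produces no notification at all, while three failures in a row produce exactly one.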
Fellow experts speaking and writing about observability strongly concur. They remind us: “The truth is that most modern services currently operate at such levels of complexity that it’s not really a matter of ‘if’ you’re testing in production—you are.”
“This is … Nothing We Were Ever Trained For” – Black Widow in The Avengers
Other, more specific best practices for observability are generally not something developers walk through the door knowing. At QConSF 2018, Schwartz pointed out that, “a lot of database automation and database tooling, [and] operations tools are very fragile … it’s not necessarily built by people who have experience running at scale, and it will cause critical outages.” So, there is clearly a cultural need for teams to onboard by teaching best practices and communicating them frequently in constructive ways.
To inspect applications meaningfully at runtime, teams can implement two key practices: enabling Go profiling and building a processlist (the foundation of workload observability). Together, these let them answer questions like, “what requests are in flight across all of our services?”
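A processlist can be as simple as a registry of requests that are currently in flight. The Python sketch below is one way to picture the idea for a single service; the `ProcessList` class and its `track` and `snapshot` methods are hypothetical names, and the sketch does not attempt the cross-service aggregation that the question above implies.

```python
import itertools
import threading
import time
from contextlib import contextmanager


class ProcessList:
    """Thread-safe registry of the requests a service is handling right now."""

    def __init__(self):
        self._lock = threading.Lock()
        self._ids = itertools.count(1)
        self._in_flight = {}  # request id -> (description, start time)

    @contextmanager
    def track(self, description):
        """Register a request for as long as the `with` block is running."""
        with self._lock:
            request_id = next(self._ids)
            self._in_flight[request_id] = (description, time.monotonic())
        try:
            yield request_id
        finally:
            # Remove the entry even if the request handler raised.
            with self._lock:
                self._in_flight.pop(request_id, None)

    def snapshot(self):
        """Return what is in flight now, with elapsed time per request."""
        now = time.monotonic()
        with self._lock:
            return [
                {"id": rid, "request": desc, "elapsed_s": now - started}
                for rid, (desc, started) in self._in_flight.items()
            ]
```

A handler would wrap its work in `with processlist.track("GET /orders"):`, and a debug endpoint could return `processlist.snapshot()` so an operator can see long-running requests while they are still running.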
Moreover, measuring database workloads is complicated because services typically have very high event rates. Recording all requests—including metadata like SQL, user, current database, origin hostname, timestamp, latency, error codes, and so on—is overwhelming. As a result, the best practice is to digest away the variable portions of the SQL or other command text, creating an abstracted statement. This means queries can be grouped into categories or families, which allows metrics about the categories to be generated. Since the reduced data sets from digests are still quite large, SolarWinds® Database Performance Monitor (DPM) offers a blueprint for how developers should write queries and how to make an application’s database workload easy for a monitoring system to digest. Handy stuff.
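The digesting step can be sketched in a few lines of Python. This is a simplified, regex-based approximation for illustration only, not the algorithm used by DPM or any particular database; the `digest` and `summarize` functions are hypothetical, and a production digester would need a real SQL tokenizer to handle comments, quoted identifiers, and dialect differences.

```python
import re
from collections import defaultdict

# Single-quoted string literals, allowing backslash escapes and doubled quotes.
_STRING = re.compile(r"'(?:[^'\\]|\\.|'')*'")
# Standalone integers or decimals (digits inside identifiers are left alone).
_NUMBER = re.compile(r"\b\d+(?:\.\d+)?\b")
# Parenthesized lists of placeholders, e.g. the body of IN (?, ?, ?).
_PLACEHOLDER_LIST = re.compile(r"\(\s*\?(?:\s*,\s*\?)*\s*\)")
_SPACE = re.compile(r"\s+")


def digest(sql):
    """Replace the variable parts of a statement to get its query family."""
    s = _STRING.sub("?", sql)
    s = _NUMBER.sub("?", s)
    # Collapse lists of any length so they all fall into one family.
    s = _PLACEHOLDER_LIST.sub("(?+)", s)
    return _SPACE.sub(" ", s).strip().lower()


def summarize(events):
    """Aggregate (sql, latency_seconds) pairs into per-family metrics."""
    stats = defaultdict(lambda: {"count": 0, "total_latency_s": 0.0})
    for sql, latency_s in events:
        entry = stats[digest(sql)]
        entry["count"] += 1
        entry["total_latency_s"] += latency_s
    return dict(stats)
```

Two statements that differ only in their literal values, such as lookups by different user ids, map to the same digest, so their counts and latencies accumulate under one family rather than producing one metric per distinct query text.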
“Thank You, Sweet Rabbit” – Thor to Rocket
As a successful systems engineering team dives deep into the rabbit hole, it becomes clear that “some databases, especially those that don’t have good built-in observability, [true of most open source databases … ] might have to be instrumented through methods such as network traffic capture or log file analysis.” The eBook lays out a “guerrilla troubleshooting” list, convenient when your team has to turn to sniffing network traffic.
“There’s Still A Lot of Work To Be Done” – Nick Fury to Captain America
Finally, one of the most compelling pieces of advice in Highly Observable Systems boils down to prioritizing customer-facing failures and the customer point of view. Thoughtful observability, engineered from system planning and design onward, reduces the chance that “minor details” in code end up as user pain with expensive business consequences.
The best thing to do is read this updated eBook yourself, apply the best practices, and avoid the common pitfalls, so that you can build a highly observable application and optimize it to better serve your customers. Download your free copy today.