There’s a time-tested saying in IT: zero sensors, zero incidents. What goes unmeasured goes unmaintained, unmanaged, and ultimately unprotected. But measurement is no simple thing. What IT teams measure depends on who needs the information and what they need to do with it. So what, exactly, should be measured?
In almost all circumstances, results will have to be filtered, analyzed, and summarized in some fashion. To put it another way, the metrics that matter depend on the stakeholders involved and their specific use cases. If that sounds somewhat non-committal, the problem hasn't gone unnoticed.
Google's Golden Four signals are:
Latency (time it takes to complete a request)
Traffic (the number of simultaneous requests)
Errors (bugs in code, erroring requests, and so on)
Saturation (loading of resources)
Google has defined what’s rapidly becoming the industry-standard approach: start with the Golden Four signals and expand as necessary until key departments have the data they need. This concept has been proven out over decades and is an important part of reliability engineering.
The importance of the four signals is that they apply to all resources supporting application (work) transactions. Every infrastructure element (servers, network, storage, databases) has latency characteristics, demand from workloads, errors, and capacity (saturation) limits, and every layer of the application itself has the same characteristics. Across the board, these signals, and the limits they inform, are a function of the total demand on the system and of how the system's design and configuration determine its performance as that demand changes.
The four signals are the ultimate “metrics that matter.” If they’re all “in the good,” you can be assured applications—and their end users—are experiencing optimized performance and availability.
Reliability Engineering
Google engineers literally wrote the book on monitoring distributed systems, and the data-driven culture that prompted its creation has spread throughout the industry. Google created the concept of the site reliability engineer (SRE).
Site reliability engineering is often viewed as a very specific implementation of DevOps. Setting site reliability engineering apart from other DevOps implementations is the zealous focus on automation and monitoring, a focus shared with network reliability engineering (NRE), a discipline born at Juniper Networks.
An SRE or NRE's focus on automation and monitoring isn’t a trivial consideration and is driven by necessity. Google does everything at outrageous scale. Juniper's customers have the largest networks in the world. SRE, and eventually NRE, were both born from the realization that IT simply cannot be done at scale without automation.
No matter how big your IT team, at some point there are simply too many systems and elements to manage manually. From hosts (whether physical or virtual, on-premises or cloud) and network elements (whether physical or software-defined) to storage components, disparate databases, and an ever-growing number of dynamically changing containers, the environment can be overwhelming to monitor completely and effectively.
Automation makes IT at scale possible, but it has its downsides. Taking the human out of the loop also means the responsible humans won't notice when something goes wrong unless there’s an automated system to tell them about it. This is where the criticality of monitoring enters the reliability engineering discussion.
The Golden Four (and Beyond)
Unfortunately, monitoring isn’t a panacea. The data flowing through a data center (whether on-premises, hybrid, or pure cloud) on an hourly basis can be overwhelming. Even the metadata about the data (metrics) is often beyond the capability of any IT team to track in its entirety; even if they could collect it all, storing it is another matter entirely.
There are nearly infinite metrics you can track, such as IOPS, storage IOPS, uptime/downtime, dropped packets, CPU utilization, memory usage, and many more. While a lot of data isn’t necessarily a bad thing, large amounts of data that aren't understood or relevant can be a dangerous distraction, or worse, a privacy and/or regulatory liability.
What are the most important availability and performance metrics to measure? As with all things, "It depends." Who's asking matters, and feeding different departments within your organization raw data won't necessarily benefit them, so what they intend to do with the information matters just as much.
Google's four signals, or "Golden Signals," are latency, traffic, errors, and saturation. They represent what the end users care about and what the business needs to care about.
Latency is the time it takes to complete an individual transactional request. Apart from "Is it working at all?" latency is the single most important signal for user experience. Today, latency is measured, whenever possible, using application-specific synthetic tests: performing the actions a typical user (or a software component of a larger transactional system) would perform against the application and timing how long they take.
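As a minimal illustration of such a synthetic test, the Python sketch below times a single HTTP request using only the standard library; the endpoint URL and the 500 ms target are hypothetical placeholders. Production synthetic monitoring scripts whole user journeys, but the timing principle is the same.

```python
import time
import urllib.request

def measure_latency(url: str, timeout: float = 5.0) -> float:
    """Time one synthetic request, including transfer of the full response body."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()
    return time.perf_counter() - start

# Hypothetical endpoint and latency target -- substitute your own.
latency = measure_latency("https://app.example.com/health")
print(f"Request completed in {latency:.3f}s")
if latency > 0.5:
    print("Latency is above the 500 ms target for this transaction")
```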
Traffic is the number of simultaneous requests occurring at any given time, but it can and does mean wildly different things in different circumstances, depending on proximity to the technology or to the business. For a network engineer, traffic may mean something like “packets per second”; for the business, it’s more likely “business transactions completed.”
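Whichever definition applies, traffic is usually derived the same way: sample a monotonically increasing counter (packets, requests, or completed transactions) at a fixed interval and divide the difference by the elapsed time. A minimal sketch, with the counter source deliberately left abstract:

```python
import time

def traffic_rate(read_counter, interval: float = 10.0) -> float:
    """Sample a monotonic counter twice and return the per-second rate."""
    first = read_counter()
    time.sleep(interval)
    second = read_counter()
    return (second - first) / interval

# read_counter is a placeholder for whatever exposes the raw count:
# an SNMP interface counter, a web server's cumulative request total,
# or an order system's completed-transaction count.
```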
Errors may be caused by bugs in code or requests failing to complete for other reasons, such as a failure of an infrastructure component. Every element of an application or infrastructure that must be running and performant to service application transactions is subject to errors.
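However the errors originate, they are typically rolled up the same way: count failed requests against total requests over a window of time. A small sketch, assuming responses are represented here by HTTP-style status codes:

```python
def error_rate(status_codes: list[int]) -> float:
    """Fraction of requests in the window that failed (server-side 5xx here)."""
    if not status_codes:
        return 0.0
    failures = sum(1 for code in status_codes if code >= 500)
    return failures / len(status_codes)

# Example window: two server-side failures out of six requests.
window = [200, 200, 503, 200, 500, 201]
print(f"Error rate: {error_rate(window):.1%}")  # Error rate: 33.3%
```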
Finally, saturation is a measurement of load on resources or, more specifically, of each resource’s capacity to sustain a given transactional traffic load while still delivering acceptable latency.
Each of these signals is an ideal—a concept to strive for. To determine these signals, more fundamental metrics must be measured. Consider saturation for a moment.
Every computing resource (whether compute, network, or storage related) has a limit, almost always well below 100%, beyond which pushing utilization results in massively degraded performance. The key point with saturation is that incremental utilization increases of, for example, one percent don’t correlate with a similar one percent degradation in performance (latency and traffic). Performance in terms of latency and traffic (work accomplished) varies non-linearly with the utilization of the resources supporting the application, because queuing-based systems (such as computing networks) behave non-linearly with respect to the demand/utilization relationship. The goal of measuring saturation is to help reliability engineers keep utilization below that limit while still maximizing efficiency and controlling costs.
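The non-linearity is easy to see with even the simplest queuing model. The sketch below uses the textbook M/M/1 result, where average response time is the idle service time divided by (1 - utilization); it's an idealization rather than a model of any particular system, but it shows why the last few percent of utilization are so expensive.

```python
def mm1_response_time(service_time: float, utilization: float) -> float:
    """Average response time of an M/M/1 queue: service_time / (1 - rho)."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return service_time / (1.0 - utilization)

# A resource that takes 10 ms per request when otherwise idle:
for rho in (0.50, 0.70, 0.80, 0.90, 0.95, 0.99):
    print(f"utilization {rho:.0%}: ~{mm1_response_time(0.010, rho) * 1000:.0f} ms")
# 50% -> ~20 ms, 90% -> ~100 ms, 99% -> ~1000 ms: a one percent change near
# the limit costs far more latency than a one percent change at moderate load.
```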
Systems administrators will be intimately familiar with the importance of monitoring storage, compute, and networking, but at scale, saturation also includes the health of load balancers, the content distribution network, various automation managers, and much more.
Modern applications are broken out into microservices, and microservices can live on-premises, in the public cloud, at the edge, and anywhere in between. Microservices-based applications rely on message queues to shuffle information between services, and those message queues can easily become saturated, long before the underlying infrastructure resource utilization rates approach their intrinsic limits.
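Queue depth (or consumer lag) is therefore often the earliest saturation signal in a microservices application. The sketch below is deliberately generic: get_queue_depth is a placeholder for whatever your broker exposes, such as a RabbitMQ management API query, Kafka consumer lag, or a cloud queue's approximate message count.

```python
import time

def watch_queue(get_queue_depth, warn_depth: int = 1_000, interval: float = 30.0):
    """Poll a message queue and flag backlog growth before it saturates."""
    previous = get_queue_depth()
    while True:
        time.sleep(interval)
        current = get_queue_depth()
        growth = (current - previous) / interval  # messages/second the backlog is growing
        if current > warn_depth or growth > 0:
            print(f"queue depth {current}, backlog growing at {growth:.1f} msg/s")
        previous = current
```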
Each application may need different underlying metrics to be measured for reliability engineers to tease out the Golden Four signals, and a significant percentage of a reliability engineer's time is spent determining which underlying metrics need to be measured.
In Practice
Teasing signals out of the noise of raw metrics isn’t easy, and a rigid focus on the Golden Four may seem like a departure from earlier monitoring best practices. In reality, reliability engineering is merely a rational evolution of those best practices in response to both new technology and the radical scale at which it’s being adopted and deployed.
For example, traditional monitoring practices used metrics about IT infrastructure resource utilization as a proxy for performance simply because there were no direct metrics for application response time or transactional throughput (traffic). In the days of monolithic applications running on dedicated, physical infrastructure, it was relatively easy to build a knowledge base of exactly where the utilization limits lay, how they correlated to the actual performance (latency and traffic) being delivered, and how close one was running to saturation. Virtualization added a layer of complexity to that approach. Cloud-hosted, microservices-based applications amplified this complexity and obscurity to the point where utilization-based proxies for true performance are no longer workable. Actual performance must be monitored, measured, and managed. Today, application performance management (APM) tools can provide performance information (such as latency, traffic, and errors) directly.
This may decrease the importance of measuring certain resource utilization metrics, but it doesn't eliminate the need. While most consumers of monitoring data will only care about the output of the Golden Four signals, there are usually individuals in various roles who will need that resource utilization data. This is true because the age-old relationship between performance and resource utilization hasn’t changed. Queuing Theory 101: resource contention creates non-linear behavior in terms of application performance measured by the metrics that matter.
Storage administrators, as just one example, will still need to know how much storage performance capability (in terms of ability to deliver I/O per second and GB per second of throughput) to buy, and when and where to install it. But the ability to measure application latency directly can radically reduce the frequency with which storage utilization data needs to be sampled. As a result, the storage required to house that information, the network capacity to transmit it, and the compute capacity to analyze it may be substantially reduced. Why? Because, in general, as long as applications are delivering acceptable latency and throughput (traffic), the system as a whole is, by definition, delivering acceptable performance, no matter what the individual resource utilization rates might be. And when resource capacity can be reduced without impacting performance, operational cost efficiency increases.
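The savings are simple arithmetic. As a hypothetical example, relaxing a utilization metric's sampling interval from every 10 seconds to every 5 minutes cuts the data volume for that metric by a factor of 30:

```python
SECONDS_PER_DAY = 86_400

samples_at_10s = SECONDS_PER_DAY // 10    # 8,640 samples per metric per day
samples_at_5min = SECONDS_PER_DAY // 300  # 288 samples per metric per day
print(samples_at_10s / samples_at_5min)   # 30.0x less to store, transmit, and analyze
```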
Traditional metrics are therefore still needed when judgments have to be made about resource contention. When the cumulative demand on resources exceeds the total available supply, even the most optimized application will be slow if it has to “fight” for scarce resources. Traditional metrics thus remain indispensable for hunting down and eliminating bottlenecks, and for informing judgment calls about which applications are prioritized over others.
This type of analysis falls under the "saturation" signal and is an excellent example of how metrics that were originally important for one purpose (determining latency) are now far more important for another (determining saturation).
Extract Signals From the Noise With APM Tools
Once you know what you want to monitor, the best place to start your journey toward reliability engineering is with APM tools. There are APM tools capable of singling out key metrics and presenting them in an easily digested manner. As you begin to rely on these tools, it will become easier to reevaluate the metrics you traditionally monitored and the frequency at which you sampled them.
Integrated, full-stack visibility—which relies on those fundamental metrics such as network, storage, database, and CPU utilizations—remains important for reliability engineers, as they inform the actionable insights that are ultimately the reason why APM tools are deployed. What will change during the transition to reliability engineering is how much data is collected and about which metrics, as well as how frequently and where it’s analyzed.
Instead of staring at real-time data feeds, reliability engineers start with automated reports or dashboards that line up various metrics chronologically. When errors, latency spikes, or other issues are discovered, reliability engineers turn to their monitoring applications, APM tools, and analytics packages to extract signals from the noise.