
Hierarchical Observability with RED

November 7, 2017 | Database

I've written before about the minimal set of metrics that can serve effectively as application/service vital signs. One such set is the RED acronym, which stands for Request Rate, Request Errors, and Request Duration. (I'll write in the future about what's missing from this acronym, but it'll serve the purpose for now.)

With RED, you can glance at a service and quickly understand whether it's okay. Is it in trouble? The Rate (throughput) will tell you at a glance whether it's experiencing increased traffic or not, and the Errors (also a throughput) metric will show whether there's an elevated error rate. Is it providing good quality of service? If the latency has changed, that can provide a clue (this should be a tail latency metric, such as 99th percentile).

But what RED doesn't answer, alone, is whether the service itself is the issue, or if the trouble comes from one of the dependencies. In today's highly distributed apps, it's typical for services to call other services, forming a complex chain of dependencies. If the service's p99 latency is unusually high, it might just be a problem with one of those dependencies. And dependency graphs can get really complex.

This is part of the genius of RED: the metrics are universal, and if you do it right, you can ensure that the following are true:

1. Every service exposes the RED metrics about itself.
2. Every service knows its dependencies upon other services. (This is outside the scope of this post.)
3. Everyone knows #1 and #2.
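As a rough illustration of what "exposing the RED metrics" could look like, here is a minimal Python sketch that reduces one window of request records to the three numbers. The names `Request`, `RedMetrics`, and `summarize` are hypothetical, invented for this example, and the nearest-rank p99 is just one reasonable way to get a tail latency. A real service would more likely use its metrics library's counters and histograms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    duration_ms: float  # how long the request took
    is_error: bool      # whether the request failed


@dataclass(frozen=True)
class RedMetrics:
    rate: float        # requests per second
    error_rate: float  # failed requests per second (also a throughput)
    p99_ms: float      # 99th-percentile duration, nearest-rank method


def summarize(requests: list[Request], window_seconds: float) -> RedMetrics:
    """Reduce one observation window of requests to Rate, Errors, Duration."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    if not requests:
        return RedMetrics(rate=0.0, error_rate=0.0, p99_ms=0.0)

    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank percentile: ceil(0.99 * n), computed with integers
    # to avoid floating-point rounding surprises.
    rank = (99 * len(durations) + 99) // 100
    errors = sum(1 for r in requests if r.is_error)

    return RedMetrics(
        rate=len(requests) / window_seconds,
        error_rate=errors / window_seconds,
        p99_ms=durations[rank - 1],
    )
```

Because every service would report the same three fields, a reader does not need to know anything service-specific to interpret them.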

Diagnosing a service's issues, then, can become much simpler. If I'm responsible for a service and it's having an issue, the process is as follows:

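1. Examine every service I'm dependent on. Are their RED metrics okay?
2. If yes, the problem is in my service, and I can narrow my search.
3. If not, the trouble likely lies in (or beneath) the unhealthy dependency, and the same process can be repeated there.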
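The process above can be sketched as a walk down the dependency graph. This is only one reading of it, under stated assumptions: the function `find_suspects`, the `dependencies` mapping, and the `is_healthy` callback are all hypothetical names, and the sketch assumes the check is repeated inside any dependency whose RED metrics look bad. How a service is judged healthy (thresholds on rate, errors, and p99) is left to the callback.

```python
from typing import Callable


def find_suspects(
    service: str,
    dependencies: dict[str, list[str]],
    is_healthy: Callable[[str], bool],
) -> list[str]:
    """Return the services where the trouble most likely originates.

    Starting from a service known to be in trouble, look at its direct
    dependencies. If all of them look healthy, the service itself is the
    suspect. Otherwise, repeat the check inside each unhealthy dependency.
    """
    suspects: list[str] = []
    visited: set[str] = set()

    def visit(name: str) -> None:
        if name in visited:  # avoid loops and repeated work
            return
        visited.add(name)
        unhealthy = [d for d in dependencies.get(name, []) if not is_healthy(d)]
        if not unhealthy:
            # Every dependency looks fine, so narrow the search to this service.
            suspects.append(name)
            return
        for dep in unhealthy:
            visit(dep)

    visit(service)
    return suspects


# Hypothetical service graph, for illustration only.
deps = {
    "checkout": ["cart", "payments"],
    "payments": ["bank-gateway", "ledger"],
}
unhealthy_now = {"payments", "bank-gateway"}
print(find_suspects("checkout", deps, lambda s: s not in unhealthy_now))
# prints ['bank-gateway']
```

The sketch only needs two inputs per service, a health verdict derived from RED and a list of dependencies, which is the point of applying the same metrics everywhere.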

This is much simpler than needing to know what metrics matter for any given service, about which I might know nothing. And if everyone published dashboards and just built what they thought was best, I'd get highly variable results: some services would have no dashboard, others would have complicated and confusing ones filled with pointless metrics, etc. You get the point: RED can serve as a very useful minimalistic set of metrics, and if universally and consistently applied, can dramatically simplify the process of navigating, understanding, and diagnosing complex, distributed, interdependent systems like microservices APIs.

Baron Schwartz

Baron is a performance and scalability expert who participates in various database, open-source, and distributed systems communities. He has helped build and scale many large,…
