I've written before about the minimal set of metrics that can serve effectively as application/service vital signs. One such set is the RED acronym, which stands for Request Rate, Request Errors, and Request Duration. (I'll write in the future about what's missing from this acronym, but it'll serve the purpose for now.)
With RED, you can glance at a service and quickly understand whether it's okay. Is it in trouble? The Rate (throughput) metric will tell you whether it's experiencing increased traffic, and the Errors (also a throughput) metric will show whether there's an elevated error rate. Is it providing good quality of service? The Duration (latency) metric can provide a clue if it has changed; this should be a tail latency metric, such as the 99th percentile.
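To make the three numbers concrete, here's a minimal Python sketch that computes them from one window of request records. The `Request` type, its field names, and `red_summary` are hypothetical names I've made up for illustration, and the nearest-rank percentile is just one simple way to get a p99. A real service would more likely rely on the counters and histograms in its metrics library.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """One observed request (hypothetical record shape)."""
    duration_ms: float
    is_error: bool


def red_summary(requests, window_seconds):
    """Summarize one time window as Rate, Errors, and Duration (p99)."""
    if not requests:
        return {"rate": 0.0, "errors": 0.0, "p99_ms": None}

    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank p99: the smallest sample that at least 99% of
    # samples are less than or equal to.
    rank = math.ceil(len(durations) * 99 / 100)

    return {
        "rate": len(requests) / window_seconds,                         # requests per second
        "errors": sum(r.is_error for r in requests) / window_seconds,  # errors per second
        "p99_ms": durations[rank - 1],                                  # tail latency
    }
```

Note that Errors is reported per second here, like Rate, which matches treating both as throughput metrics.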
But what RED doesn't answer, alone, is whether the service itself is the issue, or if the trouble comes from one of the dependencies. In today's highly distributed apps, it's typical for services to call other services, forming a complex chain of dependencies. If the service's p99 latency is unusually high, it might just be a problem with one of those dependencies. And dependency graphs can get really complex.
This is part of the genius of RED: the metrics are universal, and if you do it right, you can ensure that the following are true:
- Every service exposes the RED metrics about itself.
- Every service knows its dependencies upon other services. (This is outside the scope of this post).
- Everyone knows the first two points are true.
Diagnosing a service's issues, then, can become much simpler. If I'm responsible for a service and it's having an issue, the process is as follows:
- Examine every service I'm dependent on. Are their RED metrics okay?
- If yes, the problem is in my service, and I can narrow my search.
- If not, the problem is likely in that dependency, and the same process can be repeated there.
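The process above can be sketched as a short walk over the dependency graph. This is only an illustration under assumptions: `find_suspects`, the `dependencies` mapping, and the `is_healthy` callback (which stands in for "are this service's RED metrics okay?") are all hypothetical names, and recursing into an unhealthy dependency is just one way to handle the case where a dependency's metrics are not okay.

```python
def find_suspects(service, dependencies, is_healthy, _path=()):
    """Return the services where trouble most likely originates.

    dependencies: dict mapping a service name to the names it calls.
    is_healthy:   callable returning True if a service's RED metrics look okay.
    """
    path = _path + (service,)
    suspects = []
    for dep in dependencies.get(service, []):
        if dep in path:
            continue  # guard against circular dependencies
        if not is_healthy(dep):
            # The dependency looks unhealthy: repeat the process from there.
            for suspect in find_suspects(dep, dependencies, is_healthy, path):
                if suspect not in suspects:
                    suspects.append(suspect)
    # If every dependency looks fine, the problem is in this service.
    return suspects or [service]


# Usage with a made-up graph:
deps = {"checkout": ["cart", "payments"], "payments": ["bank-gateway"]}
unhealthy = {"payments", "bank-gateway"}
print(find_suspects("checkout", deps, lambda s: s not in unhealthy))
```

The point of the sketch is that the health check is the same for every service, so the walk needs no service-specific knowledge.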
This process is much simpler than needing to know what metrics matter for any given service, about which I might know nothing. And if everyone published dashboards and just built what they thought was best, I'd get highly variable results: some services would have no dashboard, others would have complicated and confusing ones filled with pointless metrics, etc.
You get the point: RED can serve as a very useful, minimal set of metrics, and if applied universally and consistently, it can dramatically simplify the process of navigating, understanding, and diagnosing complex, distributed, interdependent systems like microservice APIs.