What Is Cardinality in Monitoring?
I wrote a couple of “definitions and nuances” posts about terminology in databases recently (cardinality, selectivity), and today I want to write one about cardinality in monitoring, as opposed to cardinality in databases. If you’ve seen discussions of “high-cardinality dimensions” or “observability requires support for high-cardinality fields,” this is what we’re talking about today. So, what does it mean?
First, a quick recap: the term cardinality comes from math, where it means the number of things in a set. In databases, it means the number of distinct values in a table column. And… in monitoring? What does it mean there?
Generally, it refers to the number of series in a time series database. A time series is a labeled set of values over time, stored as (timestamp, number) pairs. So, for example, you might measure CPU utilization and store it in a time series database:
os.cpu.util = [(5:31, 82%), (5:32, 75%), (5:33, 83%)...]
This data model is the canonical starting point for most monitoring products. But it doesn’t contain a lot of richness: what if I have a lot of servers and I want to know the average CPU utilization of, say, database servers versus web servers? How can I filter one kind versus the other kind? To solve this problem, a lot of monitoring systems nowadays support tags with extra information. One way to conceptualize this is to make those data points N-dimensional instead of simply timestamps and numbers:
os.cpu.util = [(5:31, 82%, role=web), (5:32, 75%, role=web), (5:33, 83%, role=web)...]
That looks pretty wasteful, doesn’t it? We’ve repeated “role=web” again and again, and we should be able to do it just once. Plus, most time series software tries to avoid N-dimensional storage because time-value pairs can be encoded really efficiently; it’s a lot harder to build a database that can store these arbitrary name=value tags.
So the typical time series monitoring software solves this by storing the tags with the series identifier, making it part of the identifier:
(name=os.cpu.util,role=web) = [(5:31, 82%), (5:32, 75%), (5:33, 83%)...]
But what if “role” changes over time? What if it’s not constant, even within a single server? Most existing time series software says, well, it’ll become a new series when a tag changes, because the tag is part of the series identifier:
(name=os.cpu.util,role=web) = [(5:31, 82%), (5:32, 75%)]
(name=os.cpu.util,role=db) = [(5:33, 83%)...]
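Here’s a minimal Python sketch of that behavior. It assumes a hypothetical in-memory store; the names `series_key`, `store`, and `write` are mine for illustration, not any product’s API. The tag set is folded into the dictionary key, so a change in “role” quietly starts a second series:

```python
from collections import defaultdict

def series_key(name, tags):
    """Build a hashable series identifier from the metric name and its tags."""
    # Sort the tags so the same tag set always yields the same key.
    return (name, tuple(sorted(tags.items())))

# Hypothetical in-memory store: series identifier -> list of (timestamp, value).
store = defaultdict(list)

def write(name, tags, timestamp, value):
    """Append one point to whichever series the name and tags identify."""
    store[series_key(name, tags)].append((timestamp, value))

write("os.cpu.util", {"role": "web"}, "5:31", 82)
write("os.cpu.util", {"role": "web"}, "5:32", 75)
write("os.cpu.util", {"role": "db"}, "5:33", 83)  # role changed: new series

print(len(store))  # 2
```

Same metric, same server, three points, but two series, because the identifier changed partway through.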
When people talk about cardinality in monitoring, and how it’s hard to handle high-cardinality dimensions, they’re basically talking about how many distinct combinations of tags there are, and thus the number of series. And there can be lots of tags, so there can be lots of combinations of them!
(name=os.cpu.util,role=web, datacenter=us-east1, ami=ami-5256b825, …) = [...]
Most of those tags are pretty static, but when one of the tags has high cardinality, it simply explodes the number of combinations of tags. A tag that might have medium cardinality, for example, would be a build identifier for a deployment artifact. High cardinality tags would come from the workload itself: Customer identifier. Session ID. Request ID. Remote IP address. That type of thing.
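To put rough numbers on that explosion, here’s a small Python sketch. The per-tag counts are made up for illustration, and the product is only an upper bound: the real series count is the number of tag combinations that actually occur.

```python
from math import prod

# Hypothetical counts of distinct values per tag; the numbers are made up.
distinct_values = {
    "role": 5,
    "datacenter": 4,
    "ami": 20,
}

def max_series(tag_counts):
    """Upper bound on series for one metric: the product of per-tag cardinalities."""
    return prod(tag_counts.values())

baseline = max_series(distinct_values)
# Add one workload-derived tag with 100,000 distinct values.
with_customer = max_series({**distinct_values, "customer_id": 100_000})

print(baseline)       # 400
print(with_customer)  # 40000000
```

Three fairly static tags cap out at a few hundred series per metric. One customer identifier tag raises the ceiling to tens of millions.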
Most time series databases instantly crumble under these workloads, because their data model and storage engine are optimized for storing points efficiently, not for handling lots of series.
Some of the more modern time series databases are built for a lot of series. InfluxDB is an example. They’ve put a lot of work into handling tons of series, and have quite a bit in their documentation about how InfluxDB deals with high cardinality.
But that’s all about storage, and whether the time series database can handle storing lots of series. What about retrieval? Can the time series database handle arbitrary queries against its data, without regard to the nature and cardinality?
Typical time series databases can’t, because they’re built around and designed to operate within the constraints of series. Think of a series as a “lane” of data: typical time series databases can only swim within that lane when they run a query; they can’t swim perpendicular to the lanes. This is because a series is a pre-aggregation of the original source data. The series identifier is what all the values in that series have in common, and when the data was serialized (so to speak), it was aggregated around the series identifier at write time. The series ID has exactly the same function as GROUP BY fields in a SQL database. But unlike using GROUP BY in SQL, typical time series databases do the GROUP BY when they ingest the data. And once grouped, the data can’t ever be ungrouped again.
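A toy Python sketch of the difference, using made-up request events (the field names and the `group_sum` helper are hypothetical). Grouping at write time keeps only the per-role totals. Keeping the raw events lets you group by any dimension later:

```python
from collections import defaultdict

# Hypothetical raw events: each request keeps all of its dimensions.
events = [
    {"role": "web", "customer": "a", "latency_ms": 120},
    {"role": "web", "customer": "b", "latency_ms": 900},
    {"role": "db", "customer": "a", "latency_ms": 30},
    {"role": "db", "customer": "b", "latency_ms": 40},
]

def group_sum(rows, dimension):
    """GROUP BY one dimension and total the latency for each of its values."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[dimension]] += row["latency_ms"]
    return dict(totals)

# Grouping by role at ingest: this is all a series-oriented store would keep.
by_role = group_sum(events, "role")

# With the raw events still around, grouping by customer is just another query.
by_customer = group_sum(events, "customer")

print(by_role)      # {'web': 1020, 'db': 70}
print(by_customer)  # {'a': 150, 'b': 940}
```

Nothing in `by_role` tells you that customer “b” accounts for most of the latency. That information was discarded when the data was grouped.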
Some databases, again InfluxDB for example, have both tags-as-series-identifier and multi-dimensional values. It’s sort of like this:
(name=os.cpu.util,role=web) = [(5:31, 82%, build_id=ZpPZ5khe)...]
But some of those tags are more special than others, so to speak. InfluxDB talks about which tags are “indexed”—which ones are part of the series ID, and pre-grouped-by. In general, software with special and non-special tags like this usually has some restriction around operations on it: maybe you can’t filter by some tags, or maybe you can’t group by some tags, or so on. Druid is in the same vein as InfluxDB in this regard.
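As a rough illustration of why the two kinds of tags behave differently, here’s a Python sketch of a hypothetical layout. It is not how InfluxDB or Druid actually store data, and the second build ID and the extra data points are invented. “role” is part of the series ID, while “build_id” travels with each point:

```python
# Hypothetical layout: "role" is an indexed tag (part of the series ID),
# while "build_id" is an ordinary per-point field.
series = {
    ("os.cpu.util", "web"): [("5:31", 82, "ZpPZ5khe"), ("5:32", 75, "Qx7Lm2aa")],
    ("os.cpu.util", "db"): [("5:31", 40, "ZpPZ5khe")],
}

def points_for_role(role):
    """Indexed tag: a single dictionary lookup finds the series."""
    return series.get(("os.cpu.util", role), [])

def points_for_build(build_id):
    """Non-indexed field: every point in every series must be scanned."""
    return [
        point
        for points in series.values()
        for point in points
        if point[2] == build_id
    ]

print(len(points_for_role("web")))        # 2
print(len(points_for_build("ZpPZ5khe")))  # 2
```

Filtering by the indexed tag is a lookup. Filtering by the per-point field is a full scan, which is one reason a system might restrict what you can do with the “non-special” tags.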
And this is where the focus of products and technologies suddenly becomes clear. Traditional time series software was designed with an internal-facing sysadmin worldview in mind, where we inspected our own systems/servers, and cardinality was naturally low. This is where RRD files came from. We looked inwards to figure out if our systems were working.
But now, in the age of observability, forward-thinking engineers (and vendors) are focused on measuring and understanding the workload or events. Workload (query/event) measurements are very high-cardinality datasets by nature.
What’s the job of a database? To run queries. How do you know if it’s working? Measure whether the queries are successful, fast, and correct! Don’t look at the CPU and disk utilization, that’s the wrong place to look. This philosophy—deprioritize the app, prioritize inspecting quality of service—is also foundational for Honeycomb, a next-generation observability platform:
Do you handle high-cardinality dimensions? If no, you are not doing observability. Exploring your systems and looking for common characteristics requires support for high-cardinality fields as a first-order group-by entity. It’s theoretically possible there are systems where the ability to group by things like user, request ID, shopping cart ID, source IP etc. is not necessary, but I’ve never seen one.
And it informs the monitoring and observability mindset of Mark McBride at Turbine Labs, too:
Stepping back to consider your service from a customer viewpoint simplifies things. Customers don’t care about your services’ internal details, they care about the quality of their experience.
(Mark’s talk at Velocity is on YouTube; highly recommended viewing).
So when someone mentions high cardinality in monitoring, and why it’s important, and why it’s hard, what does it mean? I’ll summarize:
- Workload or event data is the right way to measure customers’ actions and experiences.
- System workload is many-dimensional data with very high cardinality, not just one-dimensional values over time.
- Traditional time series databases were designed with a system-centric worldview and thus weren’t architected to store or query workload data.
- Using traditional tools to measure, inspect, and troubleshoot customers’ experiences is basically impossible because of pre-aggregation and cardinality limitations, and leads engineers to focus on what the tool can offer them—which is often the wrong place to look.
I want to close by drawing attention again to something Mark McBride’s blog post and talk lay out so elegantly. In the age of service-oriented architectures (microservices, if you must), every team and service has customers. Every one. Maybe the customers are other services, not external users. But focus on the quality of service it’s providing, as measured by correctness, speed, and consistency, and you’ll find the problems a lot faster.