What Is Cardinality in Data Modeling?
The first meaning of cardinality is when you’re designing the database—what’s called data modeling. In this sense, cardinality means whether a relationship is one-to-one, many-to-one, or many-to-many. So you’re really talking about the relationship cardinality.Watch the on-demand webcast: How to Conduct a Database Design Review
High and Low Database Cardinality Definition
The more important definition of cardinality for query performance is data cardinality. This is all about how many distinct values are in a column. We usually don’t talk about cardinality as a number, though. It’s more common to simply talk about “high” and “low” cardinality. A lot of distinct values is high cardinality; a lot of repeated values is low cardinality.Cardinality in Database Example
Picture a product description table in an e-commerce database:
ProductID
column is going to have high cardinality because it’s probably the primary key of the table, so it’s totally unique. If there’s a thousand rows in the table, there’ll be a thousand different ProductID
values.
The Category
column will have a lot of repetition, and it’ll have low or medium cardinality: maybe 50 or 100 different Category
values.
Name
probably has high cardinality, unless there’s more to this table than meets the eye (such as multiple rows for different product colors and other variations).
Cardinality in Time Series Databases
In addition to cardinality in databases, I also want to help simplify what it means to use cardinality in monitoring. If you’ve seen discussions of “high-cardinality dimensions” or “observability requires support for high-cardinality fields,” this is what we’re talking about. So what does it mean? Generally, this refers to the number of series in a time series database. A time series is a labeled set of values over time, stored as (timestamp, number) pairs. So, for example, you might measure CPU utilization and store it in a time series database:os.cpu.util = [(5:31, 82%), (5:32, 75%), (5:33, 83%)...]
This data model is the canonical starting point for most monitoring products. But it doesn’t contain a lot of richness: what if I have a lot of servers and want to know the average CPU utilization of, say, database servers versus web servers? How can I filter one kind versus the other? To solve this problem, many monitoring systems nowadays support tags with extra information. One way to conceptualize this is to make those data points N-dimensional instead of simply timestamps and numbers:
os.cpu.util = [(5:31, 82%, role=web), (5:32, 75%, role=web), (5:33, 83%, role=web)...]
This looks wasteful, doesn’t it? We’ve repeated “role=web” again and again, and we should be able to do it just once. Plus, most time series software typically tries to avoid N-dimensional storage because time-value pairs can be encoded efficiently—it’s much harder to build a database capable of storing these arbitrary name=value tags.
The typical time series monitoring software solves this by storing the tags with the series identifier, making it part of the identifier:
(name=os.cpu.util,role=web) = [(5:31, 82%), (5:32, 75%), (5:33, 83%)...]
But what if “role” changes over time? What if it’s not constant, even within a single server? Most existing time series software says, well, it’ll become a new series when a tag changes because the tag is part of the series identifier:
(name=os.cpu.util,role=web) = [(5:31, 82%), (5:32, 75%)] (name=os.cpu.util,role=db) = [(5:33, 83%)...]
When people talk about cardinality in monitoring and how it’s hard to handle high-cardinality dimensions, they’re basically talking about how many distinct combinations of tags there are, and thus the number of series. And there can be lots of tags, so there can be lots of combinations!
(name=os.cpu.util,role=web, datacenter=us-east1, ami=ami-5256b825, …) = [...]
Most of these tags are pretty static, but when one of the tags has high cardinality, it simply explodes the number of combinations of tags. A tag with medium cardinality, for example, could be a build identifier for a deployment artifact. High-cardinality tags would come from the workload itself: customer identifier. Session ID. Request ID. Remote IP address. That type of thing.
Most time series databases instantly crumble under these workloads because their data model and storage engine are optimized for storing points efficiently and aren’t optimized for lots of series. For example:

(name=os.cpu.util,role=web) = [(5:31, 82%, build_id=ZpPZ5khe)...]
But some of these tags are more special than others, so to speak. InfluxDB talks about which tags are “indexed”—which ones are part of the series ID and pre-grouped-by. In general, software with special and non-special tags like this usually has some restriction around operations on it: maybe you can’t filter by some tags, or maybe you can’t group by some tags, or so on. Druid is in the same vein as InfluxDB in this regard.
And this is where the focus of products and technologies suddenly becomes clear. Traditional time series software was designed with an internal-facing SysAdmin worldview in mind, where we inspected our own systems/servers and cardinality was naturally low. This is where RRD files came from. We looked inward to figure out whether our systems were working.
But now, in the age of observability, forward-thinking engineers (and vendors) are focused on measuring and understanding the workload or events. Workload (query/event) measurements are very high-cardinality data sets by nature.
What’s the job of a database? To run queries. How do you know if it’s working? Measure whether the queries are successful, fast, and correct! Don’t look at the CPU and disk utilization—it’s the wrong place to look.
So when someone mentions high cardinality in monitoring, why it’s important, and why it’s hard, what does it mean? I’ll summarize:
- Workload or event data is the right way to measure customers’ actions and experiences.
- System workload is many-dimensional data, not just one-dimensional values over time, and it has very high cardinality.
- Traditional time series databases were designed with a system-centric worldview and thus weren’t built to store or query workload data.
