Databases have a lot of jargon, and cardinality
is one of those words that experienced people tend to forget that they didn’t know once upon a time. But if you don’t know it—and it takes a while to really get comfortable with cardinality—it’s super confusing when the DBA just drops it into the middle of a sentence without slowing down! Fear not: I’ve got you, as they say.
Cardinality means two things in databases. For our purposes, one matters a lot more than the other. Let’s do the simple one first, and then dig into the one that matters for query performance.
The definition of cardinality that matters a lot for query performance is data
cardinality. This is all about how many distinct values are in a column.The first meaning of cardinality is when you’re designing the database—what’s called data modeling. In this sense, cardinality means whether a relationship is one-to-one, many-to-one, or many-to-many. So you’re really talking about the relationship
Cardinality’s official, non-database dictionary definition is mathematical: the number of values in a set. When applied to databases, the meaning is a bit different: it’s the number of distinct values in a table column, relative to the number of rows in the table
. Repeated values in the column don’t count.
We usually don’t talk about cardinality as a number
, though. It’s more common to simply talk about “high” and “low” cardinality. A lot of distinct values is high cardinality; a lot of repeated values is low cardinality.
Picture a product description table in an e-commerce database:
column is going to be high-cardinality because it’s probably the primary key of that table, so it’s totally unique. If there’s a thousand rows in the table, there’ll be a thousand different
column will have a lot of repetition, and it’ll be low or medium cardinality: maybe 50 or 100 different
is probably high cardinality, unless there’s more to this table than meets the eye (such as multiple rows for different product colors and other variations).
Cardinality impacts performance a lot, because it influences the query execution plan. The planner will examine column statistics and use them to figure out how many values a query is likely to match, among other things. Depending on what it finds, it might use different query execution plans to try to get the best performance. But that’s a topic for a different blog post, because it takes a bit of work to explain.
So now, next time someone drops “high cardinality” into a sentence without pausing, you know they really mean “a lot of different values.” And there you have it!