SELECT * FROM PEOPLE WHERE FIRST=? AND LAST=?
The question-marks are placeholders for real values. Now, if there’s an index on (FIRST, LAST), how will this query work? We don’t know until we fill in the placeholders.
If we fill in (“Baron”, “Schwartz”), it’ll find very few rows. I think there might be a couple more people with my name in the U.S., but there aren’t very many. (Fun fact: I’ve only met one other person named Baron in my life: Baron Wormser, formerly the poet laureate of Maine.)
But if we execute the query with (“John”, “Smith”) it’s going to match a lot of rows. According to howmanyofme.com, it’ll match about 47,000 rows! If we’re looking for a specific John Smith, that’s a lot of results to sift through to find the one we want.
It could be worse, though: we could be searching for James, in an index that only includes the first name, leaving the last name un-indexed. There were 4.8 million people named James in the 1990 US Census records, out of a total of 249 million people—about 2% of the overall population and 3.3% of the male population. That’s a lot of rows to filter to find the one we want.
And that’s what databases do: if an index doesn’t narrow down the matching rows enough, the database will use the index to find a set of rows that match, and then scan and filter them to discard rows it doesn’t want. One of the most important purposes of an index is to narrow the initial candidate of matching rows as much as possible, reducing this work.
That’s where selectivity comes in. We’ve already been discussing selectivity, without naming it. Selectivity is a measure of how much the index narrows the search for values. There are two ways to express it:
- Average selectivity related to the cardinality. So if there are, say, 10,000 distinct first names in the United States (a number I’m having a hard time getting facts to support), the average selectivity of an index on the first name is 10,000/249,000,000=0.00004.
- The selectivity of a specific value is the number of rows with that value, divided by the total number of rows. The selectivity of “Baron Schwartz” is much better than “John Smith.”