How much does an index need to narrow the results of a search in order to be useful

index

How much does an index need to narrow the results of a search in order to be useful in speeding up queries?

Some examples all across the spectrum:

  • A column for storing true/false values obviously has only two unique values.
  • A 'last name' column probably has many unique values (although it may not).
  • A primary key column has all unique values.

I think that the goal of an index is to quickly narrow a search to a few rows, and that therefore, the last case is the best, the second is OK, and the first is useless.

Am I correct? If so, roughly where is the line of usefulness? For example, if an index can narrow the results to 1% of the rows, is that useful? What about 10% or 25%?

Best Answer

The diversity of a column's data is known as selectivity. Selectivity is useful to know when determining whether an index will be useful, but it is not the only thing that determines the speed benefit. Other factors include how fast the storage the index is on compared to the table, how much of the table/index is already cached, how large the index is in comparison to the table, and several other things.

Knowing the data-type of the column does not necessarily help us determine how selective an index on the column will be. Even a column constrained to two values might use those values for only a few rows and have the remainder NULL. On the other hand, a column that could have many distinct values could have the same value in every single row. Even with your id column where all the rows would have unique values, if you are searching for rows with an id >= 10, the index probably wouldn't be useful even though it is highly selective.

You can't use selectivity alone to determine whether an index will be useful or not because even if it returns 100% of the rows, if the index includes all the data necessary for the query it will be faster than using the table. On the other hand, for a small table it may be faster to query the whole thing even if the row being sought only represents 1% of the total.

Determining what indexes should be created is less about looking at the table structure than it is looking at the important queries and what data they need to retrieve.