SQL Server – Clustered index on a column with low selectivity

index-tuning sql-server-2012

I have a table with a lot of columns. Two kinds of lookup are performed: 1) lookup by the name column (this happens rarely, maybe twice a season), and 2) lookup by the date column. The name column has very high selectivity (almost all rows are distinct), whereas the date column has very low selectivity (only a handful of distinct values across a table of 300,000 records).

I need to improve performance by creating an index. I wanted to create a clustered index on the date column, but my understanding is that doing so could actually slow my queries down, given the low selectivity.

What should I do: create a clustered index on date, a clustered index on name, or no indexes at all? The name lookup happens rarely, but date lookup queries run all the time.
Thanks a lot!

I'm using SQL Server 2012.

Best Answer

Low Selectivity

Here is a good quote from SqlServerCentral:

In general, a nonclustered index should be selective. That is, the values in the column should be fairly unique and queries that filter on it should return small portions of the table.

The reason for this is that key/RID lookups are expensive operations and if a nonclustered index is to be used to evaluate a query it needs to be covering or sufficiently selective that the costs of the lookups aren’t deemed to be too high.

If SQL considers the index (or the subset of the index keys that the query would be seeking on) insufficiently selective then it is very likely that the index will be ignored and the query executed as a clustered index (table) scan.

It is important to note that this does not just apply to the leading column. There are scenarios where a very unselective column can be used as the leading column, with the other columns in the index making it selective enough to be used.

A couple of things to consider when indexing:

When To Avoid Indexing

  • Indexes should not be used on small tables.
  • Indexes should not be used on tables that have frequent, large batch update or insert operations.
  • Indexes should not be used on columns that contain a high number of NULL values.
  • Columns that are frequently manipulated should not be indexed.
  • Do not add more indexes when a table already has a lot of them (do not over-index a table).
  • Avoid nonclustered, unfiltered indexes on columns that have few distinct values (low cardinality).

Effectiveness of Indexing

I'm going to quote another Stack Exchange post:

There are a few concepts and terms that are important to understand when dealing with indexes. Seeks, scans, and lookups are some of the ways that indexes will be utilized through select statements. Selectivity of key columns is integral to determining how effective an index can be.

A clustered index's key columns are called a clustering key. This is how records are made unique in the context of a clustered index. All nonclustered indexes will include the clustered key by default, in order to perform lookups when necessary. All indexes will be inserted to, updated to, or deleted from for every respective DML statement. That having been said, it is best to balance performance gains in select statements against performance hits in insert, delete, and update statements.

In order to determine how effective an index is, you must determine the selectivity of your index keys. Selectivity can be defined as a percentage of distinct records to total records. If I have a [person] table with 100 total records and the [first_name] column contains 90 distinct values, we can say that the [first_name] column is 90% selective. The higher the selectivity, the more efficient the index key. Keeping selectivity in mind, it is best to put your most selective columns first in your index key. Using my previous [person] example, what if we had a [last_name] column that was 95% selective? We would want to create an index with [last_name], [first_name] as the index key.
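To make the arithmetic in that quote concrete, here is a minimal sketch in Python (not T-SQL) that computes selectivity as distinct values divided by total values, then orders candidate key columns from most to least selective. The helper names `selectivity` and `order_by_selectivity` and the generated [person] data are hypothetical, made up only to mirror the 90% and 95% figures above.

```python
def selectivity(values):
    """Return distinct values / total values as a fraction between 0 and 1."""
    values = list(values)
    if not values:
        raise ValueError("selectivity is undefined for an empty column")
    return len(set(values)) / len(values)


def order_by_selectivity(columns):
    """Return column names sorted from most to least selective."""
    return sorted(columns, key=lambda name: selectivity(columns[name]), reverse=True)


# Hypothetical [person] table with 100 rows, built to match the quoted example.
first_name = [f"first_{i % 90}" for i in range(100)]  # 90 distinct values
last_name = [f"last_{i % 95}" for i in range(100)]    # 95 distinct values

person = {"first_name": first_name, "last_name": last_name}

first_name_selectivity = selectivity(first_name)
last_name_selectivity = selectivity(last_name)
key_order = order_by_selectivity(person)  # most selective column first
```

On a real table the same ratio would come from dividing COUNT(DISTINCT column) by COUNT(*). For the date column in the question, a handful of distinct values over 300,000 rows gives a ratio very close to zero.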

Personally

I index based on how the table is being used. That means taking a holistic view of all the queries that hit the table, as well as which queries are not taking advantage of indexes. As Jeremiah Peschka said in the same thread, if the percentage of missing indexes is high, then an index that matches how the table is being used is likely needed.

Key Takeaway

The SQL query engine loves highly selective key columns. Index based on how a table is being used. Keep It Simple, Stupid (KISS).