Sql-server – How Selective Should Index Be

indexperformanceperformance-tuningsql server

Is there a general Selectivity rule of when to apply Nonclustered index?

We know not to create an index on a bit column, 50/50. "Rows with 50/50 distribution, it might buy you very little performance gain " Index Bit Field in SQL Server

So how selective should a query be in SQL Server, before an index should be applied? Is there a general rule in SQL Server guidelines? At 25% average selectivity distribution in a column? 10% Selectivity?

This article is stating around 31%? How Selective Should Index Be?

Best Answer

Considering column selectivity only when deciding which columns to index ignores quite a bit of what indexes can do, and what they're generally good for.

For instance, you may have an identity or guid column that's incredibly selective -- unique, even -- but never gets used. In that case, who cares? Why index columns that queries don't touch?

Much less selective indexes, even BIT columns, can make useful, or useful parts of indexes. In some scenarios, very un-selective columns on large tables can benefit quite a bit from indexing when they need to be sorted on, or grouped by.

Joins

Take this query:

SELECT COUNT(*) AS records
FROM dbo.Users AS u
JOIN dbo.Posts AS p
ON u.Id = p.OwnerUserId;

Without a helpful index on OwnerUserId, this is our plan with a Hash Join -- which spills -- but that's secondary to the point.

With a helpful index -- CREATE INDEX ix_yourmom ON dbo.Posts (OwnerUserId); -- our plan changes.

Aggregates

Likewise, grouping operations can benefit from indexing.

SELECT   p.OwnerUserId, COUNT(*) AS records
FROM     dbo.Posts AS p
GROUP BY p.OwnerUserId;

Without an index:

With an index:

Sorts

Sorting data can be another sticking point in queries that indexes can help.

Without an index:

With our index:

Blocking

Indexes can also help avoid blocking pile-ups.

If we try to run this update:

UPDATE p
SET p.Score += 100
FROM dbo.Posts AS p
WHERE p.OwnerUserId = 22656;

And concurrently run this select:

SELECT *
FROM   dbo.Posts AS p
WHERE  p.OwnerUserId = 8;

They'll end up blocking:

With our index in place, the select finishes instantly without being blocked. SQL Server has a way to access the data it needs efficiently.

In case you're wondering (using the equation Kumar provided) the OwnerUserId column's selectivity is 0.0701539878296839478

Wrap it up

Don't just blindly index columns based on how selective they are. Design indexes that help your workload run efficiently. Using more selective columns as leading key columns is generally a good idea when you're searching for equality predicates, but can be less helpful when searching on ranges.

Related Solutions

Sql-server – Why is the secondary selective index not used when the where clause filters on `value()`

The declaration of singleton in the path expression of the index enforces that you can not add multiple <Number> elements but the XQuery compiler does not take that into consideration when interpreting the expression in the value() function. You have to specify [1] to make SQL Server happy. Using typed XML with a schema does not help with that either. And because of that SQL Server builds a query that uses something that could be called an "apply" pattern.

Easiest to demonstrate is to use regular tables instead of XML simulating the query we are actually executing against T and the internal table.

Here is the setup for the internal table as a real table.

create table dbo.xml_sxi_table
(
  pk1 int not null,
  row_id int,
  path_1_id varbinary(900),
  pathSQL_1_sql_value int,
  pathXQUERY_2_value float
);

go

create clustered index SIX_T on xml_sxi_table(pk1, row_id);
create nonclustered index SIX_pathSQL on xml_sxi_table(pathSQL_1_sql_value) where path_1_id is not null;
create nonclustered index SIX_T_pathXQUERY on xml_sxi_table(pathXQUERY_2_value) where path_1_id is not null;

go

insert into dbo.xml_sxi_table(pk1, row_id, path_1_id, pathSQL_1_sql_value, pathXQUERY_2_value)
select T.ID, 1, T.ID, T.ID, T.ID
from dbo.T;

With both tables in place you can execute the equivalent of the exist() query.

select count(*)
from dbo.T
where exists (
             select *
             from dbo.xml_sxi_table as S
             where S.pk1 = T.ID and
                   S.pathXQUERY_2_value = 314 and
                   S.path_1_id is not null
             );

The equivalent of the value() query would look like this.

select count(*)
from dbo.T
where (
      select top(1) S.pathSQL_1_sql_value
      from dbo.xml_sxi_table as S
      where S.pk1 = T.ID and
            S.path_1_id is not null
      order by S.path_1_id
      ) = 314;

The top(1) and order by S.path_1_id is the culprit and it is [1] in the Xpath expression that is to blame.

I don't think it is possible for Microsoft to fix this with the current structure of the internal table even if you were allowed to leave out the [1] from the values() function. They would probably have to create multiple internal tables for each path expression with unique constraints in place to guarantee for the optimizer that there can only be one <number> element for each row. Not sure that would actually be enough for the optimizer to "break out of the apply pattern".

For you who think this fun and interesting and since you are still reading this you probably are.

Some queries to look at the structure of the internal table.

select T.name, 
       T.internal_type_desc, 
       object_name(T.parent_id) as parent_table_name
from sys.internal_tables as T
where T.parent_id = object_id('T');

select C.name as column_name, 
       C.column_id,
       T.name as type_name,
       C.max_length,
       C.is_sparse,
       C.is_nullable
from sys.columns as C
  inner join sys.types as T
    on C.user_type_id = T.user_type_id
where C.object_id in (
                     select T.object_id 
                     from sys.internal_tables as T 
                     where T.parent_id = object_id('T')
                     )
order by C.column_id;

select I.name as index_name,
       I.type_desc,
       I.is_unique,
       I.filter_definition,
       IC.key_ordinal,
       C.name as column_name, 
       C.column_id,
       T.name as type_name,
       C.max_length,
       I.is_unique,
       I.is_unique_constraint
from sys.indexes as I
  inner join sys.index_columns as IC
    on I.object_id = IC.object_id and
       I.index_id = IC.index_id
  inner join sys.columns as C
    on IC.column_id = C.column_id and
       IC.object_id = C.object_id
  inner join sys.types as T
    on C.user_type_id = T.user_type_id
where I.object_id in (
                     select T.object_id 
                     from sys.internal_tables as T 
                     where T.parent_id = object_id('T')
                     );

Sql-server – Will a nonclustered index with a unique column always address all queries filtering that column first

Yes, given the constraints in the question, particularly that the primary key column is the leading column in the indexes. Also assuming the primary key never changes.
Not necessarily.

The optimizer can indeed infer uniqueness without marking the nonclustered index unique.

Marking the index unique may introduce a Split-Sort-Collapse combination in execution plans that change an index key. The extra Sort in particular has the potential to be performance-affecting.

On the other hand, not marking the index unique risks data integrity if the primary key is ever changed.

Example

CREATE TABLE dbo.Test
(
    pk integer PRIMARY KEY NONCLUSTERED,
    c1 integer NOT NULL,
    c2 integer NOT NULL
);

-- Not unique on pk, c1
CREATE NONCLUSTERED INDEX ic1 ON dbo.Test (pk, c1);

-- Unique on pk, c2
CREATE UNIQUE NONCLUSTERED INDEX ic2 ON dbo.Test (pk, c2);

Uniqueness

-- Neither plan has an aggregate
SELECT DISTINCT T.c1 FROM dbo.Test AS T WHERE T.pk = 1;
SELECT DISTINCT T.c2 FROM dbo.Test AS T WHERE T.pk = 1;

Split, Sort, Collapse

-- No split, sort, collapse
UPDATE dbo.Test SET c1 = CHECKSUM(NEWID());

-- Split, sort, collapse updating unique key
UPDATE dbo.Test SET c2 = CHECKSUM(NEWID());

Note the split-sort-collapse plan is also a wide (per-index) update.

Uniqueness is a huge topic though. I would normally mark something that is unique as unique, unless there is a good reason not to. Some further reading from my blog:

To anticipate comments about heap tables: Most tables benefit from being clustered. You need good reasons to choose a heap structure, especially from a space management point of view, if the table ever experiences deletes. Updates can also introduce performance impacts if columns expand beyond the space available on the original page (forwarded records).