Sql-server – Field order in a composite index with high selectivity and low selectivity fields

index, nonclustered-index, sql-server

I have a SQL Server table with over 3 billion rows. One of my queries takes an extremely long time, so I am considering optimizing it. The query looks like this:

SELECT [Enroll_Date]
      ,Count(*) AS [Record #]
      ,Count(Distinct UserID) AS [User #]
  FROM UserTable
  GROUP BY [Enroll_Date]

[Enroll_Date] is a low-selectivity column with fewer than 50 possible values, while UserID is a high-selectivity column with more than 200 million distinct values. Based on my research, I believe I should create a non-clustered composite index on these two columns, and in theory the high-selectivity column should come first. But I am not sure that applies in my case, because the query groups by the low-selectivity column.

This table has no clustered index.

Best Answer

As an alternative to @AaronBertrand's solution (if you can't or don't want to create an indexed view), I would recommend creating an index on (Enroll_Date, UserID). If this type of query is very common on your table, this should probably even be your clustered index.

I would not recommend high-selectivity indexes as a blanket "best practice", but rather look at which index will give your query the best performance.

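An index on (Enroll_Date, UserID) will give your query a highly optimized, non-blocking query plan with Stream Aggregates. Because the index delivers rows already ordered by Enroll_Date and then by UserID, each date's rows arrive together and duplicate UserID values within a date are adjacent, so both counts can be computed in a single ordered pass.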

"Non-blocking" in this context means that the query doesn't need to buffer any significant amounts of data (like, for instance, a sort or hash aggregate would), which means it (a) starts returning rows immediately, and (b) consumes practically no working memory.
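A minimal sketch of the index described above, assuming the table lives in the dbo schema. The index name is hypothetical, and on a table of this size the build options (compression, online build, sort location) would have to be chosen for your edition and hardware:

```sql
-- Hypothetical index name; the dbo schema is an assumption.
-- Enroll_Date leads so that all rows for one date are contiguous in the index;
-- UserID comes second so that user IDs arrive sorted within each date.
CREATE NONCLUSTERED INDEX IX_UserTable_EnrollDate_UserID
    ON dbo.UserTable (Enroll_Date, UserID);

-- The original query, unchanged apart from the schema prefix.
-- Both referenced columns are in the index, so the heap need not be read.
SELECT [Enroll_Date]
      ,Count(*) AS [Record #]
      ,Count(Distinct UserID) AS [User #]
  FROM dbo.UserTable
  GROUP BY [Enroll_Date];
```

Looking at the actual execution plan for Stream Aggregate operators (rather than Hash Match or Sort) is one way to check that the index is being used as intended.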

Related Solutions

Sql-server – Index with multiple leaf levels

"My question is how and when does SQL Server create multiple leaf levels (index_level 0) for the same index."

The 2008 R2 documentation for sys.dm_db_index_physical_stats includes a link to Table and Index Organization, which shows the following diagram:

[Figure: allocation units diagram]

It describes the data that may be stored in each of the three possible allocation unit types:

[Figure: allocation unit type descriptions]

Your clustered index does contain three leaf levels, one per allocation unit type. For example:

CREATE TABLE dbo.Example
(
    example_id  integer PRIMARY KEY,
    lob_data    nvarchar(max) NULL,
    padding     varchar(8000) NULL,
    overflow    varchar(8000) NULL
);

INSERT dbo.Example
    (
    example_id,
    lob_data,
    padding,
    overflow
    )
VALUES
    (
    1,
    REPLICATE(CONVERT(nvarchar(max), N'X'), 8001),
    REPLICATE('Y', 4000),
    REPLICATE('Z', 6000)
    );

SELECT
    ddips.index_id,
    ddips.index_type_desc,
    ddips.alloc_unit_type_desc,
    ddips.index_level,
    ddips.avg_page_space_used_in_percent
FROM sys.dm_db_index_physical_stats
    (
        DB_ID(), 
        OBJECT_ID('dbo.Example'), 
        1, 
        1, 
        'DETAILED'
    ) AS ddips;

Output:

╔══════════╦═════════════════╦══════════════════════╦═════════════╦════════════════════════════════╗
║ index_id ║ index_type_desc ║ alloc_unit_type_desc ║ index_level ║ avg_page_space_used_in_percent ║
╠══════════╬═════════════════╬══════════════════════╬═════════════╬════════════════════════════════╣
║        1 ║ CLUSTERED INDEX ║ IN_ROW_DATA          ║           0 ║ 50.3953545836422               ║
║        1 ║ CLUSTERED INDEX ║ ROW_OVERFLOW_DATA    ║           0 ║ 74.3019520632567               ║
║        1 ║ CLUSTERED INDEX ║ LOB_DATA             ║           0 ║ 99.0239683716333               ║
╚══════════╩═════════════════╩══════════════════════╩═════════════╩════════════════════════════════╝

Your table contains large object (LOB) columns (MAX or old-style text, ntext or image types) and variable-length column definitions which allow individual rows to exceed the 8060-byte in-row limit.

For rows that exceed 8060 bytes, ROW_OVERFLOW_DATA allocation units will be created. This is often problematic for performance, since row data access requires following an off-page pointer to retrieve the overflowed data.

I would certainly look at the design of the table before worrying too much about how full the pages are on average. Whether you should be concerned about page fullness depends on which allocation unit it refers to.

Sql-server – How to improve performance if index is not Most SELECTIVE

Without having much detail, I can't recommend much.

One thing that does jump out at me is that you can very likely improve performance on the table by normalizing it. The presence of so many duplicated (so few unique) values in the columns you listed suggests that other columns in the table may not be normalized either. I suggest making the Name column an int (or even smallint) with a foreign key to a Names table, and the Current_state column a bit (or alternatively a tinyint) with a foreign key to a WhateverStates table. You would, of course, have to change your data access code to deal with this indirection, but that is nothing more than the basic job relational database developers have always had to do.

Normalizing will reduce the number of bytes per row, increasing the number of rows per page, reducing the number of pages that have to be read to satisfy any particular query, helping performance across the board! Right now the columns given require likely close to 34 bytes each. After the change I suggest, those columns will only require 11 bytes each. Of course, I haven't seen your whole table--your rows may be so big that it doesn't matter.
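A rough sketch of the normalization suggested above. The table names Names, WhateverStates and YourTable come from this answer; the key column names, the varchar lengths and the dbo schema are assumptions, since the real schema was not posted:

```sql
-- Lookup tables; column names and varchar lengths are assumptions.
CREATE TABLE dbo.Names
(
    NameID  int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    Name    varchar(30) NOT NULL UNIQUE
);

CREATE TABLE dbo.WhateverStates
(
    StateID     tinyint NOT NULL PRIMARY KEY,
    StateName   varchar(20) NOT NULL UNIQUE
);

-- Add the narrow surrogate keys to the wide table (hypothetical column names).
-- They are nullable here so existing rows can be backfilled afterwards.
ALTER TABLE dbo.YourTable ADD
    NameID  int NULL
        CONSTRAINT FK_YourTable_Names
        REFERENCES dbo.Names (NameID),
    StateID tinyint NULL
        CONSTRAINT FK_YourTable_WhateverStates
        REFERENCES dbo.WhateverStates (StateID);
```

After backfilling the new key columns from the existing text values, the original Name and Current_state columns could be dropped, which is where the per-row space saving would come from.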

What columns and datatypes are in the clustered index (if there is one)? This can radically affect the size of the nonclustered indexes, again affecting performance in exactly the way I described (rows per page).

When you query on non-selective columns such as Current_state, what other columns are always or almost always included? It may be fine to have a non-selective column in an index if the index also contains a more selective column (or one that, in conjunction with the less selective column, is highly selective). If, on the other hand, you often query for rows based on the single predicate Current_state = 'Pending', then you can add a filtered index:

CREATE INDEX IX_YourTable_Pending ON dbo.YourTable (ClusteredColumnsInOrder)
WHERE Current_State = 'Pending'; --SQL 2008 and up only

This technique could help even when you also filter on other columns: you would put those in the index key instead of the ClusteredColumnsInOrder columns I suggested (which was just a trick to avoid adding any extra columns to the nonclustered index, since nonclustered indexes always implicitly contain all the columns of the clustered index). Or, if you only pull a very few other columns, you can make your nonclustered index cover the query by adding INCLUDE (AdditionalColumn1, AdditionalColumn2) so that the query engine doesn't have to go back to the clustered index to satisfy the query.
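As an illustration of the covering variant just described, here is a sketch that combines a filter, a key column and included columns. SelectiveColumn and the index name are hypothetical, and AdditionalColumn1 and AdditionalColumn2 are the placeholder names used above:

```sql
-- Hypothetical filtered, covering index.
-- SelectiveColumn stands in for whatever column is usually queried
-- together with Current_State = 'Pending'.
CREATE NONCLUSTERED INDEX IX_YourTable_Pending_Covering
    ON dbo.YourTable (SelectiveColumn)
    INCLUDE (AdditionalColumn1, AdditionalColumn2)
    WHERE Current_State = 'Pending';

-- A query of this shape references only columns present in the index,
-- and its predicate matches the index filter.
SELECT SelectiveColumn, AdditionalColumn1, AdditionalColumn2
FROM dbo.YourTable
WHERE Current_State = 'Pending'
ORDER BY SelectiveColumn;
```

Writing the filter value as a literal in the query, as here, makes it easier for the optimizer to match the query to the filtered index than passing it as a parameter.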

You haven't provided very much information such as full table schema, sample data, and sample queries, and without those it's going to be pretty hard to give you very specific advice about what to do.

One thing I can say, though, is that indiscriminately throwing indexes at the table may not improve things much and could in fact hurt performance of your system overall.

If the hints I have given you here don't seem to help much, then I recommend that you do come back with some of the additional info I mentioned so that we can do a better job of assisting you.
