SQL Server – Why Column Order in Non-Clustered Index Matters

sql serversql server 2014

I am working on a query for a table that contains movie tickets. The database holds 380k rows. A rows represents a showing of the movie (which cinema, when, how many tickets and at what price among other things).

I need to compute a few totals for every row: Admissions Paid, Admissions Revenue, Admissions Free and Total Admissions.

For a given row Admissions Paid is the sum of all tickets for that movie up until that point where price>0. The other 3 columns are computed similarly.

I wrote a query and created an index:

 SELECT [ID]
      ,[cinema_name]
      ,[movie_title]
      ,[price]
      ,[quantity]
      ,[start_date_time]
      ,* --I need all the columns for reporting
     ,(select SUM(quantity) 
        from [movies] i
        where i.movie_title=o.movie_title
        and i.start_date_time<=o.start_date_time
        and price=0) as [Admissions Free]
        ,(select SUM(quantity) 
        from [movies] i
        where i.movie_title=o.movie_title
        and i.start_date_time<=o.start_date_time
        and price>0) as [Admissions Paid]
        ,(select SUM(quantity*price) 
        from [movies] i
        where i.movie_title=o.movie_title
        and i.start_date_time<=o.start_date_time
        and price>0) as [Admissions Revenue]
        ,(select SUM(quantity) 
        from [movies] i
        where i.movie_title=o.movie_title
        and i.start_date_time<=o.start_date_time) as [Total Admissions]
  FROM [movies] o

I created the following index which brought the query time down to 5 minutes:

CREATE NONCLUSTERED INDEX [startdatetime_movietitle_price] ON [dbo].[movies]
(
    [movie_title] ASC,
    [start_date_time] ASC,
    [price] DESC
)
INCLUDE (   [quantity]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO

But this index brought the query time down to 1:30:

CREATE NONCLUSTERED INDEX [startdatetime_movietitle_price] ON [dbo].[movies]
(
    [start_date_time] ASC,
    [movie_title] ASC,
    [price] DESC
)
INCLUDE (   [quantity]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO

So my question is: why? From my understanding, it makes more sense to first gather all the movie titles and then look at the start times because there are more start times then there are movies. Distinct movies: 51, distinct start_date_times: 8786

Doesn't the underling B-Tree not cut off more branches if it eliminates the unnecessary start_date_times first?

Here are the execution plans:

The first picture shows the execution plan for the index with movie_title first, the other picture shows start_date_time first.

Best Answer

The first index does look like a better fit for the query. Please provide the actual execution plans.

I would try using window functions instead of the four correlated subqueries. Or a single correlated subquery (with OUTER APPLY) and see which of the two indexes is used.
Both ideas are to coerce the optimizer to use a single index scan to gather the rolling sums instead of 4 (that both your plans do).

It would also be worth checking and comparing the two execution plans, when asking for all the columns and when asking for only the columns in the index:

using window functions:

-- window functions
SELECT 
    -- m.*,
    movie_title, start_date_time,
    price, quantity,

    SUM(CASE WHEN price = 0 THEN quantity ELSE 0 END)
        OVER
        (PARTITION BY movie_title
         ORDER BY start_date_time
         RANGE BETWEEN UNBOUNDED PRECEDING
                   AND CURRENT ROW
       ) AS [Admissions Free],
    SUM(CASE WHEN price > 0 THEN quantity ELSE 0 END)
        OVER
        (PARTITION BY movie_title
         ORDER BY start_date_time
         RANGE BETWEEN UNBOUNDED PRECEDING
                   AND CURRENT ROW
       ) AS [Admissions Paid],
    SUM(CASE WHEN price > 0 THEN quantity * price ELSE 0 END)
        OVER
        (PARTITION BY movie_title
         ORDER BY start_date_time
         RANGE BETWEEN UNBOUNDED PRECEDING
                   AND CURRENT ROW
        ) AS [Admissions Revenue],
    SUM(quantity)
        OVER
        (PARTITION BY movie_title
         ORDER BY start_date_time
         RANGE BETWEEN UNBOUNDED PRECEDING
                   AND CURRENT ROW
        ) AS [Total Admissions]
FROM
    [movies] AS m ;

*: If there is a UNIQUE constraint on (movie_title, start_date_time), then you could use ROWS instead of RANGE for the window frames (it's usually more efficient). From the comments, there is no such constraint and there could be many rows with same title and datetime, so RANGE is required above.

using OUTER APPLY:

-- using OUTER APPLY
SELECT 
    -- m.*,
    m.movie_title, m.start_date_time,
    m.price, m.quantity,

    c.[Admissions Free],
    c.[Admissions Paid],
    c.[Admissions Revenue],
    c.[Total Admissions]
FROM
    [movies] AS m
    OUTER APPLY
    ( SELECT
          SUM(CASE WHEN i.price = 0 THEN i.quantity ELSE 0 END)
              AS [Admissions Free],
          SUM(CASE WHEN i.price > 0 THEN i.quantity ELSE 0 END)
              AS [Admissions Paid],
          SUM(CASE WHEN i.price > 0 THEN i.quantity * i.price ELSE 0 END)
              AS [Admissions Revenue],
          SUM(i.quantity)
              AS [Total Admissions]
      FROM [movies] AS i
      WHERE i.movie_title = o.movie_title
        AND i.start_date_time <= o.start_date_time
    ) AS c ;

This index may be a little better than the first one:

(
    movie_title ASC,
    start_date_time ASC
)
INCLUDE (price, quantity)

Related Solutions

Sql-server – Clustered index always better than Non-Clustered index

I just want to put in a word of warning: please very carefully pick your clustered index!

It's the most replicated data structure in your SQL Server database (assuming it's SQL Server you're talking about). The clustering key will be part of each and every non-clustered index on your table, too - certainly in the leaf level, possibly also in the index navigation structure.

You should use extreme care when picking a clustering key - it should be:

narrow (4 bytes ideal)
unique (it's the "row pointer" after all - if you don't make it unique, SQL Server will - for you - in the background - costs your a couple of bytes for each entry - times the number of rows and the number of nonclustered indices you have - can be very costly!)
static (never change - if possible)
ideally ever-increasing so you won't end up with horrible index fragmentation (a GUID is the total opposite of a good clustering key - for that particular reason)
it should be non-nullable and ideally also fixed with - a varchar(250) makes a very poor clustering key

Anything else should really be second and third level of importance behind these points ....

See some of Kimberly Tripp's (The Queen of Indexing) blog posts on the topic - anything she has written in her blog is absolutely invaluable - read it, digest it - live by it!

SQL Server Hierarchical Order – Parent-Child Tree Hierarchical ORDER BY

OK, enough brain cells are dead.

SQL Fiddle

WITH cte AS
(
  SELECT 
    [ICFilterID], 
    [ParentID],
    [FilterDesc],
    [Active],
    CAST(0 AS varbinary(max)) AS Level
  FROM [dbo].[ICFilters]
  WHERE [ParentID] = 0
  UNION ALL
  SELECT 
    i.[ICFilterID], 
    i.[ParentID],
    i.[FilterDesc],
    i.[Active],  
    Level + CAST(i.[ICFilterID] AS varbinary(max)) AS Level
  FROM [dbo].[ICFilters] i
  INNER JOIN cte c
    ON c.[ICFilterID] = i.[ParentID]
)

SELECT 
  [ICFilterID], 
  [ParentID],
  [FilterDesc],
  [Active]
FROM cte
ORDER BY [Level];

Best Answer

Related Solutions

Sql-server – Clustered index always better than Non-Clustered index

SQL Server Hierarchical Order – Parent-Child Tree Hierarchical ORDER BY

Related Question