SQL Server – Why Column Order in Non-Clustered Index Matters

sql serversql server 2014

I am working on a query for a table that contains movie tickets. The database holds 380k rows. A rows represents a showing of the movie (which cinema, when, how many tickets and at what price among other things).

I need to compute a few totals for every row: Admissions Paid, Admissions Revenue, Admissions Free and Total Admissions.

For a given row Admissions Paid is the sum of all tickets for that movie up until that point where price>0. The other 3 columns are computed similarly.

I wrote a query and created an index:

 SELECT [ID]
      ,[cinema_name]
      ,[movie_title]
      ,[price]
      ,[quantity]
      ,[start_date_time]
      ,* --I need all the columns for reporting
     ,(select SUM(quantity) 
        from [movies] i
        where i.movie_title=o.movie_title
        and i.start_date_time<=o.start_date_time
        and price=0) as [Admissions Free]
        ,(select SUM(quantity) 
        from [movies] i
        where i.movie_title=o.movie_title
        and i.start_date_time<=o.start_date_time
        and price>0) as [Admissions Paid]
        ,(select SUM(quantity*price) 
        from [movies] i
        where i.movie_title=o.movie_title
        and i.start_date_time<=o.start_date_time
        and price>0) as [Admissions Revenue]
        ,(select SUM(quantity) 
        from [movies] i
        where i.movie_title=o.movie_title
        and i.start_date_time<=o.start_date_time) as [Total Admissions]
  FROM [movies] o

I created the following index which brought the query time down to 5 minutes:

CREATE NONCLUSTERED INDEX [startdatetime_movietitle_price] ON [dbo].[movies]
(
    [movie_title] ASC,
    [start_date_time] ASC,
    [price] DESC
)
INCLUDE (   [quantity]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO

But this index brought the query time down to 1:30:

CREATE NONCLUSTERED INDEX [startdatetime_movietitle_price] ON [dbo].[movies]
(
    [start_date_time] ASC,
    [movie_title] ASC,
    [price] DESC
)
INCLUDE (   [quantity]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO

So my question is: why? From my understanding, it makes more sense to first gather all the movie titles and then look at the start times because there are more start times then there are movies. Distinct movies: 51, distinct start_date_times: 8786

Doesn't the underling B-Tree not cut off more branches if it eliminates the unnecessary start_date_times first?

Here are the execution plans:

desc

enter image description here

The first picture shows the execution plan for the index with movie_title first, the other picture shows start_date_time first.

Best Answer

The first index does look like a better fit for the query. Please provide the actual execution plans.

I would try using window functions instead of the four correlated subqueries. Or a single correlated subquery (with OUTER APPLY) and see which of the two indexes is used.
Both ideas are to coerce the optimizer to use a single index scan to gather the rolling sums instead of 4 (that both your plans do).

It would also be worth checking and comparing the two execution plans, when asking for all the columns and when asking for only the columns in the index:

using window functions:

-- window functions
SELECT 
    -- m.*,
    movie_title, start_date_time,
    price, quantity,

    SUM(CASE WHEN price = 0 THEN quantity ELSE 0 END)
        OVER
        (PARTITION BY movie_title
         ORDER BY start_date_time
         RANGE BETWEEN UNBOUNDED PRECEDING
                   AND CURRENT ROW
       ) AS [Admissions Free],
    SUM(CASE WHEN price > 0 THEN quantity ELSE 0 END)
        OVER
        (PARTITION BY movie_title
         ORDER BY start_date_time
         RANGE BETWEEN UNBOUNDED PRECEDING
                   AND CURRENT ROW
       ) AS [Admissions Paid],
    SUM(CASE WHEN price > 0 THEN quantity * price ELSE 0 END)
        OVER
        (PARTITION BY movie_title
         ORDER BY start_date_time
         RANGE BETWEEN UNBOUNDED PRECEDING
                   AND CURRENT ROW
        ) AS [Admissions Revenue],
    SUM(quantity)
        OVER
        (PARTITION BY movie_title
         ORDER BY start_date_time
         RANGE BETWEEN UNBOUNDED PRECEDING
                   AND CURRENT ROW
        ) AS [Total Admissions]
FROM
    [movies] AS m ;

*: If there is a UNIQUE constraint on (movie_title, start_date_time), then you could use ROWS instead of RANGE for the window frames (it's usually more efficient). From the comments, there is no such constraint and there could be many rows with same title and datetime, so RANGE is required above.

using OUTER APPLY:

-- using OUTER APPLY
SELECT 
    -- m.*,
    m.movie_title, m.start_date_time,
    m.price, m.quantity,

    c.[Admissions Free],
    c.[Admissions Paid],
    c.[Admissions Revenue],
    c.[Total Admissions]
FROM
    [movies] AS m
    OUTER APPLY
    ( SELECT
          SUM(CASE WHEN i.price = 0 THEN i.quantity ELSE 0 END)
              AS [Admissions Free],
          SUM(CASE WHEN i.price > 0 THEN i.quantity ELSE 0 END)
              AS [Admissions Paid],
          SUM(CASE WHEN i.price > 0 THEN i.quantity * i.price ELSE 0 END)
              AS [Admissions Revenue],
          SUM(i.quantity)
              AS [Total Admissions]
      FROM [movies] AS i
      WHERE i.movie_title = o.movie_title
        AND i.start_date_time <= o.start_date_time
    ) AS c ;

This index may be a little better than the first one:

(
    movie_title ASC,
    start_date_time ASC
)
INCLUDE (price, quantity)