I am working on a query for a table that contains movie tickets. The database holds 380k rows. A rows represents a showing of the movie (which cinema, when, how many tickets and at what price among other things).
I need to compute a few totals for every row: Admissions Paid
, Admissions Revenue
, Admissions Free
and Total Admissions
.
For a given row Admissions Paid
is the sum of all tickets for that movie up until that point where price>0
. The other 3 columns are computed similarly.
I wrote a query and created an index:
SELECT [ID]
,[cinema_name]
,[movie_title]
,[price]
,[quantity]
,[start_date_time]
,* --I need all the columns for reporting
,(select SUM(quantity)
from [movies] i
where i.movie_title=o.movie_title
and i.start_date_time<=o.start_date_time
and price=0) as [Admissions Free]
,(select SUM(quantity)
from [movies] i
where i.movie_title=o.movie_title
and i.start_date_time<=o.start_date_time
and price>0) as [Admissions Paid]
,(select SUM(quantity*price)
from [movies] i
where i.movie_title=o.movie_title
and i.start_date_time<=o.start_date_time
and price>0) as [Admissions Revenue]
,(select SUM(quantity)
from [movies] i
where i.movie_title=o.movie_title
and i.start_date_time<=o.start_date_time) as [Total Admissions]
FROM [movies] o
I created the following index which brought the query time down to 5 minutes:
CREATE NONCLUSTERED INDEX [startdatetime_movietitle_price] ON [dbo].[movies]
(
[movie_title] ASC,
[start_date_time] ASC,
[price] DESC
)
INCLUDE ( [quantity]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO
But this index brought the query time down to 1:30:
CREATE NONCLUSTERED INDEX [startdatetime_movietitle_price] ON [dbo].[movies]
(
[start_date_time] ASC,
[movie_title] ASC,
[price] DESC
)
INCLUDE ( [quantity]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO
So my question is: why? From my understanding, it makes more sense to first gather all the movie titles and then look at the start times because there are more start times then there are movies. Distinct movies
: 51, distinct start_date_times
: 8786
Doesn't the underling B-Tree not cut off more branches if it eliminates the unnecessary start_date_times
first?
Here are the execution plans:
The first picture shows the execution plan for the index with movie_title
first, the other picture shows start_date_time
first.
Best Answer
The first index does look like a better fit for the query. Please provide the actual execution plans.
I would try using window functions instead of the four correlated subqueries. Or a single correlated subquery (with
OUTER APPLY
) and see which of the two indexes is used.Both ideas are to coerce the optimizer to use a single index scan to gather the rolling sums instead of 4 (that both your plans do).
It would also be worth checking and comparing the two execution plans, when asking for all the columns and when asking for only the columns in the index:
using window functions:
*: If there is a
UNIQUE
constraint on(movie_title, start_date_time)
, then you could useROWS
instead ofRANGE
for the window frames (it's usually more efficient). From the comments, there is no such constraint and there could be many rows with same title and datetime, soRANGE
is required above.using
OUTER APPLY
:This index may be a little better than the first one: