Sql-server – When can SARGable predicates be pushed into a CTE or derived table

optimization, sql server

Sandbag

While working on Top Quality Blog Posts®, I came across some optimizer behavior I found really ~~infuriating~~ interesting. I don't immediately have an explanation, at least not one I'm happy with, so I'm putting it here in case someone smart shows up.

If you want to follow along, you can grab the 2013 version of the Stack Overflow data dump here. I'm using the Comments table, with one additional index on it.

CREATE INDEX [ix_ennui] ON [dbo].[Comments] ( [UserId], [Score] DESC );

Query One

When I query the table like so, I get an odd query plan.

WITH x
    AS
     (
         SELECT   TOP 101
                  c.UserId, c.Text, c.Score
         FROM     dbo.Comments AS c
         ORDER BY c.Score DESC
     )
SELECT *
FROM   x
WHERE  x.Score >= 500;

The SARGable predicate on Score isn't pushed inside the CTE. It's in a filter operator much later in the plan.

Which I find odd, since the ORDER BY is on the same column as the filter.

Query Two

If I change the query, it does get pushed.

WITH x
    AS
     (
         SELECT   c.UserId, c.Text, c.Score
         FROM     dbo.Comments AS c
     )
SELECT TOP 101 *
FROM   x
WHERE  x.Score >= 500
ORDER BY x.Score DESC;

The query plan changes, too, and the query runs much faster, with no spill to disk. Both queries return the same results, but this time the predicate is applied at the nonclustered index scan.

Query Three

This is the equivalent of writing the query like so:

SELECT   TOP 101
         c.UserId, c.Text, c.Score
FROM     dbo.Comments AS c
WHERE c.Score >= 500
ORDER BY c.Score DESC;

Query Four

Using a derived table gets the same "bad" query plan as the initial CTE query:

SELECT *
FROM   (   SELECT   TOP 101
                    c.UserId, c.Text, c.Score
           FROM     dbo.Comments AS c
           ORDER BY c.Score DESC ) AS x
WHERE x.Score >= 500;

Things get even weirder when…

I change the query to order the data ascending, and the filter to <=.

To keep from making this question overlong, I'm going to put everything together.

Queries

--Derived table
SELECT *
FROM   (   SELECT   TOP 101
                    c.UserId, c.Text, c.Score
           FROM     dbo.Comments AS c
           ORDER BY c.Score ASC ) AS x
WHERE x.Score <= 500;


--TOP inside CTE
WITH x
    AS
     (
         SELECT   TOP 101
                  c.UserId, c.Text, c.Score
         FROM     dbo.Comments AS c
         ORDER BY c.Score ASC
     )
SELECT *
FROM   x
WHERE  x.Score <= 500;


--Written normally
SELECT   TOP 101
         c.UserId, c.Text, c.Score
FROM     dbo.Comments AS c
WHERE c.Score <= 500
ORDER BY c.Score ASC;

--TOP outside CTE
WITH x
    AS
     (
         SELECT   c.UserId, c.Text, c.Score
         FROM     dbo.Comments AS c
     )
SELECT TOP 101 *
FROM   x
WHERE  x.Score <= 500
ORDER BY x.Score ASC;

Plans

Plan link.

Note that none of these queries take advantage of the nonclustered index — the only thing that changes here is the position of the filter operator. In no case is the predicate pushed to the index access.

A Question Appears!

Is there a reason that a SARGable predicate can be pushed in some scenarios and not in others? The differences within the queries sorted in descending order are interesting, but the differences between those and the ones that are ascending are bizarre.

For anyone interested, here are the plans with only an index on Score:

DESC
ASC

Best Answer

There are a few issues in play here.

Pushing predicates past `TOP`

The optimizer cannot currently push a predicate past a TOP, even in the limited cases where it would be safe to do so*. This limitation accounts for the behaviour of all the queries in the question where the predicate is in a higher scope than the TOP.

The workaround is to perform the rewrite manually. The fundamental issue is similar to the case of pushing predicates past a window function, except there is no corresponding specialized rule like SelOnSeqPrj.
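As a rough sketch of that manual rewrite, applied to the first CTE query from the question: the predicate moves inside the CTE, next to the TOP. This assumes the same dbo.Comments table from the question, and it only preserves the results because the filter is on the ORDER BY column and agrees with the sort direction.

```sql
-- Sketch only: assumes the dbo.Comments table from the question.
-- The Score predicate is written inside the CTE by hand,
-- because the optimizer will not push it below the TOP itself.
WITH x
    AS
     (
         SELECT   TOP (101)
                  c.UserId, c.[Text], c.Score
         FROM     dbo.Comments AS c
         WHERE    c.Score >= 500
         ORDER BY c.Score DESC
     )
SELECT x.UserId, x.[Text], x.Score
FROM   x;
```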

My personal opinion is that an exploration rule like SelOnTop remains unimplemented because people have deliberately written queries with TOP in an effort to provide a kind of 'optimization fence'.

* Generally this means the predicate should appear in the ORDER BY clause associated with the TOP, and the direction of any inequality should agree with the direction of the sorting. The transformation would also need to account for the sorting behaviour of NULLs in SQL Server. Overall, the limitations probably mean this transformation would not be generally useful enough in practice to justify the additional exploration efforts.
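To illustrate the NULL caveat, here is a small self-contained sketch. The table variable @Scores and its rows are hypothetical sample data, not part of the Stack Overflow database. SQL Server sorts NULLs first in ascending order, so filtering above the TOP and filtering below it can return different rows.

```sql
-- Hypothetical sample data, used only to illustrate the NULL issue.
DECLARE @Scores table
(
    Id    int NOT NULL PRIMARY KEY,
    Score int NULL
);

INSERT @Scores (Id, Score)
VALUES (1, NULL), (2, NULL), (3, 1), (4, 2), (5, 3);

-- Predicate above the TOP: the TOP picks the two NULL rows plus Score = 1,
-- then the filter discards the NULLs, leaving a single row (Id 3).
SELECT x.Id, x.Score
FROM
(
    SELECT TOP (3)
        s.Id, s.Score
    FROM @Scores AS s
    ORDER BY s.Score ASC
) AS x
WHERE x.Score <= 500;

-- Predicate below the TOP: the NULLs are removed first,
-- so the TOP returns three rows (Ids 3, 4 and 5).
SELECT TOP (3)
    s.Id, s.Score
FROM @Scores AS s
WHERE s.Score <= 500
ORDER BY s.Score ASC;
```

Because the two forms disagree as soon as the sort column contains NULLs, a general rule that pushes the predicate would need extra conditions such as a NOT NULL column.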

Costing issues

The remaining execution plans in the question can be explained as cost-based choices due to the distribution of values in the Score column (many more rows <= 500 than >= 500), and the effect of the row goal introduced by the TOP.

For example, the query:

--Written normally
SELECT TOP (101)
    c.UserId, 
    c.[Text],
    c.Score
FROM dbo.Comments AS c
WHERE
    c.Score <= 500
ORDER BY
    c.Score ASC;

...produces a plan with an apparently unpushed predicate in a Filter:

Note that the Sort is estimated to produce 101 rows. This is the effect of the row goal added by the Top. This affects the estimated cost of the Sort and the Filter enough to make it seem like this is the cheaper option. The estimated cost of this plan is 2401.39 units.

If we disable row goals with a query hint:

--Written normally
SELECT TOP (101)
    c.UserId, 
    c.[Text],
    c.Score
FROM dbo.Comments AS c
WHERE
    c.Score <= 500
ORDER BY
    c.Score ASC
OPTION (USE HINT ('DISABLE_OPTIMIZER_ROWGOAL'));

...the execution plan produced is:

The predicate has been pushed into the scan as a residual non-sargable predicate, and the cost of the whole plan is 2402.32 units.

Notice that the <= 500 predicate is not expected to filter out any rows. If you had chosen a smaller number, like <= 50, the optimizer would have preferred the pushed-predicate plan regardless of the row goal effect.
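One way to check the selectivity of the two predicates against your own copy of the data is to count the rows on each side of the threshold. This is only a sketch, assuming the dbo.Comments table from the question. The column aliases are made up for illustration.

```sql
-- Sketch: compare how many rows each predicate qualifies.
-- COUNT_BIG ignores the NULLs produced when the CASE has no match.
SELECT
    COUNT_BIG(CASE WHEN c.Score <= 500 THEN 1 END) AS RowsAtOrBelow500,
    COUNT_BIG(CASE WHEN c.Score <= 50  THEN 1 END) AS RowsAtOrBelow50,
    COUNT_BIG(CASE WHEN c.Score >= 500 THEN 1 END) AS RowsAtOrAbove500,
    COUNT_BIG(*)                                   AS TotalRows
FROM dbo.Comments AS c;
```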

For the query with Score DESC and a Score >= 500 predicate:

--Written normally
SELECT TOP (101)
    c.UserId, 
    c.[Text],
    c.Score
FROM dbo.Comments AS c
WHERE
    c.Score >= 500
ORDER BY
    c.Score DESC;

Now the predicate is expected to be very selective, so the optimizer chooses to push the predicate and use the nonclustered index with lookups:

Again, the optimizer considered multiple alternatives and chose this as the apparently cheapest option, as usual.

Related Solutions

Sql-server – Cumulative Game Score SQL

Okay, so here is the query modified to work the way you want:

DECLARE @players table
(
    PlayerID uniqueidentifier NOT NULL PRIMARY KEY,
    PlayerName nvarchar(64) NOT NULL
);

DECLARE @playerScores table
(
    ID bigint NOT NULL IDENTITY PRIMARY KEY,
    PlayerID uniqueidentifier NOT NULL,
    DateCreated datetime NOT NULL,
    Score int NOT NULL,
    TimeTaken bigint NOT NULL,
    PuzzleID int NOT NULL
);

DECLARE @puzzleId int = 0;

SELECT TOP 50
    a.PlayerID,
    p.PlayerName,
    a.Score,
    a.TimeTaken,
    a.PlayedDate
    FROM
    (
        SELECT
            ps.PlayerID,
            ps.Score,
            ps.TimeTaken,
            ps.DateCreated AS PlayedDate,
            ROW_NUMBER()
                OVER
                (
                    PARTITION BY ps.PlayerID
                    ORDER BY ps.Score DESC, ps.TimeTaken, ps.DateCreated
                ) AS RN
            FROM @playerScores ps
            WHERE ps.PuzzleID = @puzzleId
    ) a
    INNER JOIN @players p ON p.PlayerID = a.PlayerID
    WHERE a.RN = 1
    ORDER BY
        a.Score DESC,
        a.TimeTaken,
        a.PlayedDate;

Having written this (note: indexes are not optimized), and looking at the other queries you're going to need to write, what I would actually recommend is to abandon this type of query entirely, and create a denormalized high-score table (rows are unique on the combination of PlayerID, PuzzleID), on which to run aggregates instead.

The reason is that the GameResult table is going to grow huge in the database, so running aggregates on it directly will become less and less efficient over time, and the requirements are incompatible with something like an indexed view to summarize the information.
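A minimal sketch of what such a denormalized high-score table might look like follows. The table name #PlayerHighScore, the sample values, and the "higher score, then shorter time" tie-break are assumptions for illustration (the tie-break mirrors the ORDER BY in the query above). A real implementation would also need to handle concurrent writers, which this sketch ignores.

```sql
-- Hypothetical high-score table: one row per (PuzzleID, PlayerID).
CREATE TABLE #PlayerHighScore
(
    PlayerID   uniqueidentifier NOT NULL,
    PuzzleID   int              NOT NULL,
    Score      int              NOT NULL,
    TimeTaken  bigint           NOT NULL,
    PlayedDate datetime         NOT NULL,
    PRIMARY KEY (PuzzleID, PlayerID)
);

-- Sample values standing in for a newly recorded game result.
DECLARE @PlayerID   uniqueidentifier = NEWID(),
        @PuzzleID   int      = 0,
        @Score      int      = 120,
        @TimeTaken  bigint   = 45000,
        @PlayedDate datetime = GETUTCDATE();

-- Overwrite the stored row only when the new result is better.
UPDATE hs
SET    Score = @Score, TimeTaken = @TimeTaken, PlayedDate = @PlayedDate
FROM   #PlayerHighScore AS hs
WHERE  hs.PlayerID = @PlayerID
AND    hs.PuzzleID = @PuzzleID
AND    (@Score > hs.Score OR (@Score = hs.Score AND @TimeTaken < hs.TimeTaken));

-- No row updated: insert, but only if the player has no row for this puzzle yet.
IF @@ROWCOUNT = 0
    INSERT #PlayerHighScore (PlayerID, PuzzleID, Score, TimeTaken, PlayedDate)
    SELECT @PlayerID, @PuzzleID, @Score, @TimeTaken, @PlayedDate
    WHERE  NOT EXISTS
    (
        SELECT 1
        FROM   #PlayerHighScore AS hs
        WHERE  hs.PlayerID = @PlayerID
        AND    hs.PuzzleID = @PuzzleID
    );

-- The leaderboard becomes a simple ordered read with no ROW_NUMBER needed.
SELECT TOP (50)
    hs.PlayerID, hs.Score, hs.TimeTaken, hs.PlayedDate
FROM #PlayerHighScore AS hs
WHERE hs.PuzzleID = @PuzzleID
ORDER BY hs.Score DESC, hs.TimeTaken, hs.PlayedDate;

DROP TABLE #PlayerHighScore;
```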

Also, if you aren't doing this already, it's highly likely you'll want to use an asynchronous process to compute the "leaderboards" periodically and cache the results, instead of computing them just-in-time. (You could do something like merge the current player's score with the cached leaderboards so the player can see themself on the leaderboards immediately if they got a high score.) See my answer here for some ideas to consider when implementing a caching mechanism.

Sql-server – SHOWPLAN does not display a warning but “Include Execution Plan” does for the same query

This:

SET SHOWPLAN_XML ON;
GO
SELECT * FROM sys.objects;
GO

...is equivalent to pressing Display Estimated Execution Plan on the toolbar (or hitting Ctrl + L). You'll notice that the query returns no rows, unlike when you use Include Actual Execution Plan (Ctrl + M).

The spill warning is only a runtime warning. There is no way that SQL Server can know, when displaying the estimated plan, that a spill will happen at runtime. This is because a spill is caused by factors that might only be present during certain invocations of the query (for example, when there is memory pressure). The estimated plan knows roughly how much memory it's going to ask for, but it can't know until execution that it isn't going to get it.
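For comparison, the T-SQL counterpart of Include Actual Execution Plan is SET STATISTICS XML. The sketch below runs the same query for real, returns its rows, and returns the actual plan alongside them, which is where runtime-only warnings such as spills can appear.

```sql
-- Returns the query results plus the actual (post-execution) plan as XML.
SET STATISTICS XML ON;
GO
SELECT * FROM sys.objects;
GO
SET STATISTICS XML OFF;
GO
```

A trivial query like this one is unlikely to spill. The point is only that the plan is captured after execution rather than before it.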

As an aside, may I recommend* our free tool, SQL Sentry Plan Explorer? I think it provides much more obvious information than Management Studio. I recently wrote a lengthy blog post that can act as a tutorial, and Jonathan Kehayias has a great PluralSight course on it as well.

* Disclaimer: I work for SQL Sentry.