Sql-server – SQL Server Indexed View and TOP

execution-plan query-performance sql-server view

I'm struggling to persuade a query plan to behave as I think it should. The addition of a TOP clause when querying an indexed view is causing a sub-optimal plan, and I'm hoping for some help in sorting it.

Environment

SQL Server 2019
StackOverflow2013 database (50GB version), Compat Mode 150 (problem is not specific to this version)

The setup:

Firstly, I've created a view to return everyone with a high reputation:

CREATE VIEW vwHighReputation
WITH SCHEMABINDING
AS
SELECT  [Id],
        [DisplayName],
        [Reputation]
FROM    [dbo].[Users]
WHERE   [Reputation] > 10000

Next, since I'll be searching by display name, I've created a couple of indexes on the view:

CREATE UNIQUE CLUSTERED INDEX IX_Users_Id ON [dbo].[vwHighReputation]([Id])
GO
CREATE NONCLUSTERED INDEX IX_Users_DisplayName ON [dbo].[vwHighReputation]([DisplayName]) INCLUDE (Reputation)
GO

If I query via the view, I can see my nonclustered index is being used:

SELECT  *
FROM    [dbo].[vwHighReputation]
WHERE   [DisplayName] LIKE 'J%'

Plan: https://www.brentozar.com/pastetheplan/?id=Sy2EoJaiv

So far so good. I can even use my view as part of a more complex query with an OUTER APPLY, and I still get a seek with only 63 reads against my index (this is obviously a contrived example, but helps illustrate the problem that I'll come to):

SELECT  [U].[Id],
        [A].[Reputation],
        [A].[DisplayName]
FROM    [dbo].[Users] AS [U]
        OUTER APPLY (
                        SELECT  *
                        FROM    [dbo].[vwHighReputation] AS [v]
                        WHERE   [v].[Id] = [U].[Id]
                    ) AS [A]
WHERE   [A].[DisplayName] LIKE 'J%';

Plan: https://www.brentozar.com/pastetheplan/?id=HJaw3y6ov

However, if I add a TOP 1 to my OUTER APPLY:

SELECT  [U].[Id],
        [A].[Reputation],
        [A].[DisplayName]
FROM    [dbo].[Users] AS [U]
        OUTER APPLY (
                        SELECT  TOP 1 *
                        FROM    [dbo].[vwHighReputation] AS [v]
                        WHERE   [v].[Id] = [U].[Id]
                    ) AS [A]
WHERE   [A].[DisplayName] LIKE 'J%';

Then the situation gets bad... very, very bad.

Plan: https://www.brentozar.com/pastetheplan/?id=HyOS6yaiw

My logical read count against that view is now almost 5 million. I can see from the plan that SQL Server is now choosing to perform a seek on the clustered index with the User's ID as the predicate, but doing so around 2.5 million times. It is also scanning the whole of the Users table. It no longer seeks on the view's index.
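One way to reproduce read counts like these is SET STATISTICS IO, which prints logical reads per object to the Messages output. The following is only a sketch wrapped around the TOP 1 query above; the exact numbers will depend on the database copy and its indexes.

```sql
-- Report logical reads per table / indexed view for each statement that follows.
SET STATISTICS IO ON;

SELECT  [U].[Id],
        [A].[Reputation],
        [A].[DisplayName]
FROM    [dbo].[Users] AS [U]
        OUTER APPLY (
                        SELECT  TOP 1 *
                        FROM    [dbo].[vwHighReputation] AS [v]
                        WHERE   [v].[Id] = [U].[Id]
                    ) AS [A]
WHERE   [A].[DisplayName] LIKE 'J%';

-- Turn the reporting back off for the session.
SET STATISTICS IO OFF;
```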

Obviously the optimiser is deciding that this is the most efficient approach, but I can't understand why! I think it's probably to do with the way the underlying tables are sorted, but I'm not sure.

Incidentally, rewriting it as a simple subquery rather than OUTER APPLY yields the same outcome.

Any help or advice would be great!

Best Answer

Outer Apply

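You're using OUTER APPLY, but with a WHERE clause that would reject NULL values. Any row that the apply pads with NULLs fails the LIKE predicate, so the outer join can never contribute extra rows.

Without the TOP (1), the optimizer converts it to an inner join: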

SELECT  
    U.Id,
    A.Reputation,
    A.DisplayName
FROM dbo.Users AS U
OUTER APPLY 
(
    SELECT  
        v.*
    FROM dbo.vwHighReputation AS v
    WHERE v.Id = U.Id
) AS A
WHERE A.DisplayName LIKE 'J%'
ORDER BY U.Id;

I've formatted your code a little bit, and added an ORDER BY to validate results across queries. No offense.
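To make the equivalence concrete, here is a sketch of the same query written as a plain inner join. It should return the same rows as the OUTER APPLY form, because the LIKE predicate discards any row where DisplayName is NULL. Whether it produces an identical plan on your system is something to verify rather than assume.

```sql
-- Logically equivalent to the OUTER APPLY query without TOP (1):
-- NULL-extended rows from the apply could never pass the LIKE filter.
SELECT
    U.Id,
    v.Reputation,
    v.DisplayName
FROM dbo.Users AS U
JOIN dbo.vwHighReputation AS v
    ON v.Id = U.Id
WHERE v.DisplayName LIKE 'J%'
ORDER BY U.Id;
```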

Outer Apply + TOP (1)

When you use the TOP (1), the join is of the LEFT OUTER variety:

SELECT  
    U.Id,
    A.Reputation,
    A.DisplayName
FROM dbo.Users AS U
OUTER APPLY 
(
    SELECT TOP (1)
        v.*
    FROM dbo.vwHighReputation AS v
    WHERE v.Id = U.Id
) AS A
WHERE A.DisplayName LIKE 'J%'
ORDER BY U.Id;

The TOP (1) inside the OUTER APPLY apparently makes the optimizer unable to apply the same transformation to an inner join, even with a redundant predicate:

SELECT  
    U.Id,
    A.Reputation,
    A.DisplayName
FROM dbo.Users AS U
OUTER APPLY 
(
    SELECT TOP (1)
        v.*
    FROM dbo.vwHighReputation AS v
    WHERE v.Id = U.Id
    AND   v.DisplayName LIKE 'J%'
) AS A
WHERE A.DisplayName LIKE 'J%'
ORDER BY U.Id;

Note the residual predicates in the plan, which check whether the Id and DisplayName columns are NULL.

This isn't just a TOP (1) issue either -- you can substitute any value up to the bigint maximum (9223372036854775807) and see the same plan.
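As a sketch of that point, this is the same query with the largest value a bigint can hold. A TOP this large cannot remove any rows, yet per the observation above the plan shape stays the same as with TOP (1).

```sql
SELECT
    U.Id,
    A.Reputation,
    A.DisplayName
FROM dbo.Users AS U
OUTER APPLY
(
    -- bigint maximum: logically a no-op, but still a TOP to the optimizer
    SELECT TOP (9223372036854775807)
        v.*
    FROM dbo.vwHighReputation AS v
    WHERE v.Id = U.Id
) AS A
WHERE A.DisplayName LIKE 'J%'
ORDER BY U.Id;
```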

It will also happen if you skip the view entirely.

SELECT  
    U.Id,
    A.Reputation,
    A.DisplayName
FROM dbo.Users AS U
OUTER APPLY 
(
    SELECT TOP (1)
        v.Id,
        v.DisplayName,
        v.Reputation
    FROM dbo.Users AS v
    WHERE v.Reputation > 10000 
    AND   v.Id = U.Id
) AS A
WHERE A.DisplayName LIKE 'J%'
ORDER BY U.Id
OPTION(EXPAND VIEWS);

A Rewrite

One way to get the same effect as TOP (1) without the various optimizer side effects of TOP is to use ROW_NUMBER:

SELECT  
    U.Id,
    A.Reputation,
    A.DisplayName
FROM dbo.Users AS U
OUTER APPLY 
(
    SELECT
        v.*
    FROM
    (
        SELECT 
            v.*,
            ROW_NUMBER() OVER 
            (
                PARTITION BY 
                    v.Id
                ORDER BY
                    v.Id
            ) AS n
        FROM dbo.vwHighReputation AS v
    ) AS v
    WHERE v.Id = U.Id
    AND   v.n = 1
) AS A
WHERE A.DisplayName LIKE 'J%'
ORDER BY U.Id;

Which will get you the original plan.

Related Solutions

Sql-server – Different execution plans depending on columns selected from CTE

The plan without row number is below.

This is assigned a cost of 44.866.

You have a TOP without ORDER BY so SQL Server just needs to scan the clustered index and as soon as it finds the first 30,000 rows matching the predicate it can stop.

The table has 13,283,300 rows. A full clustered index scan is costed at 730.467 + 14.6118 = 745.0788 but this gets scaled down to 43.9392 because of the TOP.

Applying the same scaling of 5.9% (43.9392 / 745.0788) to the number of rows in the table, this would imply that SQL Server estimates that it will only have to scan 783,350 rows before it finds 30,000 matching the WHERE and can stop scanning.

NB: You say that only 474,296 rows match this predicate in the whole table but 508,747 are estimated to. That means that on average one in every 26.1 (13283300/508747) rows is assumed to match the filter. So it is estimated that 30,000 * 26.1 rows ( = 783K) will be read.
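The arithmetic above can be checked directly in T-SQL. This sketch uses only the figures quoted in this answer; the column aliases are illustrative names of my own.

```sql
SELECT
    -- scaled cost / full scan cost: about 0.059, i.e. 5.9%
    43.9392 / 745.0788                  AS scale_factor,
    -- table rows * scale factor: roughly 783K rows expected to be scanned
    13283300 * (43.9392 / 745.0788)     AS rows_implied_by_scaling,
    -- table rows / estimated matching rows: about 26.1 rows read per match
    13283300.0 / 508747                 AS rows_per_match,
    -- 30,000 matches * rows per match: again roughly 783K
    30000 * (13283300.0 / 508747)       AS rows_read_for_top;
```

Both routes land on roughly the same row count, which is why the scaled-down scan cost and the one-in-26.1 selectivity estimate agree.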

When you select *, the rownum column must be calculated. The plan for this is below. It is costed at 69.1185.

You have an index on COLUMNE that can be seeked into. This satisfies the range predicate on COLUMNE >= 1472738400000 AND COLUMNE <= 1475244000000 and also supplies the required ordering for your row numbering.

However it does not cover the query and lookups are needed to return the missing columns. The plan estimates that there will be 30,000 such lookups. There may in fact be more as the predicate on COLUMNF = 1 may mean some rows are discarded after being looked up (though not in this case as you say COLUMNF always has a value of 1).

If the row numbering plan was to use a clustered index scan it would need to be a full scan followed by a sort of all rows matching the predicate. 69.1185 is considerably cheaper than the 745.0788 + sort cost so the plan with lookups is chosen.

You say that the plan with lookups is in fact 5 times faster than the clustered index scan. Likely a much greater proportion of the clustered index needed to be read to find 30,000 matching rows than was assumed in the costings. You are on SQL Server 2014 SP1 CU5. On SQL Server 2014 SP2 the actual execution plan now has a new attribute Actual Rows Read which would tell you how many rows it did actually read. On previous versions you can use OPTION (QUERYTRACEON 9130) to see the same information.
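A sketch of how the trace flag hint might be attached follows. The table name dbo.YourTable is hypothetical, and the query shape is only my reconstruction from the predicates discussed above, not the asker's actual query. QUERYTRACEON also normally requires sysadmin membership, and trace flag 9130 is undocumented, so treat this as a diagnostic aid for a test system.

```sql
-- Hypothetical table name; COLUMNE and COLUMNF are the columns discussed above.
SELECT TOP (30000)
    t.*
FROM dbo.YourTable AS t
WHERE t.COLUMNE >= 1472738400000
AND   t.COLUMNE <= 1475244000000
AND   t.COLUMNF = 1
-- Undocumented trace flag: surfaces pushed-down residual predicates as a
-- separate Filter operator, so the rows flowing into it can be counted.
OPTION (QUERYTRACEON 9130);
```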

Sql-server – Row estimates always too low

(summarizing my comments and putting as answer)

A query rewrite will solve the issue of low row estimates. As Joe Chang explains in his blog post "Query Optimizer Gone Wild - Full-Text":

CONTAINS is "a predicate used in a WHERE clause" per Microsoft documentation, while CONTAINSTABLE acts as a table.

You get a much better plan (merge join) using CONTAINSTABLE, whereas the actual plan using CONTAINS uses a nested loop join with low row estimates.

You can rewrite the query as:

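SELECT TOP 30 p.PersonId,
              p.PersonParentId,
              p.PersonName,
              p.PersonPostCode
FROM dbo.People p
-- CONTAINSTABLE takes the table as its first argument and returns [KEY] and [RANK];
-- this assumes PersonId is the full-text key column of dbo.People.
LEFT JOIN CONTAINSTABLE(dbo.People, ContactFullText, '"mr" AND "ch*"') cf ON cf.[KEY] = p.PersonId
WHERE p.PersonDeletionDate IS NULL
      AND p.PersonCustomerId = 24
      --AND CONTAINS(ContactFullText, '"mr" AND "ch*"')
      AND p.PersonGroupId IN(197, 206, 186, 198)
      -- filtering on cf.[RANK] rejects unmatched rows, so this behaves as an inner join
      AND cf.[RANK] > 0
ORDER BY p.PersonParentId,
         p.PersonName;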