How to achieve predicate pushdown in the view

optimization query-performance sql-server-2019

I have a reporting table (about 1bn rows), and a tiny dimension table:

CREATE TABLE dbo.Sales_unpartitioned (
    BusinessUnit    int NOT NULL,
    [Date]          date NOT NULL,
    SKU             varchar(8) NOT NULL,
    Quantity        numeric(10, 2) NOT NULL,
    Amount          numeric(10, 2) NOT NULL,
    CONSTRAINT PK_Sales_unpartitioned PRIMARY KEY CLUSTERED (BusinessUnit, [Date], SKU)
);

--- Demo data:
INSERT INTO dbo.Sales_unpartitioned
SELECT severity AS BusinessUnit,
       DATEADD(day, message_id, '2000-01-01') AS [Date],
       LEFT([text], 3) AS SKU,
       1000.*RAND(CHECKSUM(NEWID())) AS Quantity,
       10000.*RAND(CHECKSUM(NEWID())) AS Amount
FROM sys.messages
WHERE [language_id]=1033;

--- Artificially inflate statistics of demo data:
UPDATE STATISTICS dbo.Sales_unpartitioned WITH ROWCOUNT=1000000000;

--- Dimension table:
CREATE TABLE dbo.BusinessUnits (
    BusinessUnit    int NOT NULL,
    SalesManager    nvarchar(250) NULL,
    PRIMARY KEY CLUSTERED (BusinessUnit)
);

INSERT INTO dbo.BusinessUnits (BusinessUnit)
SELECT DISTINCT BusinessUnit FROM dbo.Sales_unpartitioned;

… to which I've added a reporting view used by an application for OLTP-style reporting.

CREATE OR ALTER VIEW dbo.SalesReport_unpartitioned
AS

SELECT bu.BusinessUnit,
       s.[Date],
       s.SKU,
       s.Quantity,
       s.Amount
FROM dbo.BusinessUnits AS bu
CROSS APPLY (
    --- Regular sales
    SELECT t.BusinessUnit, t.[Date], t.SKU, t.Quantity, t.Amount
    FROM dbo.Sales_unpartitioned AS t
    WHERE t.BusinessUnit=bu.BusinessUnit
      AND t.SKU LIKE 'T%'

    UNION ALL

    --- This is a special reporting entry. We only
    --- want to see today's row. In case of duplicates,
    --- get the row with the first "SKU".
    SELECT TOP (1) s.BusinessUnit, s.[Date], s.SKU, s.Quantity, s.Amount
    FROM dbo.Sales_unpartitioned AS s
    WHERE s.BusinessUnit=bu.BusinessUnit
      AND s.[Date]=CAST(SYSDATETIME() AS date)
      AND s.SKU LIKE 'S%'
    ORDER BY s.BusinessUnit, s.[Date], s.SKU
) AS s

The idea is that the user application will query this view with a SELECT query that filters on a range of dates and one or more BusinessUnits. For this purpose, I've chosen a CROSS APPLY pattern, so that the query can "loop" over each BusinessUnit, seek to a range of Date, and apply a residual filter on SKU.

Example app query:

DECLARE @from date='2021-01-01', @to date='2021-12-31';

SELECT *
FROM dbo.SalesReport_unpartitioned
WHERE BusinessUnit=16
  AND [Date] BETWEEN @from AND @to
ORDER BY BusinessUnit, [Date], SKU;

I would expect a query plan that looks like this:

Desired plan

However, the plan turns out like this:

Actual plan

I expected SQL Server to do a "predicate pushdown" on the Date column, allowing the Clustered Index Seek to look for a single BusinessUnit and a range of Date, then apply a residual predicate on SKU. This works on the Seek in the "s" branch (the one with TOP) – probably because it has a hard-coded Date predicate in the query – but not on the "t" branch.

On the "t" branch, however, SQL Server only seeks to the specific BusinessUnit with a residual predicate on SKU, effectively retrieving all dates. Only at the end of the plan does it apply a Filter operator on the Date column.

In a large table, this has a very significant performance penalty – you could end up reading 20 years of data from disk when all you're looking for is a week.
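One way to put a number on the difference is to compare logical reads between plan variants. This is only a sketch: the demo table has an inflated ROWCOUNT but few actual rows, so the absolute figures will be small and only the relative change is meaningful.

```sql
--- Report logical reads per table in the Messages tab.
SET STATISTICS IO ON;

DECLARE @from date='2021-01-01', @to date='2021-12-31';

SELECT *
FROM dbo.SalesReport_unpartitioned
WHERE BusinessUnit=16
  AND [Date] BETWEEN @from AND @to
ORDER BY BusinessUnit, [Date], SKU;

SET STATISTICS IO OFF;
```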

Things I've tried

Workarounds:

Converting the view to an inline table valued function with @fromDate and @toDate parameters that filter the "s" and "t" queries will enable a Seek on (BusinessUnit, Date) as desired, but requires rewriting the app code.
Moving the UNION ALL out of the CROSS APPLY (from CROSS APPLY (UNION ALL) to CROSS APPLY() UNION ALL CROSS APPLY()) will enable predicate pushdown. It makes one more seek on the BusinessUnits table, which is perfectly acceptable.
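For reference, the two workarounds look roughly like this. The names dbo.SalesReport_unpartitioned_fn and dbo.SalesReport_split, the parameter names, and the alias x are hypothetical, chosen only for this sketch.

```sql
--- Workaround 1: inline table-valued function (hypothetical name).
--- The date range is written inside both branches of the apply.
CREATE OR ALTER FUNCTION dbo.SalesReport_unpartitioned_fn
(
    @fromDate date,
    @toDate   date
)
RETURNS TABLE
AS
RETURN
(
    SELECT bu.BusinessUnit, s.[Date], s.SKU, s.Quantity, s.Amount
    FROM dbo.BusinessUnits AS bu
    CROSS APPLY (
        SELECT t.BusinessUnit, t.[Date], t.SKU, t.Quantity, t.Amount
        FROM dbo.Sales_unpartitioned AS t
        WHERE t.BusinessUnit=bu.BusinessUnit
          AND t.[Date] BETWEEN @fromDate AND @toDate
          AND t.SKU LIKE 'T%'

        UNION ALL

        SELECT TOP (1) s.BusinessUnit, s.[Date], s.SKU, s.Quantity, s.Amount
        FROM dbo.Sales_unpartitioned AS s
        WHERE s.BusinessUnit=bu.BusinessUnit
          AND s.[Date]=CAST(SYSDATETIME() AS date)
          AND s.[Date] BETWEEN @fromDate AND @toDate
          AND s.SKU LIKE 'S%'
        ORDER BY s.BusinessUnit, s.[Date], s.SKU
    ) AS s
);
GO

--- Workaround 2: UNION ALL moved outside the apply (hypothetical view name).
CREATE OR ALTER VIEW dbo.SalesReport_split
AS
SELECT bu.BusinessUnit, t.[Date], t.SKU, t.Quantity, t.Amount
FROM dbo.BusinessUnits AS bu
CROSS APPLY (
    SELECT x.[Date], x.SKU, x.Quantity, x.Amount
    FROM dbo.Sales_unpartitioned AS x
    WHERE x.BusinessUnit=bu.BusinessUnit
      AND x.SKU LIKE 'T%'
) AS t

UNION ALL

SELECT bu.BusinessUnit, s.[Date], s.SKU, s.Quantity, s.Amount
FROM dbo.BusinessUnits AS bu
CROSS APPLY (
    SELECT TOP (1) x.[Date], x.SKU, x.Quantity, x.Amount
    FROM dbo.Sales_unpartitioned AS x
    WHERE x.BusinessUnit=bu.BusinessUnit
      AND x.[Date]=CAST(SYSDATETIME() AS date)
      AND x.SKU LIKE 'S%'
    ORDER BY x.BusinessUnit, x.[Date], x.SKU
) AS s;
GO

--- The function needs a changed call from the app:
DECLARE @from date='2021-01-01', @to date='2021-12-31';

SELECT *
FROM dbo.SalesReport_unpartitioned_fn(@from, @to)
WHERE BusinessUnit=16
ORDER BY BusinessUnit, [Date], SKU;
```

The split view can be queried exactly like the original view, which is why it does not need app changes.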

Fixes the Seek, but changes the results:

Surprisingly, removing the TOP (1) and ORDER BY for the "s" query makes predicate pushdown work on "t", but can return too many rows from "s".
Eliminating UNION ALL by either removing the "s" or "t" query will enable predicate pushdown, but generate incorrect results.

No change or not feasible:

Replacing TOP (1) with a ROW_NUMBER() pattern does not change the Seek.
Changing the CROSS APPLY to a forced INNER LOOP JOIN fixes the Seek on "t", but actually changes "s" to a Scan instead, which is even worse.
Adding trace flag 8780 to allow the optimizer to work on a plan for longer does not change anything. The plan is already optimized FULL with no early termination.

A common thread seems to be that changing/simplifying the "s" query (removing TOP, ORDER BY) fixes the problem on the "t" query, which feels counter-intuitive to me.

What I'm looking for

I'm trying to understand if this is a shortcoming of the optimizer, if it's the result of a deliberate costing/optimization mechanism, or if I've simply overlooked something.

Best Answer

I'm trying to understand if this is a shortcoming of the optimizer, if it's the result of a deliberate costing/optimization mechanism, or if I've simply overlooked something.

It's a little bit of all of those.

There's a lot going on in the query presented — too much really — so to avoid writing half a book about it, I am going to boil it down to the main element that is causing you not to get the plan you are after:

The optimizer does not push predicates down the inner side of an apply.

The rule that operates on relational selections (filters, predicates) above an apply is called, naturally enough, SELonApply. It performs the following logical substitution:

Sel (A Apply B) -> Sel (Sel A Apply B)

It takes part(s) of a potentially complex selection involving both A and B, and pushes those it can down to the driving table A. No part of the selection is pushed to B. The part(s) of the selection that cannot be pushed down remain behind.

This might sound like a shocking oversight, and counter to experience. That's because it is not the full story.

The optimizer tries to convert an apply to the equivalent join early on in the compilation process (during simplification, before trivial plan and cost-based optimization). It is capable of pushing selections down either side of a join, where it is safe. That join may in turn be transformed into a physical apply during cost-based optimization.

The effect of all this is to make it seem like the optimizer pushed a predicate down the inner side of an apply:

1. Written apply transformed to a join.
2. Predicate(s) pushed down either side of the join.
3. Join transformed to an apply.

Let me show you an example:

DECLARE @T1 table (pk integer PRIMARY KEY, c1 integer NOT NULL INDEX ic1);
DECLARE @T2 table (fk integer NOT NULL, c2 integer NOT NULL, PRIMARY KEY (fk, c2));

SELECT 
    T1.*,
    T2.*
FROM @T1 AS T1
CROSS APPLY 
(
    SELECT T2.* 
    FROM @T2 AS T2
    WHERE T2.fk = T1.pk
) AS T2
WHERE 
    1 = 1
    AND T1.c1 = 1
    AND T2.c2 = 2;

If you look carefully at the plan, you will see the predicate on T2 pushed to the inner side seek, and the nested loop join is an apply (it has outer references). This was only possible because the optimizer was able to rewrite the apply as a join initially, push the predicates, then transform back to an apply later on.

We can disable the apply-to-join transformation using undocumented trace flag 9114:

DECLARE @T1 table (pk integer PRIMARY KEY, c1 integer NOT NULL INDEX ic1);
DECLARE @T2 table (fk integer NOT NULL, c2 integer NOT NULL, PRIMARY KEY (fk, c2));

SELECT 
    T1.*,
    T2.*
FROM @T1 AS T1
CROSS APPLY 
(
    SELECT T2.* 
    FROM @T2 AS T2
    WHERE T2.fk = T1.pk
) AS T2
WHERE 
    1 = 1
    AND T1.c1 = 1
    AND T2.c2 = 2
OPTION (QUERYTRACEON 9114);

This means only SELonApply can be used, which only pushes to the driving table A:

Notice the part of the selection on T2.c2 is 'stuck' above the apply, in a filter. The inner side seek is only on the FK/PK equality specified inside the apply.

The optimizer is built on relational principles. It appreciates a relational schema design, and queries that use relational constructs. Apply (lateral join) is a relatively new extension. The optimizer knows a lot more tricks with join than it does with apply, hence the early effort to rewrite.

When you use things like apply, or the (non-relational) Top, you are implicitly taking more responsibility for the final plan shape. In other words, you will more often have to express your query differently (as in your workaround) to get a good outcome.
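As a small illustration of explicit predicate placement, here is the earlier table-variable example with the T2.c2 predicate written inside the apply rather than above it. This is a sketch, not a plan guarantee: with trace flag 9114 still on, the predicate no longer depends on the apply-to-join rewrite, so I would expect the inner-side seek to cover both fk and c2, but check the plan on your own build.

```sql
DECLARE @T1 table (pk integer PRIMARY KEY, c1 integer NOT NULL INDEX ic1);
DECLARE @T2 table (fk integer NOT NULL, c2 integer NOT NULL, PRIMARY KEY (fk, c2));

SELECT 
    T1.*,
    T2.*
FROM @T1 AS T1
CROSS APPLY 
(
    SELECT T2.* 
    FROM @T2 AS T2
    WHERE T2.fk = T1.pk
      AND T2.c2 = 2   -- written inside the apply instead of above it
) AS T2
WHERE 
    T1.c1 = 1
OPTION (QUERYTRACEON 9114);
```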

My preference would be to use the inline table-valued function with explicit predicate placement. If I were to rewrite the view, I might go with:

CREATE OR ALTER VIEW dbo.SalesReport_unpartitioned
AS
--- Regular sales
SELECT
    BU.BusinessUnit,
    RS.[Date],
    RS.SKU,
    RS.Quantity,
    RS.Amount
FROM dbo.BusinessUnits AS BU
JOIN dbo.Sales_unpartitioned AS RS
    ON RS.BusinessUnit = BU.BusinessUnit
WHERE 
    RS.SKU LIKE 'T%'

UNION ALL

--- This is a special reporting entry.
SELECT
    BU.BusinessUnit,
    SR.[Date],
    SR.SKU,
    SR.Quantity,
    SR.Amount
FROM dbo.BusinessUnits AS BU
JOIN dbo.Sales_unpartitioned AS SR
    ON SR.BusinessUnit = BU.BusinessUnit
WHERE
    1 = 1
    AND SR.SKU LIKE 'S%'
    --- We only want to see today's row.
    AND SR.[Date] = CONVERT(date, SYSDATETIME())
    --- In case of duplicates, get the row with the first "SKU".
    AND SR.SKU =
    (
        SELECT 
            MIN(SR2.SKU) 
        FROM dbo.Sales_unpartitioned AS SR2
        WHERE 
            SR2.BusinessUnit = SR.BusinessUnit
            AND SR2.[Date] = SR.[Date]
            AND SR2.SKU LIKE 'S%'
    );
GO

For the provided test query:

DECLARE @from date='2021-01-01', @to date='2021-12-31';

SELECT *
FROM dbo.SalesReport_unpartitioned
WHERE BusinessUnit=16
  AND [Date] BETWEEN @from AND @to
ORDER BY BusinessUnit, [Date], SKU;

The execution plan is:

The orange section is regular sales. The yellow section is for the special reporting entry.

Related Solutions

Sql-server – Help finding join without predicate

In my CTE I was missing a WITH (NOEXPAND) query hint. Once I added this query hint the join without predicate disappeared from my query plan.

;WITH AggregateStepData_CTE AS
(
    SELECT
        [UA].[UserId]
        , [UA].[DeviceId]
        , SUM(ISNULL([UA].[LatestSteps], 0)) AS [Steps]
    FROM [User].[UserStatus] [UA]
        INNER JOIN [User].[CurrentConnections] [M] WITH (NOEXPAND) -- Added query hint here
          ON [M].[Monitored] = [UA].[UserId] AND [M].[Monitor] = @UserId
    WHERE
        [M].[ShareSteps] = 1 -- Only use step data if we are allowed to see.
        AND
        CAST([UA].[ReportedLocalTime] AS DATE) = 
          CAST(DATEADD(MINUTE, DATEPART(TZOFFSET, [UA].[ReportedLocalTime]), @Now) AS DATE) 
          -- Aggregate the steps for today based on the device's time zone.         
    GROUP BY
        [UA].[UserId]
        , [UA].[DeviceId]
)

Sql-server – Difference between Seek Predicate and Predicate

Let's throw one million rows into a temp table along with a few columns:

CREATE TABLE #174860 (
PK INT NOT NULL, 
COL1 INT NOT NULL,
COL2 INT NOT NULL,
PRIMARY KEY (PK)
);

INSERT INTO #174860 WITH (TABLOCK)
SELECT RN
, RN % 1000
, RN % 10000
FROM 
(
    SELECT TOP 1000000 ROW_NUMBER () OVER (ORDER BY (SELECT NULL)) RN
    FROM   master..spt_values v1,
           master..spt_values v2
) t;

CREATE INDEX IX_174860_IX ON #174860 (COL1) INCLUDE (COL2);

Here I have a clustered index (by default) on the PK column. There's a nonclustered index on COL1 that has a key column of COL1 and includes COL2.

Consider the following query:

SELECT *
FROM #174860
WHERE PK >= 15000 AND PK < 15005
AND COL2 = 5000;

Here I'm not using BETWEEN because Aaron Bertrand is hanging around this question.

How should SQL Server optimize that query? Well, I know that the filter on PK will reduce the result set to five rows. SQL Server can use the clustered index to jump to those five rows instead of reading through all million rows in the table. However, the clustered index only has the PK column as a key column. Once the row is read into memory we need to apply the filter on COL2. Here, PK is a seek predicate and COL2 is a predicate.

SQL Server finds five rows using the seek predicate and further reduces those five rows to one row with the normal predicate.
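If you want to see the two kinds of predicate without a graphical plan, one option is to request the actual plan as XML. This sketch assumes the temp table above still exists in the session. In the returned showplan XML, the seek predicate appears under a SeekPredicates element of the index operator and the residual one under a Predicate element.

```sql
--- Return the actual execution plan as an extra XML result set.
SET STATISTICS XML ON;

SELECT *
FROM #174860
WHERE PK >= 15000 AND PK < 15005
AND COL2 = 5000;

SET STATISTICS XML OFF;
```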

If I define the clustered index differently:

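--- Drop the earlier version first, then reload it with the INSERT above.
DROP TABLE IF EXISTS #174860;

CREATE TABLE #174860 (
PK INT NOT NULL, 
COL1 INT NOT NULL,
COL2 INT NOT NULL,
PRIMARY KEY (COL2, PK)
);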

And run the same query, I get a different plan:

In this case, SQL Server can seek using both columns in the WHERE clause. Exactly one row is read from the table using the key columns.

For one more example consider this query:

SELECT *
FROM #174860
WHERE COL1 = 500
AND COL2 = 3545;

The IX_174860_IX index is a covering index because it contains all of the columns needed for the query. However, only COL1 is a key column. SQL Server can seek with that column to find the 1000 rows with a matching COL1 value. It can further filter down those rows on the COL2 column to reduce the final result set to 0 rows.
