I have a reporting table (about 1bn rows), and a tiny dimension table:
CREATE TABLE dbo.Sales_unpartitioned (
BusinessUnit int NOT NULL,
[Date] date NOT NULL,
SKU varchar(8) NOT NULL,
Quantity numeric(10, 2) NOT NULL,
Amount numeric(10, 2) NOT NULL,
CONSTRAINT PK_Sales_unpartitioned PRIMARY KEY CLUSTERED (BusinessUnit, [Date], SKU)
);
--- Demo data:
INSERT INTO dbo.Sales_unpartitioned
SELECT severity AS BusinessUnit,
DATEADD(day, message_id, '2000-01-01') AS [Date],
LEFT([text], 3) AS SKU,
1000.*RAND(CHECKSUM(NEWID())) AS Quantity,
10000.*RAND(CHECKSUM(NEWID())) AS Amount
FROM sys.messages
WHERE [language_id]=1033;
--- Artificially inflate statistics of demo data:
UPDATE STATISTICS dbo.Sales_unpartitioned WITH ROWCOUNT=1000000000;
--- Dimension table:
CREATE TABLE dbo.BusinessUnits (
BusinessUnit int NOT NULL,
SalesManager nvarchar(250) NULL,
PRIMARY KEY CLUSTERED (BusinessUnit)
);
INSERT INTO dbo.BusinessUnits (BusinessUnit)
SELECT DISTINCT BusinessUnit FROM dbo.Sales;
… to which I've added a reporting view used by an application for OLTP-style reporting.
CREATE OR ALTER VIEW dbo.SalesReport_unpartitioned
AS
SELECT bu.BusinessUnit,
s.[Date],
s.SKU,
s.Quantity,
s.Amount
FROM dbo.BusinessUnits AS bu
CROSS APPLY (
--- Regular sales
SELECT t.BusinessUnit, t.[Date], t.SKU, t.Quantity, t.Amount
FROM dbo.Sales_unpartitioned AS t
WHERE t.BusinessUnit=bu.BusinessUnit
AND t.SKU LIKE 'T%'
UNION ALL
--- This is a special reporting entry. We only
--- want to see today's row. In case of duplicates,
--- get the row with the first "SKU".
SELECT TOP (1) s.BusinessUnit, s.[Date], s.SKU, s.Quantity, s.Amount
FROM dbo.Sales_unpartitioned AS s
WHERE s.BusinessUnit=bu.BusinessUnit
AND s.[Date]=CAST(SYSDATETIME() AS date)
AND s.SKU LIKE 'S%'
ORDER BY s.BusinessUnit, s.[Date], s.SKU
) AS s
The idea is that the user application will query this view with a SELECT query that filters on a range of dates and one or more BusinessUnits. For this purpose, I've chosen a CROSS APPLY
pattern, so that the query can "loop" over each BusinessUnit, seek to a range of Date, and apply a residual filter on SKU.
Example app query:
DECLARE @from date='2021-01-01', @to date='2021-12-31';
SELECT *
FROM dbo.SalesReport_unpartitioned
WHERE BusinessUnit=16
AND [Date] BETWEEN @from AND @to
ORDER BY BusinessUnit, [Date], SKU;
I would expect a query plan that looks like this:
Desired plan
However, the plan turns out like this:
Actual plan
I expected SQL Server to do a "predicate pushdown" on the Date column, allowing the Clustered Index Seek to look for a single BusinessUnit and a range of Date, then apply a residual predicate on SKU. This works on the Seek in the "s" branch (the one with TOP
) – probably because it has a hard-coded Date predicate in the query – but not on the "t" branch.
However, on the "t" branch SQL Server only seeks to the specific BusinessUnit with a residual predicate on SKU, effectively retrieving all dates. Only at the end of the plan does it applies a Filter operator that filters on the Date column.
In a large table, this has a very significant performance penalty – you could end up reading 20 years of data from disk when all you're looking for is a week.
Things I've tried
Workarounds:
- Converting the view to an inline table valued function with @fromDate and @toDate parameters that filter the "s" and "t" queries will enable a Seek on (BusinessUnit, Date) as desired, but requires rewriting the app code.
- Moving the
UNION ALL
out of theCROSS APPLY
(fromCROSS APPLY (UNION)
toCROSS APPLY() UNION CROSS APPLY()
) will enable predicate pushdown. It makes one more seek on the BusinessUnit table, which is perfectly acceptable.
Fixes the Seek, but changes the results:
- Surprisingly, removing the
TOP (1)
andORDER BY
for the "s" query makes predicate pushdown work on "t", but can give return too many rows from "s". - Eliminating
UNION ALL
by either removing the "s" or "t" query will enable predicate pushdown, but generate incorrect results.
No change or not feasible:
- Replacing
TOP (1)
with aROW_NUMBER()
pattern does not change the Seek. - Changing the
CROSS APPLY
to a forcedINNER LOOP JOIN
fixes the Seek on "t", but actually changes "s" to a Scan instead, which is even worse. - Adding trace flag 8780 to allow the optimizer to work on a plan for longer does not change anything. The plan is already optimized FULL with no early termination.
A common thread seems to be that changing/simplifying the "s" query (removing TOP
, ORDER BY
) fixes the problem on the "t" query, which feels counter-intuitive to me.
What I'm looking for
I'm trying to understand if this is a shortcoming of the optimizer, if it's the result of a deliberate costing/optimization mechanism, or if I've simply overlooked something.
Best Answer
It's a little bit of all of those.
There's a lot going on in the query presented — too much really — so to avoid writing half a book about it, I am going to boil it down to the main element that is causing you not to get the plan you are after:
The optimizer does not push predicates down the inner side of an apply.
The rule that operates on relational selections (filters, predicates) above an apply is called, naturally enough,
SELonApply
. It performs the following logical substitution:It takes part(s) of a potentially complex selection involving both A and B, and pushes those it can down to the driving table A. No part of the selection is pushed to B. The part(s) of the selection that cannot be pushed down remain behind.
This might sound like a shocking oversight, and counter to experience. That's because it is not the full story.
The optimizer tries to convert an apply to the equivalent join early on in the compilation process (during simplification, before trivial plan and cost-based optimization). It is capable of pushing selections down either side of a join, where it is safe. That join may in turn be transformed into a physical apply during cost-based optimization.
The effect of all this is to make it seem like the optimizer pushed a predicate down the inner side of an apply:
Let me show you an example:
If you look carefully at the plan, you will see the predicate on T2 pushed to the inner side seek, and the nested loop join is an apply (it has outer references). This was only possible because the optimizer was able to rewrite the apply as a join initially, push the predicates, then transform back to an apply later on.
We can disable the apply-to-join transformation using undocumented trace flag 9114:
This means only
SELonApply
can be used, which only pushes to the driving table A:Notice the part of the selection on T2.c2 is 'stuck' above the apply, in a filter. The inner side seek is only on the FK/PK equality specified inside the apply.
The optimizer is built on relational principles. It appreciates a relational schema design, and queries that use relational constructs. Apply (lateral join) is a relatively new extension. The optimizer knows a lot more tricks with join than it does with apply, hence the early effort to rewrite.
When you use things like apply, or the (non-relational) Top, you are implicitly taking more responsibility for the final plan shape. In other words, you will more often have to express your query differently (as in your workaround) to get a good outcome.
My preference would be to use the inline table-valued function with explicit predicate placement. If I were to rewrite the view, I might go with:
For the provided test query:
The execution plan is:
The orange section is regular sales. The yellow section is for the special reporting entry.