Basically seems to be picking the MAX id from a partition, not max across all partitions
Writing TOP (1)
without an ORDER BY
clause to define which row is 'top' means the query processor is logically free to return any row from the set. The query plan selected by the optimizer happens to return a particular row (highest id from the first partition) but you cannot rely on this, even if it were a useful result.
Whenever you use TOP
you should always specify an ORDER BY
at the same scope to produce deterministic behaviour - unless you really do not care which row(s) come back.
Given the table size SELECT MAX(id) FROM tableA will not perform well enough
The optimizer is lacking some logic to transform a scalar MAX
or MIN
aggregate over a partitioned index to a global aggregate over per-partition aggregates. Itzik Ben-Gan explains the limitation and provides a general workaround in this article.
If the highest partition number is known and guaranteed not to change, the workaround to specify a literal partition using the $partition
function will work, though it may fail in a non-obvious way if the partitioning strategy changes in future.
This 'solution' works by eliminating all but one partition, resulting in a simple seek on one partition of the index.
Adding an order by id does not improve performance for some reason
The same optimizer limitation broadly applies to TOP (1) ... ORDER BY
. The ORDER BY
makes the result deterministic, but does not help produce a more efficient plan in this particular case (but see below).
Implied Index Keys
Your index is on id DESC, timeSampled DESC
. In SQL Server 2008 and later, partitioning introduces an extra implied leading key on $partition ASC
(it is always ascending, it is not configurable) making the full index key $partition ASC, id DESC, timeSampled DESC
.
Since id
and timeSampled
increase together (though there is nothing in the schema to guarantee this) you could rewrite the query as TOP (1) ... ORDER BY $partition DESC, id DESC
. Unfortunately, the DESC
keys on your index and ASC
implied leading key $partition
means the index could not be used to scan just one row from the index in order.
If your index keys were instead id ASC, timeSampled ASC
the whole index key would be $partition ASC, id ASC, timeSampled ASC
. This all-ASC
index could be scanned backward, returning just the first row in key order. This row would be guaranteed to have the highest id
value in the highest-numbered partition. Given the (unenforced) relationship between id and partition id, this would produce the correct result with an optimal execution plan that reads just a single row.
This 'solution' lacks integrity because the id-timeSampled relationship is not enforced, and you probably do not want to rebuild the nonclustered primary key anyway. Nevertheless, I mention it because it may enhance your understanding of how partitioning interacts with indexes.
On reason this can happen is that you're using local variables.
The problem is that this query takes so much time to go, despite all
of the indexes i've made on different columns.
Here's an example using a similar setup. In the Stack Overflow schema there's a narrow-ish table called Votes that looks like this.
With no index on CreationDate
, our only option would be to scan the Clustered Index. But if we create one only on CreationDate
, the optimizer can choose to use that index if it thinks doing a Key Lookup for the rest of the columns is cheaper than scanning the Clustered Index and applying a predicate.
CREATE INDEX ix_yourmom ON dbo.Votes(CreationDate)
For this query:
DECLARE @StartDate DATETIME = '2010-07-01';
DECLARE @EndDate DATETIME = '2010-07-02';
SELECT *
FROM dbo.Votes AS v
WHERE v.CreationDate BETWEEN @StartDate AND @EndDate;
GO
The cardinality estimate for unknown variables using between is 16.4317%. That leads to a clustered index scan and a missing index request for an index that covers the entire query.
If you run the query with RECOMPILE, you allow for the parameter embedding optimization.
DECLARE @StartDate DATETIME = '2010-07-01';
DECLARE @EndDate DATETIME = '2010-07-02';
SELECT *
FROM dbo.Votes AS v
WHERE v.CreationDate BETWEEN @StartDate AND @EndDate
OPTION ( RECOMPILE );
Which gives us a different query plan, and a more accurate estimate.
Hope this helps!
Best Answer
As a another option
I've done this using a CTE query as I find it easier to follow and explain.
The first CTE is just your
Ago
data.The second CTE ranks each row of the
Ago
data usingROW_NUMBER
based on the ordering of theTotalUnits
. I have include theDate
in the order so that if there are duplicateTotalUnits
the first occurrence will be picked.In the final query we are using
CASE
statements to pivot the data and aggregating the result for eachUCC12UPCCode
withMAX
to create a single row for eachUCC12UPCCode
.This query can be run over multiple
UCC12UPCCode
's