If you don't need up-to-the-second results, you could simply run your query from time to time and cache the results.
If you do need real-time data (at the cost of insert performance), I would do this:
Since self-joins are not allowed in indexed views, you need to create two copies of each table:
    CREATE TABLE personBrother
    (
        personId    INT NOT NULL,
        brotherName INT NOT NULL
    )
    CREATE TABLE personBrother2
    (
        personId    INT NOT NULL,
        brotherName INT NOT NULL
    )
Create an indexed view on their join:
    CREATE VIEW commonBrothers
    WITH SCHEMABINDING
    AS
    SELECT  p1.personId AS p1,
            p2.personId AS p2,
            COUNT_BIG(*) AS cnt
    FROM    dbo.personBrother p1
    JOIN    dbo.personBrother2 p2
    ON      p1.brotherName = p2.brotherName
    WHERE   p1.personId < p2.personId
    GROUP BY
            p1.personId, p2.personId
    GO
    -- CREATE VIEW must be the only statement in its batch, hence the GO above.
    CREATE UNIQUE CLUSTERED INDEX
            ux_commonBrothers_p1_p2
    ON      commonBrothers (p1, p2)
    GO
    CREATE INDEX
            ix_commonBrothers_cnt
    ON      commonBrothers (cnt)
Do the same for sisters.
You must keep the two copies of each table in sync yourself (for example with a trigger, or by applying every insert, update and delete to both).
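A minimal sketch of the trigger approach follows. It is not part of the original answer: the trigger name is hypothetical, and it assumes that (personId, brotherName) is unique in both tables and that nothing else writes to personBrother2 directly.

```sql
-- Sketch only: mirrors every change on personBrother into personBrother2.
-- Assumes (personId, brotherName) is unique in both tables.
-- The trigger name is hypothetical.
CREATE TRIGGER dbo.trg_personBrother_mirror
ON dbo.personBrother
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;

    -- Remove the old versions of deleted or updated rows from the copy.
    DELETE p2
    FROM   dbo.personBrother2 AS p2
    JOIN   deleted AS d
      ON   d.personId    = p2.personId
     AND   d.brotherName = p2.brotherName;

    -- Add the new versions of inserted or updated rows to the copy.
    INSERT INTO dbo.personBrother2 (personId, brotherName)
    SELECT i.personId, i.brotherName
    FROM   inserted AS i;
END;
```

A matching trigger would be needed for the sisters tables. The alternative is to have the application write to both tables inside one transaction.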
Now we can easily get pairs with the most brothers and most sisters:
    SELECT  TOP 1 WITH TIES
            *
    FROM    commonBrothers
    ORDER BY
            cnt DESC
All we need now is to fetch the greatest sum. Unfortunately, we cannot index a join of these views (this is purely an implementation restriction; there is no theoretical limitation).
So we use the following observation: the pair with the greatest total cannot have fewer common sisters than the pair with the most common brothers has, and likewise cannot have fewer common brothers than the pair with the most common sisters has. This gives the query:
    SELECT  TOP 1 WITH TIES
            cb.p1, cb.p2, cb.cnt + cs.cnt AS totalCnt
    FROM    commonBrothers cb
    JOIN    commonSisters cs
    ON      cs.p1 = cb.p1
            AND cs.p2 = cb.p2
    WHERE   cs.cnt >=
            (
            SELECT  MAX(cst.cnt)
            FROM    (
                    SELECT  TOP 1 WITH TIES
                            p1, p2
                    FROM    commonBrothers
                    ORDER BY
                            cnt DESC
                    ) cbt
            JOIN    commonSisters cst
            ON      cst.p1 = cbt.p1
                    AND cst.p2 = cbt.p2
            )
            AND cb.cnt >=
            (
            SELECT  MAX(cbt.cnt)
            FROM    (
                    SELECT  TOP 1 WITH TIES
                            p1, p2
                    FROM    commonSisters
                    ORDER BY
                            cnt DESC
                    ) cst
            JOIN    commonBrothers cbt
            ON      cbt.p1 = cst.p1
                    AND cbt.p2 = cst.p2
            )
    ORDER BY
            totalCnt DESC
If the numbers of common brothers and sisters are correlated, this query will be very fast.
This solution has two drawbacks:
DML performance: if you insert or delete a record for a name shared by a million brothers, the indexed view will get 2M inserts or deletes. This is the price you pay for a real-time query: the kind of data you are asking for cannot be easily indexed.
Pairs with 0 common brothers or 0 common sisters will not be indexed. If there's a chance that the top pair has no common brothers or no common sisters, you should amend the last query a little.
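One possible amendment for the second drawback is sketched below. It is an assumption about what such a change could look like, not the author's own query: it replaces the inner join with a full outer join and treats a missing count as zero. It also drops the pruning filters, so it will probably read both views in full and be slower than the query above.

```sql
-- Sketch only: includes pairs that appear in just one of the two views.
-- A pair missing from a view contributes 0 to the total.
SELECT  TOP 1 WITH TIES
        COALESCE(cb.p1, cs.p1) AS p1,
        COALESCE(cb.p2, cs.p2) AS p2,
        COALESCE(cb.cnt, 0) + COALESCE(cs.cnt, 0) AS totalCnt
FROM    dbo.commonBrothers AS cb
FULL JOIN
        dbo.commonSisters AS cs
ON      cs.p1 = cb.p1
        AND cs.p2 = cb.p2
ORDER BY
        totalCnt DESC;
```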
Basically seems to be picking the MAX id from a partition, not max across all partitions
Writing TOP (1) without an ORDER BY clause to define which row is 'top' means the query processor is logically free to return any row from the set. The query plan selected by the optimizer happens to return a particular row (highest id from the first partition) but you cannot rely on this, even if it were a useful result.
Whenever you use TOP you should always specify an ORDER BY at the same scope to produce deterministic behaviour - unless you really do not care which row(s) come back.
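To illustrate the difference, here is a short sketch using the table and column names from the question (the dbo schema is an assumption):

```sql
-- Non-deterministic: without ORDER BY, any row may legally be returned.
SELECT TOP (1) id
FROM   dbo.tableA;

-- Deterministic: the ORDER BY at the same scope defines which row is 'top'.
SELECT TOP (1) id
FROM   dbo.tableA
ORDER BY id DESC;
```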
Given the table size SELECT MAX(id) FROM tableA will not perform well enough
The optimizer is lacking some logic to transform a scalar MAX or MIN aggregate over a partitioned index to a global aggregate over per-partition aggregates. Itzik Ben-Gan explains the limitation and provides a general workaround in this article.
If the highest partition number is known and guaranteed not to change, the workaround to specify a literal partition using the $partition function will work, though it may fail in a non-obvious way if the partitioning strategy changes in future.
This 'solution' works by eliminating all but one partition, resulting in a simple seek on one partition of the index.
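The two approaches can be sketched roughly as follows. This is an illustration under assumptions rather than the exact code from the referenced article: the partition function name pfTableA and the literal partition number 42 are hypothetical, and timeSampled is assumed to be the partitioning column.

```sql
-- Sketch 1: global aggregate over per-partition aggregates.
-- sys.partitions supplies one row per partition of the base table
-- (index_id 0 = heap, 1 = clustered index).
SELECT  MAX(pa.maxId) AS maxId
FROM    sys.partitions AS p
CROSS APPLY
        (
        SELECT  MAX(a.id) AS maxId
        FROM    dbo.tableA AS a
        WHERE   $PARTITION.pfTableA(a.timeSampled) = p.partition_number
        ) AS pa
WHERE   p.object_id = OBJECT_ID(N'dbo.tableA')
        AND p.index_id IN (0, 1);

-- Sketch 2: literal partition number (42 is a placeholder).
-- Only valid while the highest partition number really is 42.
SELECT  MAX(a.id) AS maxId
FROM    dbo.tableA AS a
WHERE   $PARTITION.pfTableA(a.timeSampled) = 42;
```

The first form keeps working when partitions are added, at the cost of one seek per partition. The second touches a single partition but silently returns the wrong answer if the partitioning changes.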
Adding an order by id does not improve performance for some reason
The same optimizer limitation broadly applies to TOP (1) ... ORDER BY. The ORDER BY makes the result deterministic, but does not help produce a more efficient plan in this particular case (but see below).
Implied Index Keys
Your index is on id DESC, timeSampled DESC. In SQL Server 2008 and later, partitioning introduces an extra implied leading key on $partition ASC (it is always ascending, it is not configurable) making the full index key $partition ASC, id DESC, timeSampled DESC.
Since id and timeSampled increase together (though there is nothing in the schema to guarantee this) you could rewrite the query as TOP (1) ... ORDER BY $partition DESC, id DESC. Unfortunately, the DESC keys on your index and the ASC implied leading key $partition mean the index could not be used to scan just one row from the index in order.
If your index keys were instead id ASC, timeSampled ASC the whole index key would be $partition ASC, id ASC, timeSampled ASC. This all-ASC index could be scanned backward, returning just the first row in key order. This row would be guaranteed to have the highest id value in the highest-numbered partition. Given the (unenforced) relationship between id and partition id, this would produce the correct result with an optimal execution plan that reads just a single row.
This 'solution' lacks integrity because the id-timeSampled relationship is not enforced, and you probably do not want to rebuild the nonclustered primary key anyway. Nevertheless, I mention it because it may enhance your understanding of how partitioning interacts with indexes.
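For reference, the rewritten query described above might look like the sketch below. The partition function name pfTableA is hypothetical and timeSampled is assumed to be the partitioning column. Whether the optimizer actually chooses a single-row backward scan depends on the index having all-ASC keys as discussed.

```sql
-- Sketch only: orders by the implied partition key first, then by id.
-- Correct only while id and timeSampled really do increase together.
SELECT  TOP (1) a.id
FROM    dbo.tableA AS a
ORDER BY
        $PARTITION.pfTableA(a.timeSampled) DESC,
        a.id DESC;
```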
Best Answer
Note that the filter on cal.[name] is not sargable. This means that even if that column is indexed, the index will not be used for a seek due to the leading "%" in the predicate.
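A small sketch of the difference, assuming a hypothetical table dbo.Calendar behind the cal alias and made-up search strings:

```sql
-- Not sargable: the leading wildcard prevents an index seek on [name],
-- so every row (or every index entry) has to be examined.
SELECT  cal.[name]
FROM    dbo.Calendar AS cal
WHERE   cal.[name] LIKE N'%holiday%';

-- Sargable: a fixed prefix lets an index on [name] be used for a range seek.
SELECT  cal.[name]
FROM    dbo.Calendar AS cal
WHERE   cal.[name] LIKE N'holiday%';
```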