Sql-server – Left Joining Tables with tables that have subquery aggregate

sql server, sql-server-2012

I'm trying to list every personID in the Person table while left joining it to the POSEligibility table. A person may have no row in POSEligibility or, if they do, may have several, because a row is created for each startDate and new start dates are added each year for those who are eligible. That is why I'm using the MAX aggregate on startDate: I only want the most recent date. If I remove the subquery, I get all my personIDs from the Person table, with NULLs for those personIDs that don't have a startDate, as expected. But I also get every startDate that each personID has, since of course the aggregate is missing. Any ideas?

SELECT
  per.personID
 ,id.lastName
 ,id.firstName
 ,id.middleName
 ,CONVERT(VARCHAR(10), pos1.startDate, 101) AS StartDate
 ,cal.name
 ,en.grade
 ,pos1.eligibility
FROM dbo.Person per
LEFT OUTER JOIN dbo.POSEligibility pos1
  ON per.personID = pos1.personID
INNER JOIN dbo.[Identity] id
  ON per.personID = id.personID
INNER JOIN dbo.Enrollment en
  ON per.personID = en.personID
INNER JOIN dbo.Calendar cal
  ON en.calendarID = cal.calendarID
INNER JOIN (SELECT
    pos2.personID
   ,MAX(pos2.startDate) AS startdate
  FROM dbo.POSEligibility pos2
  GROUP BY pos2.personID) pos2
  ON pos2.personID = pos1.personID
    AND pos2.startdate = pos1.startDate
WHERE en.grade = '01'
AND cal.name LIKE '%BES'
AND en.active = '1'
AND en.endYear = '2017'
GROUP BY per.personID
        ,pos1.startDate
        ,id.lastName
        ,id.firstName
        ,id.middleName
        ,cal.name
        ,en.grade
        ,pos1.eligibility

Best Answer

SELECT per.personID,
       id.lastName,
       id.firstName,
       id.middleName,
       pos.StartDate,
       cal.[name],
       en.grade,
       pos.eligibility
FROM dbo.Person per
     INNER JOIN dbo.[Identity] id ON per.personID = id.personID
     INNER JOIN dbo.Enrollment en ON per.personID = en.personID
     INNER JOIN dbo.Calendar cal ON en.calendarID = cal.calendarID
     OUTER APPLY
(
    SELECT TOP 1 p.eligibility,
                 StartDate = CONVERT( VARCHAR(10), p.startDate, 101)
    FROM dbo.POSEligibility AS p
    WHERE p.personID = per.personID
    ORDER BY p.startDate DESC
) AS pos
WHERE en.grade = '01'
      AND cal.[name] LIKE '%BES'
      AND en.active = '1'
      AND en.endYear = '2017';
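
For context, the query in the question loses people without eligibility rows because its INNER JOIN to the MAX(startDate) subquery compares against pos1 columns. Those columns are NULL for unmatched people, so the join discards them and the LEFT OUTER JOIN effectively becomes an inner join. If you prefer to keep a LEFT JOIN rather than use OUTER APPLY, one possible alternative is to number each person's rows and join only to the first. This is a sketch under the same table and column names as the question; the alias rn is introduced here for illustration.

```sql
SELECT per.personID,
       id.lastName,
       id.firstName,
       id.middleName,
       CONVERT(VARCHAR(10), pos.startDate, 101) AS StartDate,
       cal.[name],
       en.grade,
       pos.eligibility
FROM dbo.Person per
     INNER JOIN dbo.[Identity] id ON per.personID = id.personID
     INNER JOIN dbo.Enrollment en ON per.personID = en.personID
     INNER JOIN dbo.Calendar cal ON en.calendarID = cal.calendarID
     LEFT OUTER JOIN
(
    -- rn = 1 marks the most recent startDate for each person
    SELECT p.personID,
           p.startDate,
           p.eligibility,
           rn = ROW_NUMBER() OVER (PARTITION BY p.personID
                                   ORDER BY p.startDate DESC)
    FROM dbo.POSEligibility AS p
) AS pos ON pos.personID = per.personID
        AND pos.rn = 1   -- filter in the ON clause so unmatched people are kept
WHERE en.grade = '01'
      AND cal.[name] LIKE '%BES'
      AND en.active = '1'
      AND en.endYear = '2017';
```

Putting the rn = 1 test in the ON clause rather than the WHERE clause is what preserves the rows with no eligibility entry.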

Note that the filter on cal.[name] is not sargable. This means that even if that column is indexed, the index will not be used for a seek due to the leading "%" in the predicate.
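
If that filter turns out to matter for performance, one option is to index a computed column that holds the suffix, so the predicate becomes an equality. This is only a sketch: the column name nameSuffix and the index name are hypothetical, and it assumes the suffix of interest is always exactly three characters.

```sql
-- Hypothetical computed column holding the last three characters of name
ALTER TABLE dbo.Calendar
    ADD nameSuffix AS RIGHT([name], 3) PERSISTED;
GO  -- batch separator (SSMS / sqlcmd) so the new column is visible below

CREATE INDEX ix_Calendar_nameSuffix
ON dbo.Calendar (nameSuffix);
GO

-- The filter can then be written as a sargable equality
SELECT cal.calendarID,
       cal.[name]
FROM dbo.Calendar AS cal
WHERE cal.nameSuffix = 'BES';
```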

Related Solutions

Sql-server – How to speed up a query that orders by a calculated field

If you don't really need up-to-the-second results, you could just run your query from time to time and cache the results.

If you still need to have real-time data on this (sacrificing insert performance), I would do this:

Since self-joins are not allowed in indexed views, you need to create two copies of each table:

CREATE TABLE personBrother
        (
        personId INT NOT NULL,
        brotherName INT NOT NULL
        )

CREATE TABLE personBrother2
        (
        personId INT NOT NULL,
        brotherName INT NOT NULL
        )

Create an indexed view on their join:

CREATE VIEW
        commonBrothers
WITH SCHEMABINDING
AS
        SELECT  p1.personId AS p1,
                p2.personId AS p2,
                COUNT_BIG(*) AS cnt
        FROM    dbo.personBrother p1
        JOIN    dbo.personBrother2 p2
        ON      p1.brotherName = p2.brotherName
        WHERE   p1.personId < p2.personId
        GROUP BY
                p1.personId, p2.personId

CREATE UNIQUE CLUSTERED INDEX
        ux_commonBrothers_p1_p2
ON      commonBrothers (p1, p2)

CREATE INDEX
        ix_commonBrothers_cnt
ON      commonBrothers (cnt)

Same for sisters.

You have to keep the two copies of each table in sync yourself (write a trigger, or insert, update and delete in both tables, etc.).
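
A minimal sketch of the trigger option, assuming that (personId, brotherName) is unique in personBrother; the trigger name is made up for illustration. Without that uniqueness, deleting one of several duplicate rows would remove all matching rows from the copy.

```sql
CREATE TRIGGER trg_personBrother_sync   -- hypothetical name
ON dbo.personBrother
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;

    -- Remove the old versions of deleted or updated rows from the copy
    DELETE pb2
    FROM dbo.personBrother2 AS pb2
    JOIN deleted AS d
      ON d.personId = pb2.personId
     AND d.brotherName = pb2.brotherName;

    -- Add the new versions of inserted or updated rows to the copy
    INSERT INTO dbo.personBrother2 (personId, brotherName)
    SELECT i.personId,
           i.brotherName
    FROM inserted AS i;
END;
```

The sisters tables would need an equivalent trigger.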

Now we can easily get pairs with the most brothers and most sisters:

SELECT  TOP 1 WITH TIES
        *
FROM    commonBrothers
ORDER BY
        cnt DESC

All we need now is to fetch the greatest sum. Unfortunately, we cannot index a join of these views (this is purely an implementation limitation; there is no theoretical reason for it).

Instead, we can rely on the following observation: the pair with the greatest sum cannot have fewer common brothers than the pair with the most common sisters has, and the same holds the other way round. That gives us this query:

SELECT  TOP 1 WITH TIES
        cb.p1, cb.p2, cb.cnt + cs.cnt AS totalCnt
FROM    commonBrothers cb
JOIN    commonSisters cs
ON      cs.p1 = cb.p1
        AND cs.p2 = cb.p2
WHERE   cs.cnt >=
        (
        SELECT  MAX(cst.cnt)
        FROM    (
                SELECT  TOP 1 WITH TIES
                        p1, p2
                FROM    commonBrothers 
                ORDER BY
                        cnt DESC
                ) cbt
        JOIN    commonSisters cst
        ON      cst.p1 = cbt.p1
                AND cst.p2 = cbt.p2
        )
        AND cb.cnt >=
        (
        SELECT  MAX(cbt.cnt)
        FROM    (
                SELECT  TOP 1 WITH TIES
                        p1, p2
                FROM    commonSisters
                ORDER BY
                        cnt DESC
                ) cst
        JOIN    commonBrothers cbt
        ON      cbt.p1 = cst.p1
                AND cbt.p2 = cst.p2
        )
ORDER BY
        totalCnt DESC

If the numbers of common brothers and sisters are correlated, this query will be very fast.

This solution has two drawbacks:

DML performance: if you insert or delete a record for a name shared by a million brothers, the indexed view will take about 2M inserts or deletes. This is the price you pay for a real-time query: the kind of data you are asking for cannot be easily indexed.
Persons with 0 brothers or 0 sisters will not be indexed. If there is a chance that the top pair has no common brothers or no common sisters, you should amend the last query a little.
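
One possible amendment for the second drawback is to replace the inner join with a full join and treat a missing count as zero. This is a sketch rather than a tuned query: it drops the two pruning predicates, so it is correct for pairs that appear in only one view but is likely to be slower than the query above.

```sql
SELECT TOP 1 WITH TIES
       COALESCE(cb.p1, cs.p1) AS p1,
       COALESCE(cb.p2, cs.p2) AS p2,
       -- a pair missing from one view has zero common siblings of that kind
       COALESCE(cb.cnt, 0) + COALESCE(cs.cnt, 0) AS totalCnt
FROM commonBrothers cb
FULL JOIN commonSisters cs
ON   cs.p1 = cb.p1
     AND cs.p2 = cb.p2
ORDER BY
     totalCnt DESC;
```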

Sql-server – Why does selecting top 1 from composite index DESC also used to partition by month not select the top value

Basically seems to be picking the MAX id from a partition, not max across all partitions

Writing TOP (1) without an ORDER BY clause to define which row is 'top' means the query processor is logically free to return any row from the set. The query plan selected by the optimizer happens to return a particular row (highest id from the first partition) but you cannot rely on this, even if it were a useful result.

Whenever you use TOP you should always specify an ORDER BY at the same scope to produce deterministic behaviour - unless you really do not care which row(s) come back.
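
As a minimal illustration, using the tableA and id names from the question (the dbo schema is an assumption):

```sql
-- Non-deterministic: any row of the set may legally be returned
SELECT TOP (1) id
FROM dbo.tableA;

-- Deterministic: ORDER BY at the same scope defines which row is 'top'
SELECT TOP (1) id
FROM dbo.tableA
ORDER BY id DESC;
```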

Given the table size SELECT MAX(id) FROM tableA will not perform well enough

The optimizer is lacking some logic to transform a scalar MAX or MIN aggregate over a partitioned index to a global aggregate over per-partition aggregates. Itzik Ben-Gan explains the limitation and provides a general workaround in this article.
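
One way to write the "global aggregate over per-partition aggregates" idea by hand is to drive the query from sys.partitions and compute the aggregate once per partition. This is a sketch under assumptions: pfMonthly is a hypothetical name for the partition function, timeSampled is assumed to be the partitioning column, and the dbo schema is assumed.

```sql
SELECT MAX(pm.maxId) AS maxId              -- global aggregate
FROM sys.partitions AS p
CROSS APPLY
(
    -- per-partition aggregate; the predicate restricts work to one partition
    SELECT MAX(t.id) AS maxId
    FROM dbo.tableA AS t
    WHERE $PARTITION.pfMonthly(t.timeSampled) = p.partition_number
) AS pm
WHERE p.object_id = OBJECT_ID(N'dbo.tableA')
  AND p.index_id IN (0, 1);                -- heap or clustered index: one row per partition
```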

If the highest partition number is known and guaranteed not to change, the workaround to specify a literal partition using the $partition function will work, though it may fail in a non-obvious way if the partitioning strategy changes in future.

This 'solution' works by eliminating all but one partition, resulting in a simple seek on one partition of the index.
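
The literal-partition form might look like the following sketch. The partition function name pfMonthly and the partition number 42 are placeholders, not values taken from the question.

```sql
SELECT MAX(t.id) AS maxId
FROM dbo.tableA AS t
WHERE $PARTITION.pfMonthly(t.timeSampled) = 42;  -- 42 = assumed highest partition number
```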

Adding an order by id does not improve performance for some reason

The same optimizer limitation broadly applies to TOP (1) ... ORDER BY. The ORDER BY makes the result deterministic, but does not help produce a more efficient plan in this particular case (but see below).

Implied Index Keys

Your index is on id DESC, timeSampled DESC. In SQL Server 2008 and later, partitioning introduces an extra implied leading key on $partition ASC (it is always ascending, it is not configurable) making the full index key $partition ASC, id DESC, timeSampled DESC.

Since id and timeSampled increase together (though there is nothing in the schema to guarantee this) you could rewrite the query as TOP (1) ... ORDER BY $partition DESC, id DESC. Unfortunately, the combination of DESC keys on your index and the ASC implied leading key on $partition means the index cannot be scanned in that order to return just one row.

If your index keys were instead id ASC, timeSampled ASC the whole index key would be $partition ASC, id ASC, timeSampled ASC. This all-ASC index could be scanned backward, returning just the first row in key order. This row would be guaranteed to have the highest id value in the highest-numbered partition. Given the (unenforced) relationship between id and partition id, this would produce the correct result with an optimal execution plan that reads just a single row.

This 'solution' lacks integrity because the id-timeSampled relationship is not enforced, and you probably do not want to rebuild the nonclustered primary key anyway. Nevertheless, I mention it because it may enhance your understanding of how partitioning interacts with indexes.
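
For illustration only, the all-ASC variant described above could be sketched like this. The index name and the partition function name pfMonthly are hypothetical, and the query returns the right answer only under the unenforced assumption that id and timeSampled increase together.

```sql
-- With no ON clause, an index created on a partitioned table is aligned
-- with the table's partition scheme by default
CREATE INDEX ix_tableA_id_timeSampled
ON dbo.tableA (id ASC, timeSampled ASC);

-- Order by the implied leading key (partition number) and then id, both DESC,
-- so an all-ASC index can be read backward and stop after one row
SELECT TOP (1) t.id
FROM dbo.tableA AS t
ORDER BY $PARTITION.pfMonthly(t.timeSampled) DESC,
         t.id DESC;
```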
