Sql-server – How to speed up a query that orders by a calculated field

sql server

I will try and give an example – this is not my table structure – I'm simply trying to outline the issue in order to find a solution…

Person
Id, Name

BrothersNames
Id, Name

SistersNames
Id, Name

PersonBrothers (join table)
PersonId, BrotherNameId

PersonSisters (join table)
PersonId, SisterNameId

OK – so imagine this database holds every person from a small country. The database holds a record of the names of everyone's brothers and sisters (it does not map a person to their brother or sister – just their names) so that we can find out statistics about names.

Obviously lots of names are shared so the join tables normalise this for us.

What I want to do is take one user and count how many brothers' names and how many sisters' names they share with every other user in the system, then add those two counts together and order by the total, descending. This would give us a list of the users who have the most brothers' and sisters' names in common with the chosen user.

I'm really only interested in the top ten matches but I think I have to get the whole result set to work out the top ten matches.

Please note that in my actual data a person can have a million brothers or a million sisters. This is where I'm getting performance issues.

This is how I'm calculating the matches for brothers, and I do the same for sisters:

SELECT p.Id, BrothersNamesJoin.Matches
FROM Person p
LEFT JOIN
        (
            -- For every other person, count the brother names shared with @PersonId
            SELECT
            COUNT(*) AS Matches,
            otherpbn.PersonId
            FROM PersonBrothers pbn
            INNER JOIN BrothersNames bn ON pbn.BrotherNameId = bn.Id
            INNER JOIN PersonBrothers otherpbn ON otherpbn.BrotherNameId = bn.Id

            WHERE pbn.PersonId = @PersonId AND pbn.PersonId <> otherpbn.PersonId
            GROUP BY otherpbn.PersonId

        ) AS BrothersNamesJoin ON BrothersNamesJoin.PersonId = p.Id

Please let me know if I should specify more info…
I am using SQL Server 2008, but the problem is probably platform agnostic.

Best Answer

If you don't really need up-to-the-second results, you could just run your query from time to time and cache the results.

If you still need to have real-time data on this (sacrificing insert performance), I would do this:

Since self-joins are not allowed in indexed views, you need to create two copies of each table:

CREATE TABLE personBrother
        (
        personId INT NOT NULL,
        brotherName INT NOT NULL
        )

CREATE TABLE personBrother2
        (
        personId INT NOT NULL,
        brotherName INT NOT NULL
        )

Create an indexed view on their join:

CREATE VIEW
        commonBrothers
WITH SCHEMABINDING
AS
        SELECT  p1.personId AS p1,
                p2.personId AS p2,
                COUNT_BIG(*) AS cnt
        FROM    dbo.personBrother p1
        JOIN    dbo.personBrother2 p2
        ON      p1.brotherName = p2.brotherName
        WHERE   p1.personId < p2.personId
        GROUP BY
                p1.personId, p2.personId

CREATE UNIQUE CLUSTERED INDEX
        ux_commonBrothers_p1_p2
ON      commonBrothers (p1, p2)

CREATE INDEX
        ix_commonBrothers_cnt
ON      commonBrothers (cnt)

Same for sisters.

You have to keep the two copies of each table in sync yourself (write a trigger, or insert/update/delete into both tables, etc.).
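A minimal sketch of the trigger option for the brothers tables follows. The trigger name trg_personBrother_sync is made up for illustration, and the sketch assumes that each (personId, brotherName) pair appears at most once in personBrother; with duplicate rows the DELETE step would remove too much from the copy.

```sql
-- Hypothetical trigger: mirrors every change made to personBrother into personBrother2.
-- Assumes (personId, brotherName) is unique in personBrother.
CREATE TRIGGER trg_personBrother_sync
ON dbo.personBrother
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;

    -- Remove the old versions of deleted or updated rows from the copy.
    DELETE  pb2
    FROM    dbo.personBrother2 pb2
    JOIN    deleted d
    ON      d.personId = pb2.personId
            AND d.brotherName = pb2.brotherName;

    -- Add the new versions of inserted or updated rows to the copy.
    INSERT INTO dbo.personBrother2 (personId, brotherName)
    SELECT  personId, brotherName
    FROM    inserted;
END;
```

A matching trigger would be needed for the sisters tables. Writing to both tables from the application instead avoids the trigger, at the cost of having to remember to do so everywhere.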

Now we can easily get pairs with the most brothers and most sisters:

SELECT  TOP 1 WITH TIES
        *
FROM    commonBrothers
ORDER BY
        cnt DESC

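All we need now is to fetch the greatest sum. Unfortunately, we cannot index a join of these views (this is purely an implementation limitation; there is no theoretical reason it could not be done).

So we prune instead. The pair with the greatest sum cannot have fewer sisters' names in common than the top brothers pair does: any pair with fewer has at most as many brothers' names in common and strictly fewer sisters' names, so its sum is smaller. By the same argument, it cannot have fewer brothers' names in common than the top sisters pair does. So we have this query: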

SELECT  TOP 1 WITH TIES
        cb.p1, cb.p2, cb.cnt + cs.cnt AS totalCnt
FROM    commonBrothers cb
JOIN    commonSisters cs
ON      cs.p1 = cb.p1
        AND cs.p2 = cb.p2
WHERE   cs.cnt >=
        (
        SELECT  MAX(cst.cnt)
        FROM    (
                SELECT  TOP 1 WITH TIES
                        p1, p2
                FROM    commonBrothers 
                ORDER BY
                        cnt DESC
                ) cbt
        JOIN    commonSisters cst
        ON      cst.p1 = cbt.p1
                AND cst.p2 = cbt.p2
        )
        AND cb.cnt >=
        (
        SELECT  MAX(cbt.cnt)
        FROM    (
                SELECT  TOP 1 WITH TIES
                        p1, p2
                FROM    commonSisters
                ORDER BY
                        cnt DESC
                ) cst
        JOIN    commonBrothers cbt
        ON      cbt.p1 = cst.p1
                AND cbt.p2 = cst.p2
        )
ORDER BY
        totalCnt DESC

If the numbers of common brothers and sisters are correlated, this query will be very fast.
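For the narrower case in the question (the ten best matches for one given person), the same views can be queried directly. This is only a sketch: the variable value is a placeholder, and because each pair is stored once with p1 < p2, the person has to be looked up on both sides of the pair.

```sql
DECLARE @PersonId INT = 1;  -- placeholder value for illustration

SELECT  TOP 10
        m.otherId,
        SUM(m.cnt) AS totalCnt
FROM    (
        -- Brothers' names in common, person on either side of the pair
        SELECT  p2 AS otherId, cnt FROM commonBrothers WHERE p1 = @PersonId
        UNION ALL
        SELECT  p1, cnt FROM commonBrothers WHERE p2 = @PersonId
        UNION ALL
        -- Sisters' names in common, person on either side of the pair
        SELECT  p2, cnt FROM commonSisters WHERE p1 = @PersonId
        UNION ALL
        SELECT  p1, cnt FROM commonSisters WHERE p2 = @PersonId
        ) m
GROUP BY
        m.otherId
ORDER BY
        totalCnt DESC;
```

The lookups on p1 can seek on the clustered index (p1, p2). The lookups on p2 cannot, so an additional index on p2 for each view would probably be worth adding if this query is used.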

This solution has two drawbacks:

DML performance: if you insert or delete a record for a name shared by a million persons, the indexed view will get 2M inserts or deletes. This is the price you pay for real-time querying: the kind of data you are asking for cannot be easily indexed.
Pairs that share no brothers' names (or no sisters' names) have no row in the corresponding view. If there's a chance that the top pair has no brothers' names or no sisters' names in common, you should amend the last query a little.
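One way the last query might be amended is sketched below. It is not the answer's own amendment, just a guess at a simple one: a FULL OUTER JOIN keeps pairs that appear in only one of the views and counts the missing side as zero. It drops the pruning conditions, so it will likely read far more rows than the query above.

```sql
SELECT  TOP 1 WITH TIES
        COALESCE(cb.p1, cs.p1) AS p1,
        COALESCE(cb.p2, cs.p2) AS p2,
        -- A pair missing from one view has zero matches on that side
        COALESCE(cb.cnt, 0) + COALESCE(cs.cnt, 0) AS totalCnt
FROM    commonBrothers cb
FULL OUTER JOIN
        commonSisters cs
ON      cs.p1 = cb.p1
        AND cs.p2 = cb.p2
ORDER BY
        totalCnt DESC;
```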

Related Solutions

Sql-server – How to optimize a query that’s running slow on Nested Loops (Inner Join)

The problem appears to be in this part of the code:

JOIN category_link l on l.sku_id IN (SELECT value FROM #Ids) AND
(
    l.category_id = c4.category_id OR
    l.category_id = c5.category_id
)

An OR in a join condition is always suspicious. One suggestion is to split this into two joins, one per category:

LEFT OUTER JOIN
category_link l1 ON l1.sku_id IN (SELECT value FROM #Ids) AND l1.category_id = c4.category_id
LEFT OUTER JOIN
category_link l2 ON l2.sku_id IN (SELECT value FROM #Ids) AND l2.category_id = c5.category_id

You then have to modify the rest of the query to handle this: for instance, use COALESCE(l1.sku_id, l2.sku_id) in the SELECT clause, and filter out rows where neither join found a match.

Sql-server – Find Duplicate Customers

To find rows where at least three out of four of those columns match, you can use:

SELECT  D1.ID, D2.ID
FROM   DATA D1
       JOIN DATA D2
         ON D1.SSN = D2.SSN
            AND D1.ID > D2.ID
            AND 2 <= CASE
                       WHEN D1.FNAME = D2.FNAME THEN 1
                       ELSE 0
                     END +
                     CASE
                       WHEN D1.LNAME = D2.LNAME THEN 1
                       ELSE 0
                     END  +
                     CASE
                       WHEN D1.DOB = D2.DOB THEN 1
                       ELSE 0
                     END
UNION ALL
SELECT D1.ID, D2.ID
FROM   DATA D1
       JOIN DATA D2
         ON D1.DOB = D2.DOB
         AND  D1.FNAME = D2.FNAME
         AND  D1.LNAME = D2.LNAME
         AND D1.SSN <> D2.SSN
         AND D1.ID > D2.ID

The top branch gets all rows where the SSNs are the same and at least two of the three other columns are the same. The join on SSN is likely to be pretty selective in itself.

That leaves just one other possible three-column combination, which is dealt with by the second branch.

Both branches of the UNION ALL have an equi-join, so it should perform better than a join with some complicated OR condition.
