I will try and give an example – this is not my table structure – I'm simply trying to outline the issue in order to find a solution…
Person
Id, Name
BrothersNames
Id, Name
SistersNames
Id, Name
PersonBrothers (join table)
PersonId, BrotherNameId
PersonSisters (join table)
PersonId, SisterNameId
OK – so imagine this database holds every person from a small country. The database holds a record of the names of everyone's brothers and sisters (it does not map a person to their brother or sister – just their names) so that we can find out statistics about names.
Obviously lots of names are shared so the join tables normalise this for us.
What I want to do is take one user and find out the number of matches of brother's names and number of matches of sister's names with every other user in the system, then add those two matches together and order by that descending. So this would give us a list of users who have the most number of brothers and sister's names in common.
I'm really only interested in the top ten matches but I think I have to get the whole result set to work out the top ten matches.
Please note that in my actual data a person can have a million brothers or a milllion sisters. This is where I'm getting performance issues.
This is how I'm calculating the matches for brothers and I do the same for sisters
select p.id, matches
FROM Person p
LEFT JOIN
(
SELECT
COUNT(*) AS Matches,
pbn.PersonId
FROM PersonBrothersNames pbn
INNER JOIN Brothersnames bn on pbn.BrothernameId =bn.Id
inner join PersonBrothersName otherpbn on otherpbn.BrothernameId = bn.Id
WHERE pbn.PersonId= @PersonId and pbn.PersonId <> otherpbn.personid
GROUP BY pbn.PersonId
) As BrothersNamesJoin ON BrothersNamesJoin.Person = p.Id
Please let me know if I should specify more info…
I am using SQL Server 2008 but is probably platform agnostic..
Best Answer
If you don't really need zero-second actuality, you could just run your query time to time and cache the results.
If you still need to have real-time data on this (sacrificing insert performance), I would do this:
Since self-joins are not allowed in indexed views, you need to create two copies of each table:
Create an indexed view on their join:
Same for sisters.
You should manually maintain these tables to have same data (write a trigger, insert/update/delete both etc).
Now we can easily get pairs with the most brothers and most sisters:
All we need now is to fetch a greatest sum. Unfortunately, we cannot index a join of these views (it's a pure implementation flaw, there's no theoretical limitation for this).
So we need to do the following: the top pair cannot have less brothers than the top sis pair. Same holds for the sisters. So we have this query:
If the numbers of common brothers and sisters are correlated, this query will be very fast.
This solution has two drawbacks:
DML
performance: if you insert or delete a record for a name shared by million brothers, the indexed view will get 2M inserts or delete. This is the price you pay for real-time query: the kind of data you are asking for cannot be easily indexed.Persons with 0 brothers or 0 sisters will not be indexed. If there's a chance that top pair will not have brothers or sisters, you should amend the last query a little.