I have two tables (along with a nonclustered index) that can be created with the below commands:
CREATE TABLE GroupTable
(
GroupKey int NOT NULL PRIMARY KEY,
RecordCount int NOT NULL,
GroupScore float NOT NULL
);
CREATE TABLE RecordTable
(
RecordKey varchar(10) NOT NULL,
GroupKey int NOT NULL,
PRIMARY KEY(RecordKey, GroupKey)
);
CREATE UNIQUE INDEX ixGroupRecord ON RecordTable(GroupKey, RecordKey);
While technically my tables are slightly different and I am joining on a few other tables, this is a suitable proxy for my situation.
- I would like to select all
GroupKeys
that are not subsets of anotherGroupKey
. - For a given superset, I would like to grab the maximum
GroupScore
of all its subsets (including itself). - In the instance when a
GroupKey
contains the same exactRecordKeys
as anotherGroupKey(s)
, then only one of thoseGroupKeys
is grabbed (it doesn't matter which one). - Any
GroupKey
that has the same exactRecordKeys
as anotherGroupKey(s)
will also have the sameGroupScore
. - Non-related
GroupKeys
can have the same score as well.
The following is an example to illustrate what I am asking:
GroupTable RecordTable
GroupKey RecordCount GroupScore RecordKey GroupKey
------------------------------------ ---------------------
1 3 6.2 A 1
29 2 9.8 A 29
95 3 6.2 A 95
192 4 7.1 A 192
B 1
B 29
B 95
B 192
C 1
C 95
D 192
E 192
I would like the output to be the following:
GroupKey RecordCount GroupScore
-------------------------------------
1 3 9.8
192 4 9.8
GroupTable
has around 75M rows, and RecordTable
has around 115M rows; however, after the joins and WHERE
predicate, there tends to be around 20k rows on a given day.
I apologize if this question is trivial, but for some reason I'm really struggling with it.
Best Answer
Using correlated subqueries is one way to get the output you want.
I'm returning the Group with the lowest GroupKey when there is a match, but that is arbitrary as you say it doesn't matter.
test data:
query:
The subquery in the SELECT gets the highest
GroupScore
from only those groups that are subsets of this ('g1') group. It achieves this by counting the UNION of theRecordKey
's for the 'g1' set and each 'g2' set. If the UNION is larger than the 'g1' set, there must be at least oneRecordKey
in the 'g2' set without a correspondingRecordKey
for the 'g1' set, so the 'g2' set is not a subset and should not be considered for this row.In the WHERE clause, there are two cases to consider for filtering. In either case, the 'g1' set is only filtered if all the 'g1'
RecordKey
s are also present in the 'g3' set; and this check is achieved by counting the union again (as per the SELECT clause).The two cases are: ① the 'g1' set has fewer
RecordKey
s (g3.RecordCount>g1.RecordCount
; in which case we filter), and ② the 'g1' set is identical to the 'g3' set (g3.RecordCount=g1.RecordCount
; in which case we arbitrarily choose the set with the lowerGroupKey
)output:
dbfiddle here