SQL Server 2014 – How to Select Only Supersets

sql serversql server 2014t-sql

I have two tables (along with a nonclustered index) that can be created with the below commands:

CREATE TABLE GroupTable
(
  GroupKey int NOT NULL PRIMARY KEY, 
  RecordCount int NOT NULL,
  GroupScore float NOT NULL
);

CREATE TABLE RecordTable
(
  RecordKey varchar(10) NOT NULL, 
  GroupKey int NOT NULL,
  PRIMARY KEY(RecordKey, GroupKey)
);

CREATE UNIQUE INDEX ixGroupRecord ON RecordTable(GroupKey, RecordKey);

While technically my tables are slightly different and I am joining on a few other tables, this is a suitable proxy for my situation.

  • I would like to select all GroupKeys that are not subsets of another GroupKey.
  • For a given superset, I would like to grab the maximum GroupScore of all its subsets (including itself).
  • In the instance when a GroupKey contains the same exact RecordKeys as another GroupKey(s), then only one of those GroupKeys is grabbed (it doesn't matter which one).
  • Any GroupKey that has the same exact RecordKeys as another GroupKey(s) will also have the same GroupScore.
  • Non-related GroupKeys can have the same score as well.

The following is an example to illustrate what I am asking:

              GroupTable                          RecordTable

GroupKey    RecordCount   GroupScore         RecordKey    GroupKey
------------------------------------         ---------------------
  1              3            6.2                A          1
  29             2            9.8                A          29
  95             3            6.2                A          95
  192            4            7.1                A          192
                                                 B          1
                                                 B          29
                                                 B          95
                                                 B          192
                                                 C          1
                                                 C          95
                                                 D          192
                                                 E          192

I would like the output to be the following:

GroupKey    RecordCount    GroupScore
-------------------------------------
  1              3             9.8
  192            4             9.8

GroupTable has around 75M rows, and RecordTable has around 115M rows; however, after the joins and WHERE predicate, there tends to be around 20k rows on a given day.

I apologize if this question is trivial, but for some reason I'm really struggling with it.

Best Answer

I would like the output to be the following:

 GroupKey    RecordCount    GroupScore
 -------------------------------------
   1              3             9.8
   192            4             7.1

Using correlated subqueries is one way to get the output you want.

  • In the instance when a GroupKey contains the same exact RecordKeys as another GroupKey(s), then only one of those GroupKeys is grabbed (it doesn't matter which one).

I'm returning the Group with the lowest GroupKey when there is a match, but that is arbitrary as you say it doesn't matter.

test data:

INSERT INTO RecordTable(RecordKey,GroupKey)
VALUES ('A',1)
     , ('A',29)
     , ('A',95)
     , ('A',192)
     , ('B',1)
     , ('B',29)
     , ('B',95)
     , ('B',192)
     , ('C',1)
     , ('C',95)
     , ('D',192)
     , ('E',192);

INSERT INTO GroupTable(GroupKey,RecordCount,GroupScore)
VALUES (1,3,6.2)     -- ABC
     , (29,2,9.8)    -- AB
     , (95,3,6.2)    -- ABC
     , (192,4,7.1);  -- ABDE
GO

query:

SELECT GroupKey
     , RecordCount
     , GroupScore = ( SELECT max(GroupScore)
                      FROM GroupTable g2 
                      WHERE ( SELECT count(*)
                              FROM ( SELECT RecordKey
                                     FROM RecordTable
                                     WHERE GroupKey=g1.GroupKey
                                     UNION
                                     SELECT RecordKey
                                     FROM RecordTable
                                     WHERE GroupKey=g2.GroupKey ) z
                            )=g1.RecordCount )
FROM GroupTable g1
WHERE NOT EXISTS ( SELECT *
                   FROM GroupTable g3
                   WHERE ( SELECT count(*)
                           FROM ( SELECT RecordKey
                                  FROM RecordTable 
                                  WHERE GroupKey=g1.GroupKey 
                                  UNION
                                  SELECT RecordKey 
                                  FROM RecordTable 
                                  WHERE GroupKey=g3.GroupKey ) z )=g3.RecordCount
                         AND ( g3.RecordCount>g1.RecordCount 
                               OR ( g3.RecordCount=g1.RecordCount 
                                    AND g3.GroupKey<g1.GroupKey ) ) );
GO

The subquery in the SELECT gets the highest GroupScore from only those groups that are subsets of this ('g1') group. It achieves this by counting the UNION of the RecordKey's for the 'g1' set and each 'g2' set. If the UNION is larger than the 'g1' set, there must be at least one RecordKey in the 'g2' set without a corresponding RecordKey for the 'g1' set, so the 'g2' set is not a subset and should not be considered for this row.

In the WHERE clause, there are two cases to consider for filtering. In either case, the 'g1' set is only filtered if all the 'g1' RecordKeys are also present in the 'g3' set; and this check is achieved by counting the union again (as per the SELECT clause).

The two cases are: ① the 'g1' set has fewer RecordKeys (g3.RecordCount>g1.RecordCount; in which case we filter), and ② the 'g1' set is identical to the 'g3' set (g3.RecordCount=g1.RecordCount; in which case we arbitrarily choose the set with the lower GroupKey)

output:

/*
|GroupKey|RecordCount|GroupScore|
|-------:|----------:|---------:|
|       1|          3|       9.8|
|     192|          4|       9.8|
*/

dbfiddle here