SQL Server 2014 – How to Select Only Supersets

sql serversql server 2014t-sql

I have two tables (along with a nonclustered index) that can be created with the below commands:

CREATE TABLE GroupTable
(
  GroupKey int NOT NULL PRIMARY KEY, 
  RecordCount int NOT NULL,
  GroupScore float NOT NULL
);

CREATE TABLE RecordTable
(
  RecordKey varchar(10) NOT NULL, 
  GroupKey int NOT NULL,
  PRIMARY KEY(RecordKey, GroupKey)
);

CREATE UNIQUE INDEX ixGroupRecord ON RecordTable(GroupKey, RecordKey);

While technically my tables are slightly different and I am joining on a few other tables, this is a suitable proxy for my situation.

I would like to select all GroupKeys that are not subsets of another GroupKey.
For a given superset, I would like to grab the maximum GroupScore of all its subsets (including itself).
In the instance when a GroupKey contains the same exact RecordKeys as another GroupKey(s), then only one of those GroupKeys is grabbed (it doesn't matter which one).
Any GroupKey that has the same exact RecordKeys as another GroupKey(s) will also have the same GroupScore.
Non-related GroupKeys can have the same score as well.

The following is an example to illustrate what I am asking:

              GroupTable                          RecordTable

GroupKey    RecordCount   GroupScore         RecordKey    GroupKey
------------------------------------         ---------------------
  1              3            6.2                A          1
  29             2            9.8                A          29
  95             3            6.2                A          95
  192            4            7.1                A          192
                                                 B          1
                                                 B          29
                                                 B          95
                                                 B          192
                                                 C          1
                                                 C          95
                                                 D          192
                                                 E          192

I would like the output to be the following:

GroupKey    RecordCount    GroupScore
-------------------------------------
  1              3             9.8
  192            4             9.8

GroupTable has around 75M rows, and RecordTable has around 115M rows; however, after the joins and WHERE predicate, there tends to be around 20k rows on a given day.

I apologize if this question is trivial, but for some reason I'm really struggling with it.

Best Answer

I would like the output to be the following:

 GroupKey    RecordCount    GroupScore
 -------------------------------------
   1              3             9.8
   192            4             7.1

Using correlated subqueries is one way to get the output you want.

In the instance when a GroupKey contains the same exact RecordKeys as another GroupKey(s), then only one of those GroupKeys is grabbed (it doesn't matter which one).

I'm returning the Group with the lowest GroupKey when there is a match, but that is arbitrary as you say it doesn't matter.

test data:

INSERT INTO RecordTable(RecordKey,GroupKey)
VALUES ('A',1)
     , ('A',29)
     , ('A',95)
     , ('A',192)
     , ('B',1)
     , ('B',29)
     , ('B',95)
     , ('B',192)
     , ('C',1)
     , ('C',95)
     , ('D',192)
     , ('E',192);

INSERT INTO GroupTable(GroupKey,RecordCount,GroupScore)
VALUES (1,3,6.2)     -- ABC
     , (29,2,9.8)    -- AB
     , (95,3,6.2)    -- ABC
     , (192,4,7.1);  -- ABDE
GO

query:

SELECT GroupKey
     , RecordCount
     , GroupScore = ( SELECT max(GroupScore)
                      FROM GroupTable g2 
                      WHERE ( SELECT count(*)
                              FROM ( SELECT RecordKey
                                     FROM RecordTable
                                     WHERE GroupKey=g1.GroupKey
                                     UNION
                                     SELECT RecordKey
                                     FROM RecordTable
                                     WHERE GroupKey=g2.GroupKey ) z
                            )=g1.RecordCount )
FROM GroupTable g1
WHERE NOT EXISTS ( SELECT *
                   FROM GroupTable g3
                   WHERE ( SELECT count(*)
                           FROM ( SELECT RecordKey
                                  FROM RecordTable 
                                  WHERE GroupKey=g1.GroupKey 
                                  UNION
                                  SELECT RecordKey 
                                  FROM RecordTable 
                                  WHERE GroupKey=g3.GroupKey ) z )=g3.RecordCount
                         AND ( g3.RecordCount>g1.RecordCount 
                               OR ( g3.RecordCount=g1.RecordCount 
                                    AND g3.GroupKey<g1.GroupKey ) ) );
GO

The subquery in the SELECT gets the highest GroupScore from only those groups that are subsets of this ('g1') group. It achieves this by counting the UNION of the RecordKey's for the 'g1' set and each 'g2' set. If the UNION is larger than the 'g1' set, there must be at least one RecordKey in the 'g2' set without a corresponding RecordKey for the 'g1' set, so the 'g2' set is not a subset and should not be considered for this row.

In the WHERE clause, there are two cases to consider for filtering. In either case, the 'g1' set is only filtered if all the 'g1' RecordKeys are also present in the 'g3' set; and this check is achieved by counting the union again (as per the SELECT clause).

The two cases are: ① the 'g1' set has fewer RecordKeys (g3.RecordCount>g1.RecordCount; in which case we filter), and ② the 'g1' set is identical to the 'g3' set (g3.RecordCount=g1.RecordCount; in which case we arbitrarily choose the set with the lower GroupKey)

output:

/*
|GroupKey|RecordCount|GroupScore|
|-------:|----------:|---------:|
|       1|          3|       9.8|
|     192|          4|       9.8|
*/

dbfiddle here

Related Solutions

Sql-server – How to create a relationship to a non-primary key in SQL Server

Create a unique index or unique constraint on UserName then you can reference it in a FK constraint fine.

Your statement that

Sql Server doesn't allow me to create a relationship on a non primary key column

is incorrect. SQL Server only cares that the column(s) participating in the FK relationship have a unique index defined.

Sql-server – Do I need a separate Id column for this “mapping” table

If you have a one-to-many relationship between Producers and Products (in other words, a product can only belong to one producer), than it would make sense to just put a foreign key reference directly in your Products table:

One-To-Many

create table Producer ( id int identity(1, 1) not null primary key clustered, Name varchar(100) not null ) go create table Product ( id int identity(1, 1) not null, Name varchar(100) not null, ProducerId int not null foreign key references Producer(id) ) go

But if there's a chance that this will be a many-to-many relationship, then your best bet would be to use a Join table.

Many-To-Many

create table Producer
(
    id int identity(1, 1) not null primary key clustered,
    Name varchar(100) not null
)
go

create table Product
(
    id int identity(1, 1) not null primary key clustered,
    Name varchar(100) not null
)
go

create table ProductProducer
(
    ProductId int not null foreign key references Product(id),
    ProducerId int not null foreign key references Producer(id)
)
go

-- adding the primary key also ensures uniqueness
alter table ProductProducer
add constraint PK_ProductProducer 
primary key (ProductId, ProducerId)
go

If you decide to go with the Join table, you wouldn't need to have an additional key, as the combination of ProductId/ProducerId would ultimately be unique. You could use them as a composite key, so you wouldn't need that additional Id field in ProductProducer.

Best Answer

Related Solutions

Sql-server – How to create a relationship to a non-primary key in SQL Server

Sql-server – Do I need a separate Id column for this “mapping” table

Related Question