Sql-server – MSSQL Select all parent rows for which exists a combination of children rows with desired value

optimizationquery-performancerowselectsql server

I would like to select all parent rows for which exists a combination of children rows with desired value. The easiest approach is to use multiple exists clauses, but this is not optimal – anyone knows better solution? Here is a SQL fiddle: http://sqlfiddle.com/#!6/8da76/5

Schema:

CREATE TABLE [dbo].[Parents](
    [Id] [bigint] IDENTITY(1,1) NOT NULL,
    [GroupId] [uniqueidentifier] NOT NULL
);

INSERT INTO [Parents] ([GroupId]) VALUES ('00000000-0000-0000-0000-000000000001');
INSERT INTO [Parents] ([GroupId]) VALUES ('00000000-0000-0000-0000-000000000002');
INSERT INTO [Parents] ([GroupId]) VALUES ('00000000-0000-0000-0000-000000000003');
INSERT INTO [Parents] ([GroupId]) VALUES ('00000000-0000-0000-0000-000000000004');
INSERT INTO [Parents] ([GroupId]) VALUES ('00000000-0000-0000-0000-000000000005');

CREATE TABLE [dbo].[Childrens](
    [Id] [bigint] IDENTITY(1,1) NOT NULL,
    [ParentId] [bigint] NOT NULL,
    [TypeId] [int] NOT NULL
);

INSERT INTO [Childrens] ([ParentId], [TypeId]) VALUES (1, 1);
INSERT INTO [Childrens] ([ParentId], [TypeId]) VALUES (1, 2);
INSERT INTO [Childrens] ([ParentId], [TypeId]) VALUES (1, 3);
INSERT INTO [Childrens] ([ParentId], [TypeId]) VALUES (1, 4);
INSERT INTO [Childrens] ([ParentId], [TypeId]) VALUES (1, 5);

INSERT INTO [Childrens] ([ParentId], [TypeId]) VALUES (2, 1);
INSERT INTO [Childrens] ([ParentId], [TypeId]) VALUES (2, 4);
INSERT INTO [Childrens] ([ParentId], [TypeId]) VALUES (2, 5);

INSERT INTO [Childrens] ([ParentId], [TypeId]) VALUES (3, 4);
INSERT INTO [Childrens] ([ParentId], [TypeId]) VALUES (3, 5);
INSERT INTO [Childrens] ([ParentId], [TypeId]) VALUES (3, 6);

INSERT INTO [Childrens] ([ParentId], [TypeId]) VALUES (4, 2);
INSERT INTO [Childrens] ([ParentId], [TypeId]) VALUES (4, 6);

INSERT INTO [Childrens] ([ParentId], [TypeId]) VALUES (5, 1);
INSERT INTO [Childrens] ([ParentId], [TypeId]) VALUES (5, 2);
INSERT INTO [Childrens] ([ParentId], [TypeId]) VALUES (5, 3);

Sample query to select all parent rows for which exists a combination of children rows with TypeId = 1 and 2 (query should return parent with Id = 1 and 5):

SELECT *
  FROM Parents p
  WHERE EXISTS( 
        SELECT *
          FROM Childrens c
          WHERE p.Id = c.ParentId
            AND c.TypeId = 1 )
    AND EXISTS( 
        SELECT *
          FROM Childrens c
          WHERE p.Id = c.ParentId
            AND c.TypeId = 2 );

Sample query to select all parent rows for which exists children row with TypeId = 4, 5 and 6 (should return parent with Id = 3)

SELECT *
  FROM Parents p
  WHERE EXISTS( 
        SELECT *
          FROM Childrens c
          WHERE p.Id = c.ParentId
            AND c.TypeId = 4 )
    AND EXISTS( 
        SELECT *
          FROM Childrens c
          WHERE p.Id = c.ParentId
            AND c.TypeId = 5 )
    AND EXISTS( 
        SELECT *
          FROM Childrens c
          WHERE p.Id = c.ParentId
            AND c.TypeId = 6 );

Number of conditions for children TypeId is variable.

Best Regards!

Best Answer

Below is one method. In the outer WHERE clause, specify the number of types included in the IN clause. An composite index on the Childrens table ParentId and TypeId will help optimize the query.

SELECT  *
FROM    Parents p
WHERE   2 = ( SELECT    COUNT(DISTINCT c.TypeId)
              FROM      Childrens c
              WHERE     p.Id = c.ParentId
                        AND c.TypeId IN ( 1, 2 )
            );

SELECT  *
FROM    Parents p
WHERE   3 = ( SELECT    COUNT(DISTINCT c.TypeId)
              FROM      Childrens c
              WHERE     p.Id = c.ParentId
                        AND c.TypeId IN ( 4, 5, 6 )
            );

EDIT: The original DDL in the question didn't include constraints or indexes. These aren't relevant for query functionality but are important to compare performance of alternative solutions.

Assuming the IDENTITY columns in both tables are also the primary keys and the tables are related, I added the constraints below.

ALTER TABLE Parents
    ADD CONSTRAINT PK_Parents PRIMARY KEY CLUSTERED(Id);

ALTER TABLE Childrens
    ADD CONSTRAINT PK_Childrens PRIMARY KEY CLUSTERED(Id);

ALTER TABLE Childrens
    ADD CONSTRAINT FK_Childrens_Parents FOREIGN KEY (ParentId) REFERENCES Parents(Id);

I added candidate indexes on columns specified on JOIN or WHERE clauses and specified UNIQUE for those indexes that include the primary key column. Indexes not deemed to be used can be dropped afterward.

CREATE UNIQUE INDEX idx1 ON Parents (Id, GroupId);
CREATE INDEX idx1 ON Childrens (ParentId, TypeId);
CREATE INDEX idx2 ON Childrens (TypeId, ParentId);

Finally, I ran the original query in my answer in SSMS along with the query below which Tpsamw1 proposed in a comment (but without ORDER BY to level the playing field). SET STATISTICS IO ON and including the actual execution plan:

SELECT  p.Id
      , p.GroupId
FROM    Parents p
        INNER JOIN Childrens c ON c.ParentId = p.Id
WHERE   c.TypeId IN ( 4, 5, 6 )
GROUP BY p.Id
      , p.GroupId
HAVING  COUNT(DISTINCT c.TypeId) = 3;

SSMS reported comparable performance (50% of batch cost for each query) but that doesn't tell the whole story. The STATISTICS IO showed a higher number of logical reads for my original query, which I believe a better indicator of performance. This suggests Tpsamw1's query performs better. Below are the stats:

My original query:

Table 'Childrens'. Scan count 15, logical reads 30, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Parents'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Tpsamw1 query:

Table 'Parents'. Scan count 4, logical reads 8, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Childrens'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

I then loaded additional data by copying the original test data and update stats:

INSERT INTO Parents SELECT NEWID() FROM Parents WHERE Id = 1;
INSERT INTO Childrens SELECT SCOPE_IDENTITY(), TypeId FROM Childrens WHERE ParentId = 1;
INSERT INTO Parents SELECT NEWID() FROM Parents WHERE Id = 2;
INSERT INTO Childrens SELECT SCOPE_IDENTITY(), TypeId FROM Childrens WHERE ParentId = 2;
INSERT INTO Parents SELECT NEWID() FROM Parents WHERE Id = 3;
INSERT INTO Childrens SELECT SCOPE_IDENTITY(), TypeId FROM Childrens WHERE ParentId = 3;
INSERT INTO Parents SELECT NEWID() FROM Parents WHERE Id = 4;
INSERT INTO Childrens SELECT SCOPE_IDENTITY(), TypeId FROM Childrens WHERE ParentId = 4;
INSERT INTO Parents SELECT NEWID() FROM Parents WHERE Id = 5;
INSERT INTO Childrens SELECT SCOPE_IDENTITY(), TypeId FROM Childrens WHERE ParentId = 5;
GO 1000

UPDATE STATISTICS dbo.Childrens WITH FULLSCAN;
UPDATE STATISTICS dbo.Parents WITH FULLSCAN;
GO

Running the queries again shows execution plan changes to both queries. Both queries yielded the same number of logical IOs, indicating performance is be about the same. The only execution plan difference was an inner merge join versus a right outer merge join.

Table 'Parents'. Scan count 1, logical reads 21, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Childrens'. Scan count 1, logical reads 54, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.