SQL Server – How to make this nested query more efficient

performance, query, query-performance, sql-server

I have 3 tables: Room, Conference, and Participant. Room has many Conferences, and Conference has many Participants. I need my query to return the fields from Room, along with the number of Conferences associated with each Room and the total number of Participants across those Conferences. Here's a cut-down version of the SELECT query I wrote to get this information; first, I just selected the room ID:

SELECT TOP(1000)
  rm.[Id]
FROM
  [Room] rm
LEFT JOIN (
  SELECT
    conf.[Id] AS [ConferenceId],
    MIN(conf.[Name]) AS [ConferenceName],
    MIN(conf.[RoomId]) AS [RoomId],
    COUNT(part.[Id]) AS CalcConferenceParticipantCount
  FROM
    [Conference] conf
  LEFT JOIN
    [Participant] part on part.[ConferenceId] = conf.[Id]
  GROUP BY
    conf.[Id]
  ) confData ON confData.[RoomId] = rm.[Id]
GROUP BY
  rm.[Id]

This was very fast, as SQL Server was able to just pull the data from Room and pretty much ignore the subquery (see Trials 1–4 in the image below). Then I added the ConferenceName field from the subquery, as well as a count of the conferences per room:

SELECT TOP(1000)
  rm.[Id],
  COUNT(confData.[ConferenceId]) AS CalcRoomConferenceCount,
  MIN(confData.[ConferenceName])
FROM
  [Room] rm
LEFT JOIN (
  SELECT
    conf.[Id] AS [ConferenceId],
    MIN(conf.[Name]) AS [ConferenceName],
    MIN(conf.[RoomId]) AS [RoomId],
    COUNT(part.[Id]) AS CalcConferenceParticipantCount
  FROM
    [Conference] conf
  LEFT JOIN
    [Participant] part on part.[ConferenceId] = conf.[Id]
  GROUP BY
    conf.[Id]
  ) confData ON confData.[RoomId] = rm.[Id]
GROUP BY
  rm.[Id]

This slowed the query down considerably, by a factor of about 100 (see Trials 5–7 in the image below). I then added the participant count from the subquery, meaning there were two levels of aggregate functions in use:

SELECT TOP(1000)
  rm.[Id],
  COUNT(confData.[ConferenceId]) AS CalcRoomConferenceCount,
  MIN(confData.[ConferenceName]),
  SUM(confData.[CalcConferenceParticipantCount]) AS CalcRoomParticipantCount
FROM
  [Room] rm
LEFT JOIN (
  SELECT
    conf.[Id] AS [ConferenceId],
    MIN(conf.[Name]) AS [ConferenceName],
    MIN(conf.[RoomId]) AS [RoomId],
    COUNT(part.[Id]) AS CalcConferenceParticipantCount
  FROM
    [Conference] conf
  LEFT JOIN
    [Participant] part on part.[ConferenceId] = conf.[Id]
  GROUP BY
    conf.[Id]
  ) confData ON confData.[RoomId] = rm.[Id]
GROUP BY
  rm.[Id]

This slowed the query down by a further factor of about 4 (see Trials 8–10 in the image below). Here are the client statistics for the 10 trials:

[Image: client statistics for Trials 1–10]

Here's the query plan of the slow query: https://www.brentozar.com/pastetheplan/?id=SJpyeec5Q

Is there a way I can make this kind of query – where I calculate an aggregate of a subquery's aggregate – more efficient?

Best Answer

I mocked up data by looking at the row counts in your tables, giving them an even data distribution, and making guesses about the schema:

DROP TABLE IF EXISTS [Room];

CREATE TABLE [Room] (
    [Id] BIGINT NOT NULL,
    FILLER VARCHAR(200) NOT NULL,
    PRIMARY KEY ([Id])
);

INSERT INTO [Room] WITH (TABLOCK)
SELECT TOP (3088) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)), REPLICATE('Z', 200)
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
OPTION (MAXDOP 1);


DROP TABLE IF EXISTS [Conference];

CREATE TABLE [Conference] (
    [Id] BIGINT NOT NULL,
    [Name] VARCHAR(30) NOT NULL,
    [RoomId] BIGINT NOT NULL,
    FILLER VARCHAR(200) NOT NULL,
    PRIMARY KEY ([Id])
);


INSERT INTO [Conference] WITH (TABLOCK)
SELECT RN
, 'MY FAVORITE MEETING ROOM'
, 1 + RN % 3088
, REPLICATE('Z', 200)
FROM
(
    SELECT TOP (97413) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) RN
    FROM master..spt_values t1
    CROSS JOIN master..spt_values t2
) q
OPTION (MAXDOP 1);


DROP TABLE IF EXISTS [Participant];

CREATE TABLE [Participant] (
    [Id] BIGINT NOT NULL,
    [ConferenceId] BIGINT NOT NULL,
    FILLER VARCHAR(200) NOT NULL,
    PRIMARY KEY ([Id])
);


INSERT INTO [Participant] WITH (TABLOCK)
SELECT RN
, 1 + RN % 97413
, REPLICATE('Z', 200)
FROM
(
    SELECT TOP (235323) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) RN
    FROM master..spt_values t1
    CROSS JOIN master..spt_values t2
) q
OPTION (MAXDOP 1);


CREATE INDEX NCI_Part ON [Participant] ([ConferenceId]) INCLUDE (Id);

The most important assumption that I made about the schema is that the Id column is the primary key of the [Conference] table. This seemed reasonable given the query plan and the index names involved.

On my machine I get the same query plan as you, but my starting query takes only 163 ms of CPU. I assume the difference comes down to hardware, data distribution, and the fact that I'm not returning data to the client.
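
One simple way to avoid returning data to the client while benchmarking (a sketch of my own, not something taken from the trials above) is to assign the output to a variable so the rows stay on the server:

-- Sketch only: assigning the column to a variable keeps the result set on the
-- server, so client transfer time is excluded from the measurement.
DECLARE @Id BIGINT;

SELECT TOP (1000) @Id = rm.[Id]
FROM [Room] rm;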

The first thing that jumped out at me is the unnecessary GROUP BY in your confData derived table. Id is the primary key of the table, so you don't need all of those aggregates. With the right indexes (which you already have for this particular case), subqueries aren't necessarily a bad thing. Rewriting what you have to remove the GROUP BY:

SELECT TOP(1000)
  rm.[Id],
  COUNT(confData.[ConferenceId]) AS CalcRoomConferenceCount,
  MIN(confData.[ConferenceName]),
  SUM(confData.[CalcConferenceParticipantCount]) AS CalcRoomParticipantCount
FROM
  [Room] rm
LEFT JOIN (
  SELECT
    conf.[Id] AS [ConferenceId],
    conf.[Name] AS [ConferenceName],
    conf.[RoomId] AS [RoomId],
    (
        SELECT COUNT(part.[Id])
        FROM [Participant] part
        WHERE part.[ConferenceId] = conf.[Id]
    ) AS CalcConferenceParticipantCount
  FROM
    [Conference] conf
  ) confData ON confData.[RoomId] = rm.[Id]
GROUP BY
  rm.[Id]
OPTION (USE HINT('FORCE_LEGACY_CARDINALITY_ESTIMATION'));

This results in the stream aggregate getting pushed down further into the plan:

[Image: query plan for the subquery rewrite]

The uploaded plan takes 113 ms of CPU. The same operators are present, but some of them process fewer rows, which saves time. You may be able to make this query more efficient by defining a covering index on the [Conference] table with Id as the index key. This may seem like an odd thing to do, but your clustered index scan takes 10% of the overall query time and likely includes columns that you don't need.
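
A sketch of what that covering index might look like (the name and the INCLUDE list are my guesses, covering just the columns this query touches):

-- Hypothetical covering index: Id as the key, carrying only the columns the
-- query needs, so the scan reads narrower rows than the clustered index.
CREATE INDEX NCI_Conf_Covering ON [Conference] ([Id]) INCLUDE ([Name], [RoomId]);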

If you want to make the query faster you could also consider an indexed view. Why perform the aggregation every time when you can define a simple indexed view to do it for you?

CREATE VIEW IndexedViewOnParticipant WITH SCHEMABINDING
AS
SELECT [ConferenceId], COUNT_BIG([Id]) CntId, COUNT_BIG(*) Cnt
FROM dbo.[Participant]
GROUP BY [ConferenceId];

GO

CREATE UNIQUE CLUSTERED INDEX CI ON IndexedViewOnParticipant ([ConferenceId]);

This will use a little more space and add a little overhead to DML against the table. Overall I'd say it's a good use case for an indexed view. Rewriting the query again:

SELECT TOP(1000)
  rm.[Id],
  COUNT(confData.[ConferenceId]) AS CalcRoomConferenceCount,
  MIN(confData.[ConferenceName]),
  SUM(confData.[CalcConferenceParticipantCount]) AS CalcRoomParticipantCount
FROM 
  [Room] rm
LEFT JOIN (
  SELECT
    conf.[Id] AS [ConferenceId],
    conf.[Name] AS [ConferenceName],
    conf.[RoomId] AS [RoomId],
    (
        SELECT CntId
        FROM IndexedViewOnParticipant part WITH (NOEXPAND)
        WHERE part.[ConferenceId] = conf.[Id]
    ) AS CalcConferenceParticipantCount
  FROM
    [Conference] conf
  ) confData ON confData.[RoomId] = rm.[Id]
GROUP BY
  rm.[Id]
OPTION (USE HINT('FORCE_LEGACY_CARDINALITY_ESTIMATION'));

SQL Server agrees with my assessment that it's a good idea and CPU time falls to 78 ms.

On my machine I was able to make the query even faster, but this starts to get into optimizations that are somewhat risky because they may require a LOOP JOIN hint. That hint may stop being a good idea as your query or the data in your tables changes, and it may not be a good fit for your hardware. The idea behind this approach is to create a suitable index on [Conference] and to take full advantage of the TOP with a plan that only does nested loop joins. Here is the index that I added:

CREATE INDEX NCI_Conf ON [Conference] ([RoomId]) INCLUDE ([Name]);
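
For completeness, here is a sketch of how the hint would be applied; only the OPTION clause changes from the previous query:

SELECT TOP(1000)
  rm.[Id],
  COUNT(confData.[ConferenceId]) AS CalcRoomConferenceCount,
  MIN(confData.[ConferenceName]),
  SUM(confData.[CalcConferenceParticipantCount]) AS CalcRoomParticipantCount
FROM
  [Room] rm
LEFT JOIN (
  SELECT
    conf.[Id] AS [ConferenceId],
    conf.[Name] AS [ConferenceName],
    conf.[RoomId] AS [RoomId],
    (
        SELECT CntId
        FROM IndexedViewOnParticipant part WITH (NOEXPAND)
        WHERE part.[ConferenceId] = conf.[Id]
    ) AS CalcConferenceParticipantCount
  FROM
    [Conference] conf
  ) confData ON confData.[RoomId] = rm.[Id]
GROUP BY
  rm.[Id]
-- LOOP JOIN forces every join in the plan to be a nested loop join.
OPTION (LOOP JOIN, USE HINT('FORCE_LEGACY_CARDINALITY_ESTIMATION'));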

Running the same query as before with a LOOP JOIN hint gave me the following plan:

[Image: query plan with the LOOP JOIN hint]

That query took only 58 ms of CPU time. It's worth mentioning that requesting the actual plan adds quite a bit of relative overhead at this point. All of the other possible optimizations that come to mind aren't safe for production, so I'll stop here.
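
As an aside, if the overhead of capturing the actual plan is a concern, SET STATISTICS TIME is a lighter-weight way to get CPU and elapsed time (a general note, not tied to the numbers above):

-- Reports CPU time and elapsed time for each statement in the Messages tab,
-- without the cost of capturing the actual execution plan.
SET STATISTICS TIME ON;

SELECT TOP (1000) rm.[Id]
FROM [Room] rm;   -- the query under test goes here

SET STATISTICS TIME OFF;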

As a final thought, do you really want to return 1000 arbitrary rows and the minimum conference name? Is that information useful to your end users?