Sql-server – How to make this aggregation query more efficient

optimizationsql server 2014

I have a query that's taking quite a long time on my machine (7 minutes) to execute and I was wondering whether I could make it (significantly) quicker:

SELECT
    rec.[Id] AS RecordId,
    MIN(rec.[CreationDate]) AS RecordCreationDate,
    MIN(rec.[LastModified]) AS RecordLastModified,
    MIN(rec.[AssetType]) AS RecordAssetType,
    MIN(rec.[MasterFilename]) AS RecordMasterFilename,
    MIN(rec.[GameName]) AS RecordGameName,
    usr.[OrganizationName],
    COUNT(hist.[Id]) AS TimesDownloaded
FROM
(
    SELECT
        innerRec.Id,
        MIN(innerRec.CreationDate) AS CreationDate,
        MIN(innerRec.LastModified) AS LastModified,
        MIN(innerRec.AssetType) AS AssetType,
        MIN(innerRec.MasterFilename) AS MasterFilename,
        MIN(innerRec.GameName) AS GameName
    FROM
        [dbo].[Record] innerRec INNER JOIN [dbo].[RecordClassificationLink] innerLnk ON innerRec.Id = innerLnk.RecordId
    -- WHERE (classification ID is foo or bar)
    GROUP BY
        innerRec.Id
    -- HAVING COUNT(innerLnk.ClassificationId) = (number of specified classifications)
) rec
CROSS JOIN
    [dbo].[AdamUser] usr
LEFT JOIN
    (SELECT * FROM [dbo].[MaintenanceJobHistory] WHERE [CreatedOn] > '2016-01-01 00:00:00' AND [CreatedOn] < '2016-12-01 00:00:00') hist ON usr.Name = hist.AccessingUser AND rec.Id = hist.RecordId
GROUP BY
    rec.Id, usr.OrganizationName

What it's doing is extracting data to be put in an Excel spreadsheet report (whether a spreadsheet is a good presentation of this data is outside the scope of this question 🙂 )

The first subquery pulls out records optionally filtered by a list of classification IDs. These are then cross joined with the user table, because each user table row actually contains the info we really need for this: the user's organization name. I then left join the maintenance job history table (storing an entry for each record download) in order to create multiple rows if a record has been accessed multiple times, then group by record ID and organization name to get a "number of record downloads per organization" count as TimesDownloaded.

The code that is reading this output then populates an associative array whose key is OrganizationName and whose value is TimesDownloaded, creating the equivalent of a dynamic PIVOT where each record row contains one column per organization each containing the count of the number of record downloads.

As you can imagine this runs quite slowly on a large dataset, as I said above; the one I'm working with has ~38000 Records and ~1000 users, meaning the cross join results in ~38,000,000 rows, but it seems conceptually necessary.

Can this be made significantly more efficient? Would it be better if I did the PIVOT in dynamic SQL instead?

The DBMS I'm using is SQL Server 2014.

Here are the schema definitions for the tables:

CREATE TABLE [dbo].[AdamUser](
    [Id] [uniqueidentifier] NOT NULL,
    [Name] [nvarchar](200) NOT NULL,
    [UserGroupName] [nvarchar](50) NOT NULL,
    [OrganizationName] [nvarchar](50) NOT NULL,
PRIMARY KEY CLUSTERED 
(
    [Id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

GO

CREATE UNIQUE NONCLUSTERED INDEX [UIX_AdamUser_Name] ON [dbo].[AdamUser]
(
    [Name] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
GO

CREATE TABLE [dbo].[MaintenanceJobHistory](
    [Id] [uniqueidentifier] NOT NULL,
    [Data] [xml] NOT NULL,
    [CreatedOn] [datetime] NOT NULL,
    [Type] [nvarchar](512) NOT NULL,
    [RecordId] [uniqueidentifier] NOT NULL,
    [AccessingUser] [nvarchar](200) NOT NULL,
 CONSTRAINT [PK_MaintenanceJobHistory] PRIMARY KEY CLUSTERED 
(
    [Id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]

GO

CREATE NONCLUSTERED INDEX [IX_MaintenanceJobHistory_CreatedOn] ON [dbo].[MaintenanceJobHistory]
(
    [CreatedOn] ASC
)
INCLUDE (     [Id],
    [RecordId],
    [AccessingUser]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
GO

CREATE TABLE [dbo].[Record](
    [Id] [uniqueidentifier] NOT NULL,
    [CreationDate] [datetime] NOT NULL,
    [LastModified] [datetime] NOT NULL,
    [AssetType] [nvarchar](max) NULL,
    [MasterFilename] [nvarchar](max) NULL,
    [GameName] [nvarchar](max) NULL,
 CONSTRAINT [PK_Record] PRIMARY KEY CLUSTERED 
(
    [Id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]

GO

CREATE TABLE [dbo].[RecordClassificationLink](
    [Id] [uniqueidentifier] NOT NULL,
    [RecordId] [uniqueidentifier] NOT NULL,
    [ClassificationId] [uniqueidentifier] NOT NULL,
 CONSTRAINT [PK_RecordClassificationLink] PRIMARY KEY CLUSTERED 
(
    [Id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

GO

Here's the execution plan: https://www.brentozar.com/pastetheplan/?id=Sy6LlXDXg

Typical output:

However this output gets turned into an Excel spreadsheet along the lines of the following by the calling program (hence its being like a PIVOT operation):

.----------------------------------------------------------------------.
| Filename | Creation Date | #times downloaded by: CompanyA | CompanyB | ...
| foo.png  | 1/2/3         |                       0        | 2        |
| bar.png  | 1/3/4         |                       3        | 1        |
...

UPDATE:

I ended up making things significantly more efficient by moving the PIVOT operation into the SQL query itself; that way, SQL Server only has to output the number of rows in the Record table rather than that multiplied by the number of organizations (not too bad until you get to hundreds of organizations, at which point it's a huge number). The operation still takes a few minutes, but it's much more bearable. Here's the query I finally decided to use:

SELECT *
FROM (
    SELECT
        rec.[Id]                         AS RecordId,
        'Org_' + usr.[OrganizationName]  AS OrganizationNamePrefixed,
        COUNT(hist.[Id])                 AS TimesDownloaded  -- To be aggregated by PIVOT
    FROM (
        SELECT
            innerRec.[Id]
        FROM
            [dbo].[Record] innerRec
        INNER JOIN
            [dbo].[RecordClassificationLink] innerLnk ON innerLnk.[RecordId] = innerRec.[Id]
        -- WHERE (classification ID is foo or bar), for optional classification filtering
        GROUP BY
            innerRec.[Id]
        -- HAVING COUNT(innerLnk.ClassificationId) = (number of specified classifications), for optional classification filtering
    ) rec
    CROSS JOIN [dbo].[AdamUser] usr
    LEFT JOIN (
        SELECT * FROM [dbo].[MaintenanceJobHistory] WHERE [CreatedOn] > 'eg. 2016-01-01 12:00:00' AND [CreatedOn] < 'eg. 2016-12-01 12:00:00'
    ) hist ON hist.[AccessingUser] = usr.[Name] AND hist.[RecordId] = rec.[Id]
    GROUP BY
        rec.[Id], usr.[OrganizationName]
) srcTable
PIVOT  -- Pivot around columns outside aggregation fn, eg. heading column [OrganizationNamePrefixed] & all other columns: [RecordId]
(
    MIN(srcTable.[TimesDownloaded]) FOR [OrganizationNamePrefixed] IN (...list of ~200 columns dynamically generated...)
) pivotTable
INNER JOIN [dbo].[Record] outerRec ON outerRec.[Id] = pivotTable.[RecordId]

I added various indexes and also made the PIVOT as efficient as possible by only selecting the aggregation column, the headings column, and the necessary other column(s) to pivot around. Lastly, I re-JOIN the Record table using the RecordId PK to get the extra record info per row.

Best Answer

The stuff that is probably taking a lot of time is a large number of Sort operations in your query plan. You can preempt those by sorting the data yourself, in the form of indexes.

Here are some index suggestions that I think would get you started:

CREATE INDEX IX_RecordId
    ON dbo.RecordClassificationLink (RecordId);

CREATE INDEX IX_MaintenanceJobHistory_ByUser
    ON dbo.MaintenanceJobHistory (AccessingUser, RecordId);

CREATE UNIQUE INDEX UIX_AdamUser_OrgnName
    ON dbo.AdamUser (OrganizationName, Name);

Then, you could modify your query a little bit to help the optimizer make some smart choices, like aggregating certain data streams before they're joined and create a much larger product that will take more time to aggregate:

SELECT
    rec.[Id] AS RecordId,
    MIN(rec.[CreationDate]) AS RecordCreationDate,
    MIN(rec.[LastModified]) AS RecordLastModified,
    MIN(rec.[AssetType]) AS RecordAssetType,
    MIN(rec.[MasterFilename]) AS RecordMasterFilename,
    MIN(rec.[GameName]) AS RecordGameName,
    usr.[OrganizationName],
    SUM(ISNULL(hist._count, 0)) AS TimesDownloaded
FROM
(
    SELECT
        innerRec.Id,
        MIN(innerRec.CreationDate) AS CreationDate,
        MIN(innerRec.LastModified) AS LastModified,
        MIN(innerRec.AssetType) AS AssetType,
        MIN(innerRec.MasterFilename) AS MasterFilename,
        MIN(innerRec.GameName) AS GameName
    FROM [dbo].[Record] innerRec
    INNER JOIN [dbo].[RecordClassificationLink] innerLnk ON innerRec.Id = innerLnk.RecordId
    -- WHERE (classification ID is foo or bar)
    GROUP BY
        innerRec.Id
    -- HAVING COUNT(innerLnk.ClassificationId) = (number of specified classifications)
) AS rec
CROSS JOIN [dbo].[AdamUser] AS usr
LEFT JOIN
    (SELECT AccessingUser, RecordId, COUNT(*) AS _count
     FROM [dbo].[MaintenanceJobHistory]
     WHERE [CreatedOn] > '2016-01-01 00:00:00' AND
           [CreatedOn] < '2016-12-01 00:00:00'
     GROUP BY AccessingUser, RecordId
    ) AS hist ON
        usr.Name = hist.AccessingUser AND
        rec.Id = hist.RecordId
GROUP BY
    rec.Id, usr.OrganizationName;

I've done the following:

the hist subquery is aggregated on AccessingUser, RecordId, and I've created a COUNT(*) AS _count. This query uses the new index IX_MaintenanceJobHistory_ByUser to perform very efficiently without any memory grant or hash tables.
Because of this, I've replaced COUNT(hist.Id) with SUM(ISNULL(hist._count, 0)) AS TimesDownloaded
The new index on dbo.RecordClassificationLink helps perform a smooth join with the Record table, but if you add your WHERE and HAVING, that index won't help you.
The new index on dbo.AdamUser also improves performance by eliminating a Sort operator - because you aggregate on the OrganizationName column, might as well have your data sorted on this from the get-go.

In my mind, that should give you the same result, but it's late here, so you'll have to verify the results yourself. :)

Here's my query plan:

EDIT: You could also simplify the rec part a bit - probably be a bit easier to read:

SELECT
    rec.[Id] AS RecordId,
    rec.[CreationDate] AS RecordCreationDate,
    rec.[LastModified] AS RecordLastModified,
    rec.[AssetType] AS RecordAssetType,
    rec.[MasterFilename] AS RecordMasterFilename,
    rec.[GameName] AS RecordGameName,
    usr.[OrganizationName],
    SUM(ISNULL(hist._count, 0)) AS TimesDownloaded
FROM
(
    SELECT
        Id,
        CreationDate,
        LastModified,
        AssetType,
        MasterFilename,
        GameName
    FROM [dbo].[Record]
    WHERE Id IN (
        SELECT RecordId
        FROM [dbo].[RecordClassificationLink]
        --- WHERE ClassificationId=...
        )
) AS rec
CROSS JOIN [dbo].[AdamUser] AS usr
LEFT JOIN
    (SELECT AccessingUser, RecordId, COUNT(*) AS _count
     FROM [dbo].[MaintenanceJobHistory]
     WHERE [CreatedOn] > '2016-01-01 00:00:00' AND
           [CreatedOn] < '2016-12-01 00:00:00'
     GROUP BY AccessingUser, RecordId
    ) AS hist ON
        usr.Name = hist.AccessingUser AND
        rec.Id = hist.RecordId
GROUP BY
    usr.OrganizationName,
    rec.[Id],
    rec.[CreationDate],
    rec.[LastModified],
    rec.[AssetType],
    rec.[MasterFilename],
    rec.[GameName],
    usr.[OrganizationName];

... and the plan looks a tiny bit better as well (look for the red Lazy spool at the bottom of the original plan, that is now gone.)

Related Solutions

Mysql – How to optimize this MySQL query further

After looking over the query, the tables, and the WHERE AND GROUP BY clauses, I recommend the following:

Recommendation #1) Refactor the Query

I reorganized the query to do three(3) things:

create smaller temp tables
Process the WHERE clause on those temp tables
Delay joining to the very last

Here is my proposed query:

SELECT
  sounds.*,srkeys.avg_rating,srkeys.votes
FROM
(
  SELECT AA.id,avg(BB.rating) AS avg_rating, count(BB.rating) AS votes
  (
    SELECT id FROM sounds
    WHERE blacklisted = false 
    AND   ready_for_deployment = true 
    AND   deployed = true 
    AND   type = "Sound" 
    AND   created_at > '2011-03-26 21:25:49'
  ) AA INNER JOIN
  (
    SELECT AAA.ratings,AAA.rateable_id
    FROM ratings AAA
    WHERE rateable_type = 'Sound'
  ) BB
  ON AA.id = BB.rateable_id
  GROUP BY BB.rateable_id
) srkeys INNER JOIN sounds USING (id);

Recommendation #2) Index the sounds table with an index that will accommodate the WHERE clause

The columns of this index include all the columns from the WHERE clause with static values first and moving target last

ALTER TABLE sounds ADD INDEX support_index
(blacklisted,ready_for_deployment,deployed,type,created_at);

I sincerely believe you will be pleasantly surprised. Give it a Try !!!

UPDATE 2011-05-21 19:04

I just saw the cardinality. OUCH !!! Cardinality of 1 for rateable_id. Boy, I feel stupid !!!

UPDATE 2011-05-21 19:20

Maybe making the index will be enough to improve things.

UPDATE 2011-05-21 22:56

Please run this:

EXPLAIN SELECT
  sounds.*,srkeys.avg_rating,srkeys.votes
FROM
(
  SELECT AA.id,avg(BB.rating) AS avg_rating, count(BB.rating) AS votes FROM
  (
    SELECT id FROM sounds
    WHERE blacklisted = false 
    AND   ready_for_deployment = true 
    AND   deployed = true 
    AND   type = "Sound" 
    AND   created_at > '2011-03-26 21:25:49'
  ) AA INNER JOIN
  (
    SELECT AAA.ratings,AAA.rateable_id
    FROM ratings AAA
    WHERE rateable_type = 'Sound'
  ) BB
  ON AA.id = BB.rateable_id
  GROUP BY BB.rateable_id
) srkeys INNER JOIN sounds USING (id);

UPDATE 2011-05-21 23:34

I refactored it again. Try This One Please:

EXPLAIN
  SELECT AA.id,avg(BB.rating) AS avg_rating, count(BB.rating) AS votes FROM
  (
    SELECT id FROM sounds
    WHERE blacklisted = false 
    AND   ready_for_deployment = true 
    AND   deployed = true 
    AND   type = "Sound" 
    AND   created_at > '2011-03-26 21:25:49'
  ) AA INNER JOIN
  (
    SELECT AAA.ratings,AAA.rateable_id
    FROM ratings AAA
    WHERE rateable_type = 'Sound'
  ) BB
  ON AA.id = BB.rateable_id
  GROUP BY BB.rateable_id
;

UPDATE 2011-05-21 23:55

I refactored it again. Try This One Please (Last Time):

EXPLAIN
  SELECT A.id,avg(B.rating) AS avg_rating, count(B.rating) AS votes FROM
  (
    SELECT BB.* FROM
    (
      SELECT id FROM sounds
      WHERE blacklisted = false 
      AND   ready_for_deployment = true 
      AND   deployed = true 
      AND   type = "Sound" 
      AND   created_at > '2011-03-26 21:25:49'
    ) AA INNER JOIN sounds BB USING (id)
  ) A INNER JOIN
  (
    SELECT AAA.ratings,AAA.rateable_id
    FROM ratings AAA
    WHERE rateable_type = 'Sound'
  ) B
  ON A.id = B.rateable_id
  GROUP BY B.rateable_id;

UPDATE 2011-05-22 00:12

I hate giving up !!!!

EXPLAIN
  SELECT A.*,avg(B.rating) AS avg_rating, count(B.rating) AS votes FROM
  (
    SELECT BB.* FROM
    (
      SELECT id FROM sounds
      WHERE blacklisted = false 
      AND   ready_for_deployment = true 
      AND   deployed = true 
      AND   type = "Sound" 
      AND   created_at > '2011-03-26 21:25:49'
    ) AA INNER JOIN sounds BB USING (id)
  ) A,
  (
    SELECT AAA.ratings,AAA.rateable_id
    FROM ratings AAA
    WHERE rateable_type = 'Sound'
    AND AAA.rateable_id = A.id
  ) B
  GROUP BY B.rateable_id;

UPDATE 2011-05-22 07:51

It has been bothering me that ratings is coming back with 2 million rows in the EXPLAIN. Then, it hit me. You might need another index on the ratings table which starts with rateable_type:

ALTER TABLE ratings ADD INDEX
rateable_type_rateable_id_ndx (rateable_type,rateable_id);

The goal of this index is to reduce the temp table that manipulates ratings so that it is less that 2 million. If we can get that temp table significantly smaller (at least half), then we can have a better hope in your query and mine working faster too.

After making that index, please Retry my original proposed query and also try yours:

SELECT
  sounds.*,srkeys.avg_rating,srkeys.votes
FROM
(
  SELECT AA.id,avg(BB.rating) AS avg_rating, count(BB.rating) AS votes
  (
    SELECT id FROM sounds
    WHERE blacklisted = false 
    AND   ready_for_deployment = true 
    AND   deployed = true 
    AND   type = "Sound" 
    AND   created_at > '2011-03-26 21:25:49'
  ) AA INNER JOIN
  (
    SELECT AAA.ratings,AAA.rateable_id
    FROM ratings AAA
    WHERE rateable_type = 'Sound'
  ) BB
  ON AA.id = BB.rateable_id
  GROUP BY BB.rateable_id
) srkeys INNER JOIN sounds USING (id);

UPDATE 2011-05-22 18:39 : FINAL WORDS

I had refactored a query in a stored procedure and added an index to help answer a question on speeding things up. I got 6 upvotes, had the answer accepted,and picked up a 200 bounty.

I had also refactored another query (marginal results) and added an index (dramatic results). I got 2 upvotes and had the answer accepted.

I added an index for yet another query challange and was upvoted once

and now your question.

Wanting to answers all questions like these (including yours) were inspired by a YouTube video I watched on refactoring queries.

Thank you again, @coneybeare !!! I wanted to answer this question to fullest extent possible, not just accept points or accolades. Now, I can feel that I earned the points !!!

Mysql – This query is choking the server. How to improve it

I believe you need to do COUNT( DISTINCT comments.id ), not COUNT( comments.id ) in your last query .

Update. Assuming petfb.id is a primary key in petfb,

select pet.*, ownrs.id AS ownrunique, ownrs.super, 
count(distinct comments.id) AS noOfCom, 
count(distinct case when feedback=1 OR feedback=100 then petfb.id end ) AS newLikes, 
count(distinct case when feedback=5 OR feedback=6 then petfb.id end) AS newDisLikes 
from pet 
join ownrs on pet.ownerID=ownrs.uniqueID
Left join comments on pet.id=comments.petid 
Left join petfb on pet.id=petfb.petid and petfb.feedback in (1,5,6,100)
 group by pet.id
 ORDER BY date DESC LIMIT 0,10

Best Answer

Related Solutions

Mysql – How to optimize this MySQL query further

Mysql – This query is choking the server. How to improve it

Related Question