SQL Server – TOP clause slows down query

optimization, performance, sql-server, t-sql

I already received an answer to a similar question, but it's not applicable in the case below.

Last time, I simply rewrote my query as two distinct queries: the first fetched the first N rows from one table, and the result was then joined (if and only if it contained a sufficient amount of data). But here there is enough data, yet the query is still very slow. Here is a simplified example of the slow query:

DECLARE @Top int = 1000

SELECT TOP (@Top) *
FROM (
    SELECT [t0].[DateTimeUtc] AS [value], [t2].[SystemName], [t2].[Name], [t0].[Id], [t1].[Discriminator], [t1].[ParentActionTemplateId], [t0].[DateTimeUtc], [t1].[Id] AS [Id2]
    FROM [directcrm].[CustomerActions] AS [t0]
    INNER JOIN [directcrm].[ActionTemplates] AS [t1] ON [t1].[Id] = [t0].[ActionTemplateId]
    LEFT OUTER JOIN [directcrm].[ActionTemplates] AS [t2] ON [t2].[Id] = [t1].[ParentActionTemplateId]
    ) AS [t7]
WHERE (([t7].[Discriminator] = @p9) OR ([t7].[Discriminator] = @p10) OR ([t7].[Discriminator] = @p11) OR ([t7].[Discriminator] = @p12)) AND ([t7].[ParentActionTemplateId] IS NOT NULL) AND (([t7].[DateTimeUtc]) <= @p13) AND (([t7].[DateTimeUtc]) > @p14) AND ([t7].[Id] > @p15)
ORDER BY [t7].[Id]

[Execution plan screenshot: nested loop join]

But if we hack the optimizer:

DECLARE @Top int = 1000

SELECT TOP (@Top) *
FROM (
    SELECT [t0].[DateTimeUtc] AS [value], [t2].[SystemName], [t2].[Name], [t0].[Id], [t1].[Discriminator], [t1].[ParentActionTemplateId], [t0].[DateTimeUtc], [t1].[Id] AS [Id2]
    FROM [directcrm].[CustomerActions] AS [t0]
    INNER JOIN [directcrm].[ActionTemplates] AS [t1] ON [t1].[Id] = [t0].[ActionTemplateId]
    LEFT OUTER JOIN [directcrm].[ActionTemplates] AS [t2] ON [t2].[Id] = [t1].[ParentActionTemplateId]
    ) AS [t7]
WHERE (([t7].[Discriminator] = @p9) OR ([t7].[Discriminator] = @p10) OR ([t7].[Discriminator] = @p11) OR ([t7].[Discriminator] = @p12)) AND ([t7].[ParentActionTemplateId] IS NOT NULL) AND (([t7].[DateTimeUtc]) <= @p13) AND (([t7].[DateTimeUtc]) > @p14) AND ([t7].[Id] > @p15)
ORDER BY [t7].[Id]

OPTION (OPTIMIZE FOR (@TOP = 100000000)) -- < here

[Execution plan screenshot: hash join]

We get an instant answer: in the first case the optimizer chooses a nested loop join, which is extremely slow here; in the second, a hash join.

Of course, I have rebuilt all statistics and all indexes for every table in the database.

I don't know what I can do with this query, because it's just a simple join without any magic. I tried building some indexes, rewriting the query, and so on, but the optimizer always chooses the same plan and runs very slowly. I'm unable to use hints because of the ORM, and because they seem really hacky; I only use them to locate the bottleneck and then fix it with DDL modifications. That usually works, but today it failed.

Best Answer

When possible you should upload actual execution plans to Paste The Plan. Without the XML we have to make guesses about the operators shown in the images. For the rest of this answer I'm going to assume that the clustered key of all of the tables is id.

As far as I can tell the problem isn't with the nested loop join but with the table access method on the [directcrm].[CustomerActions] table. It looks like SQL Server does an ordered clustered index seek starting at [t7].[Id] > @p15. However, it's possible that the storage engine needs to read many rows before it finds enough that match the filter requirements against the [DateTimeUtc] column. If you're on SQL Server 2016 you can check this by looking at "Number of Rows Read" for the clustered index seek in the actual query plan. It's theoretically possible for the clustered index seek on [directcrm].[ActionTemplates] AS [t1] to also be a problem. However, it looks like the filtering against that table only removes 30% of the rows, so I doubt that's an issue.

Based on just how the query is written, the biggest problem is that SQL Server is not aware of the values of the local variables. As a result the query optimizer makes cardinality estimates based on hardcoded defaults. For example, whenever you have an unknown expression in TOP, the query optimizer uses a hardcoded guess of 100 for that value. I understand that you are constrained by your ORM, but you're essentially fighting with the query optimizer.

I have a few ideas on how to improve performance listed below in rough order of preference. It's likely that you won't be able to implement some of them due to ORM restrictions but I thought I should include them all for completeness.

A good first test is the RECOMPILE hint. That may be enough to get SQL Server to pick a better plan. I understand that you can't use that one, but perhaps you could replace some of the variables with hardcoded values? TOP with a variable is going to be particularly bad because the query plan will never change as you change the value of the variable. The optimizer will always set a row goal of 100 when creating the plan (based on testing I did tonight).
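As a sketch, the hint just goes at the end of the statement (the derived table is abbreviated here for brevity):

```sql
DECLARE @Top int = 1000

SELECT TOP (@Top) *
FROM ( /* same derived table [t7] as in the question */ ) AS [t7]
WHERE /* same predicates as in the question */
ORDER BY [t7].[Id]
-- With RECOMPILE the optimizer compiles a fresh plan per execution and
-- sees the actual value of @Top, instead of using the row goal guess of 100.
OPTION (RECOMPILE)
```

The trade-off is a compilation on every execution, which may or may not be acceptable for a frequently run query.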

Since you got better performance when not using the clustered index on [directcrm].[CustomerActions] as the driving index you could try to make the clustered seek more expensive or unavailable. One way to make it more expensive is to introduce some doubt about the order of the rows. Right now you order by t0.id which means that plans that use that clustered index can avoid a sort. If you did something like the following that might be enough:

ORDER BY CASE WHEN @sort = 1 THEN [Discriminator] ELSE [Id] END

Just always set @sort equal to 0, so the CASE resolves to [Id] and you get the sort order you want when the query is executed.
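Putting it together, a sketch might look like this (@sort is an assumed extra parameter you would need to pass in; the derived table is abbreviated):

```sql
DECLARE @Top int = 1000
DECLARE @sort int = 0  -- always pass 0 in practice

SELECT TOP (@Top) *
FROM ( /* same derived table [t7] as in the question */ ) AS [t7]
WHERE /* same predicates as in the question */
-- The CASE hides the sort column from the optimizer, so it can no longer
-- assume that the clustered index on [Id] delivers rows in the required
-- order and avoid a sort.
ORDER BY CASE WHEN @sort = 1 THEN [t7].[Discriminator] ELSE [t7].[Id] END
```

Because the optimizer must now plan for a sort either way, the ordered clustered index seek loses much of its cost advantage.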

Another trick is to add superfluous operators to parts of the WHERE clause to prevent index use. In some cases just adding 0 won't be optimized away and an index seek will be unavailable. You can strategically add things like that to the WHERE clause to encourage the index usage that you want. For example, if you don't want a clustered index seek on [directcrm].[CustomerActions] you can start with this and make it more complicated if that isn't sufficient:

[t7].[Id] + 0 > @p15

You may be able to create a covering index against [directcrm].[ActionTemplates] to encourage that table to be used as the driving table. Of course you could always drop the clustered index on [directcrm].[CustomerActions] as well but I imagine that will have side effects that you can't ignore.
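A sketch of such a covering index follows; the key and included columns are assumptions read off the query's predicates and select list, not a tested recommendation:

```sql
-- Covers the WHERE predicates against [t1] ([Discriminator],
-- [ParentActionTemplateId]); the clustered key [Id] is carried in the
-- nonclustered index automatically, so the joins remain covered.
CREATE NONCLUSTERED INDEX IX_ActionTemplates_Discriminator_Parent  -- hypothetical name
ON [directcrm].[ActionTemplates] ([Discriminator], [ParentActionTemplateId]);
```

If the optimizer picks this index as the driving table, the expensive ordered seek on [directcrm].[CustomerActions] is no longer the natural starting point of the plan.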

The nuclear option is to use a plan guide. There are many ways to create plan guides. It's possible to force a query to use a certain plan without changing the query's text. For example, if you can get a good query plan into the cache (such as with your row goal trick), then you can use sp_create_plan_guide_from_handle to attach a plan guide to the query. Keep in mind that your query text needs to exactly match what is stored in the plan guide. If you're on SQL Server 2016 you could also try freezing a query plan using the Query Store.
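A rough sketch of that last step, assuming the good plan is already cached (the LIKE filter and the guide name are placeholders you'd adjust to your query):

```sql
DECLARE @plan_handle varbinary(64);

-- Find the cached plan for the query; narrow the filter until exactly
-- one statement matches.
SELECT TOP (1) @plan_handle = qs.plan_handle
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
WHERE st.text LIKE N'%CustomerActions%';  -- placeholder filter

-- Freeze that plan with a guide; the name is hypothetical.
EXEC sp_create_plan_guide_from_handle
    @name = N'Guide_CustomerActions_Top',
    @plan_handle = @plan_handle;
```

From then on, any query whose text exactly matches the guided statement will use the frozen plan, regardless of what the optimizer would otherwise choose.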