SQL Server – How to Speed Up a Large Subquery

performance, sql-server-2012

In my organisation, Ricoh is the supplier of printers and copiers. Papercut is used to register which copies are made and on which account. Another application that works with Ricoh registers copies in a different database, but the two are somewhat linked.

My organisation wanted a central place to retrieve the total cost, so I wrote some queries. Most work fine, but something weird is going on with the one below and I'm unable to figure out why it is failing. What could be the reason for the following issue?

It's hard to make a db<>fiddle as the databases are quite complicated. I'll try to explain the issue here.

In essence there are two main tables. The table maintained by Ricoh contains client identifiers (ClientId), a price (the cost of the copy work), print timestamps and a job identifier. The table maintained by Papercut also has a timestamp, a printed flag and a reference to the corresponding JobId in its job_comment column.

Now consider the following query, which gives the total print cost for jobs printed using the separate Ricoh application.

SELECT ricoh.ClientId,
       SUM(ricoh.price) AS cost
FROM
  (SELECT a.SystemId as ClientId,
          a.ProcessInternalUid as JobId,
          a.price
   FROM   ricoh.printhistory AS a
   WHERE a.Exportdate >= '2020-10-01'
     AND a.Exportdate <= '2020-12-12') AS ricoh
INNER JOIN
  (SELECT substring(b.job_comment, 16, 36) AS JobId
   FROM  papercut.printer_usage_log AS b
   WHERE b.usage_date >= '2020-10-01'
     AND b.usage_date <= '2020-12-12'
     AND b.printed = 'Y'
   GROUP BY substring(b.job_comment, 16, 36)) AS papercut ON (papercut.JobId = ricoh.JobId)
GROUP BY ricoh.ClientId

If I run this query I get instant results, with the subqueries ricoh and papercut returning 96,321 and 9,354 rows respectively.

But if I change the lower bound of the date range to 2020-11-01, the query takes 49 minutes (!), even though the subqueries then return only 46,223 and 4,547 rows respectively.

How can this be? How can a join over larger subqueries finish so much faster? I'm really puzzled by this. What could be a reason for this increase in time? (Then I can try to debug further.)

I've done some (initial) debugging and noticed the following:

  • There is a limit date. If I set the lower bound to anything later than 2020-10-08, the execution time becomes large. I went looking for troublesome records with that timestamp but found none. Also, choosing any later date, even 2020-12-01, results in high execution times.
  • If I remove the GROUP BY aggregation (and the SUM), execution is instantaneous. But if I then add an ORDER BY ricoh.price, execution time rises again (see the sketch below). Could something with the price be an issue? It's a floating-point number. I tried converting it to an integer but the issue remains.
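
For clarity, a sketch of that variant (the same two subqueries as the original query, with the aggregation removed and the ORDER BY added; the lower bound shown is one of the "slow" ones):

SELECT ricoh.ClientId,
       ricoh.price
FROM
  (SELECT a.SystemId as ClientId,
          a.ProcessInternalUid as JobId,
          a.price
   FROM   ricoh.printhistory AS a
   WHERE a.Exportdate >= '2020-11-01'
     AND a.Exportdate <= '2020-12-12') AS ricoh
INNER JOIN
  (SELECT substring(b.job_comment, 16, 36) AS JobId
   FROM  papercut.printer_usage_log AS b
   WHERE b.usage_date >= '2020-11-01'
     AND b.usage_date <= '2020-12-12'
     AND b.printed = 'Y'
   GROUP BY substring(b.job_comment, 16, 36)) AS papercut ON (papercut.JobId = ricoh.JobId)
ORDER BY ricoh.price -- without this line the query is instantaneous; with it, it is slow again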

EDIT:

Added the execution plans as requested by user J.D. I'm not really that familiar with these plans. I would guess the main difference is the Hash Match?

Also, this query is running on SQL Server 2012. And since the database is managed by external firms, I cannot add indexes. Is there a way to rewrite the query to force the 'fast' execution plan?

Best Answer

Almost guaranteed, what you're experiencing is a change in the Execution Plan that causes a different operation (e.g. an Index Scan instead of an Index Seek) to occur, because there's a different cardinality for the data being returned for one date range vs the other.

Cardinality is the uniqueness of results being returned from the tables being queried. SQL Server's Engine uses cardinality to determine what types of operations are best to use when finding and serving up the data. Then it generates an Execution Plan which is the full set of operations it needs to undertake to serve the data.

The cardinality is determined from the predicates that you use in your WHERE, JOIN, and HAVING clauses.

For example, say you have a Cars table with 10 records and a Manufacturer column, and of those 10 records 8 have Manufacturer set to "Honda" and 2 are set to "BMW". "BMW" is a much more unique value in that table than "Honda". Therefore an Index Seek is a much more performant operation to find and serve the data if your query does a search WHERE Manufacturer = 'BMW'. But "Honda" is a far more common value in the Cars table, and if you changed your query to WHERE Manufacturer = 'Honda' then an Index Scan operation would be more performant (and the Execution Plan to get the data would change). This works the same way for all types of columns (dates, ints, floats, etc).
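
A minimal sketch of that hypothetical Cars example (the table, index, and column names here are purely illustrative, not from the databases in your question):

-- Hypothetical table to illustrate seek vs. scan
CREATE TABLE Cars
(
    CarId        int IDENTITY(1,1) PRIMARY KEY,
    Manufacturer varchar(50) NOT NULL
);
CREATE INDEX IX_Cars_Manufacturer ON Cars (Manufacturer);

-- 8 "Honda" rows and 2 "BMW" rows
INSERT INTO Cars (Manufacturer)
VALUES ('Honda'),('Honda'),('Honda'),('Honda'),
       ('Honda'),('Honda'),('Honda'),('Honda'),
       ('BMW'),('BMW');

-- Highly selective value: the optimizer tends toward an Index Seek
SELECT CarId FROM Cars WHERE Manufacturer = 'BMW';

-- Common value: scanning the index (or table) is often the cheaper choice
SELECT CarId FROM Cars WHERE Manufacturer = 'Honda';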

SQL Server stores statistics on these values to help determine the cardinality, but sometimes it misestimates the cardinality when serving the data, which results in sub-optimal execution plans. Without seeing your execution plans specifically, this is the best we can speculate about what the issue is, but it seems very likely to be a cardinality estimate issue.
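
If you have at least read access, you can look at what statistics exist and what the optimizer thinks about a column's distribution. A sketch (the statistics name in the DBCC call is a placeholder; substitute a real name returned by the first query):

-- List the statistics objects on the Ricoh history table
SELECT s.name, s.auto_created
FROM sys.stats AS s
WHERE s.object_id = OBJECT_ID('ricoh.printhistory');

-- Show the histogram behind one of them (replace the placeholder with a real statistics name)
DBCC SHOW_STATISTICS ('ricoh.printhistory', 'stats_name_from_query_above');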

The big takeaway is that the amount of data being returned isn't the only thing that affects the performance of a query; in fact, more often than not it's not the reason for bad performance. There's a lot more that happens under the hood of the engine when serving up the data.


Per your updated question, here's some additional information related to the aforementioned paragraphs.

Here's the most relevant piece of your slow query's execution plan: Slow Execution Plan.

Notice how the arrows are a lot fatter here; that's because the actual amount of data returned at this step is a lot more than what was previously being returned at this step. But again, don't get stuck on the amount of data. The bigger issue is that if you look at the "Actual Number of Rows", it's returning 4,048,577 rows, but under the "Estimated Number of Rows" it thought it was only going to return 1 row. This is where the cardinality estimate issue I mentioned is occurring. (These two numbers can be off by a little bit, but generally if they're off by an order of magnitude or more, then that's a clear sign of a cardinality estimate problem.)
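
If you want to compare actual vs. estimated rows per operator without clicking through the graphical plan, one option (a sketch; run it in the same session as the query) is the text-based profile, which returns a Rows and an EstimateRows column for every operator so large mismatches stand out:

SET STATISTICS PROFILE ON;

-- ... run the slow query here ...

SET STATISTICS PROFILE OFF;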

One reason a cardinality estimate issue like this causes performance problems is that it makes the engine request the wrong amount of hardware resources to serve the data. I.e. by under-estimating the number of rows this step will return by so much, it could severely under-request the Memory and CPU needed to process this operation.
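
If you want to see the memory side of that while the slow query is running, a quick sketch (run from a second session; requires VIEW SERVER STATE permission):

-- Requested vs. granted vs. actually used workspace memory per running query
SELECT session_id,
       requested_memory_kb,
       granted_memory_kb,
       used_memory_kb,
       ideal_memory_kb
FROM sys.dm_exec_query_memory_grants;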

One thing that can cause cardinality estimate issues is when functions are used as predicates in the JOIN, WHERE, and HAVING clauses. After reviewing your query more closely, I see that you're using SUBSTRING() to get a value that represents the JobId from the papercut.printer_usage_log table, and then joining on that value outside the subquery when you do ON (papercut.JobId = ricoh.JobId). This is likely the source of your cardinality estimate issue. If you have another way to obtain the JobId that doesn't involve using a function (or at least you can try an alternative function), then you might be able to resolve your issue.


You can try this logically equivalent query rewrite to see if it gives you a better execution plan. The only downside is that it's no longer a single-statement query and it leverages TempDB. I'm not sure what's ultimately consuming your query (a View, Stored Procedure, Report / Application?), so there might be some limitations that prevent you from doing this, but if not, this might solve your problem:

-- Materializing your logic for Ricoh and Papercut into their own TempTables
SELECT 
    a.SystemId as ClientId,
    a.ProcessInternalUid as JobId,
    a.price
INTO #Ricoh
FROM ricoh.printhistory AS a
WHERE a.Exportdate >= '2020-10-01'
    AND a.Exportdate <= '2020-12-12'

-- This materializes the SUBSTRING() function call's results into a temp table, so that when we
-- join on JobId later on, we're joining by the actual value itself rather than by a function call
SELECT SUBSTRING(b.job_comment, 16, 36) AS JobId 
INTO #Papercut
FROM  papercut.printer_usage_log AS b
WHERE b.usage_date >= '2020-10-01'
    AND b.usage_date <= '2020-12-12'
    AND b.printed = 'Y'
GROUP BY SUBSTRING(b.job_comment, 16, 36)

-- Indexing the aforementioned temp tables
CREATE CLUSTERED INDEX IX_Ricoh_JobId ON #Ricoh (JobId)
CREATE CLUSTERED INDEX IX_Papercut_JobId ON #Papercut (JobId)

-- Final select
SELECT 
    #Ricoh.ClientId,
    SUM(#Ricoh.price) AS cost
FROM #Ricoh
INNER JOIN #Papercut 
    ON (#Papercut.JobId = #Ricoh.JobId)
GROUP BY #Ricoh.ClientId

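If this ends up running repeatedly in the same session, you may also want to guard against the temp tables already existing before the INTO statements. A small sketch, using the pre-2016 pattern since you're on SQL Server 2012:

-- Drop the temp tables if a previous run in this session left them behind
IF OBJECT_ID('tempdb..#Ricoh') IS NOT NULL
    DROP TABLE #Ricoh;
IF OBJECT_ID('tempdb..#Papercut') IS NOT NULL
    DROP TABLE #Papercut;
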
If you are able to use this multi-statement query, then let me know how it works out, and you can also upload the new execution plan for our reference.

I'm still analyzing the differences between your execution plans and will advise a solution soon.


To view the Execution Plan you can enable it in SQL Server Management Studio like so (and then run your query):

Enable Actual Execution Plan
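
If you prefer to do it from T-SQL instead of the SSMS toolbar button, a sketch of the equivalent is below; it returns the actual execution plan as XML alongside the query results:

SET STATISTICS XML ON;

-- ... run the query you want to analyze here ...

SET STATISTICS XML OFF;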