SQL Server – Why does the subquery use parallelism when the join doesn’t?

join; sql-server; subquery

Why does SQL Server use parallelism when running this query with a subquery, but not when running the version that uses a join? The join version runs serially and takes around 30 times longer to complete.

Join version: ~30 seconds

Subquery version: < 1 second

EDIT:
XML versions of the query plans:

JOIN version

SUBQUERY version

Best Answer

As already indicated in the comments, it looks as though you need to update your statistics.
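A minimal sketch of refreshing the statistics, assuming the two tables are named location and testruns (as in the repro later in this answer). FULLSCAN is only one option here; sampled updates may be enough on larger tables.

```sql
-- Assumption: table names follow the repro in this answer; adjust to your schema.
-- Each statement refreshes every statistics object on the table by reading all rows.
UPDATE STATISTICS location WITH FULLSCAN;
UPDATE STATISTICS testruns WITH FULLSCAN;
```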

The estimated number of rows coming out of the join between location and testruns is hugely different between the two plans.

Join plan estimates: 1

Subquery plan estimates: 8,748

The actual number of rows coming out of the join is 14,276.

Of course, it makes no intuitive sense that the join version should estimate that 3 rows come from location and produce a single joined row, whereas the subquery version estimates that a single one of those rows will produce 8,748 rows from the same join. Nonetheless, I was able to reproduce this.

This seems to happen if the histograms on the two join columns do not overlap at all when the statistics are created. The join version then assumes a single row, while the single equality seek in the subquery version gets the same estimate as an equality seek against an unknown variable.

The cardinality of testruns is 26,244. Assuming it is populated with three distinct location ids, the following query estimates that 8,748 rows will be returned (26,244 / 3):

declare @i int

SELECT *
FROM   testruns AS tr
WHERE  tr.location_id = @i
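The 8,748 figure can be checked by hand. As a rough sketch (assuming the clustered index on testruns is named IX, as in the repro later in this answer), the density vector exposes the "All density" value that is multiplied by the table cardinality when the compared value is unknown:

```sql
-- "All density" for location_id is 1 / (number of distinct values),
-- as recorded when the statistics were last updated.
DBCC SHOW_STATISTICS ('testruns', IX) WITH DENSITY_VECTOR;

-- The same arithmetic done directly against the current data:
-- total rows / distinct values. With 26,244 rows and 3 distinct ids
-- this is 26244 / 3 = 8748.
SELECT COUNT(*) * 1.0 / COUNT(DISTINCT location_id) AS estimated_rows_per_id
FROM   testruns;
```

The second query reads the table itself rather than the statistics, so the two can disagree if the statistics are stale.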

Given that the location table contains only 3 rows, it is easy (if we assume no foreign keys) to contrive a situation where the statistics are created and the data is then altered in a way that dramatically affects the actual number of rows returned but is insufficient to trip the auto-update of statistics and the recompile threshold.
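One way to see how many modifications a statistics object has absorbed since it was last updated is sys.dm_db_stats_properties. This is only a sketch: it assumes a build that exposes the function (SQL Server 2008 R2 SP2, 2012 SP1, or later) and that both tables live in the dbo schema.

```sql
-- Assumptions: dbo schema; table names as used in this answer.
-- modification_counter counts changes to the leading statistics column
-- since the statistics were last updated.
SELECT s.name AS stats_name,
       sp.last_updated,
       sp.rows,
       sp.modification_counter
FROM   sys.stats AS s
       CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE  s.object_id IN ( OBJECT_ID(N'dbo.location'), OBJECT_ID(N'dbo.testruns') );
```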

Because SQL Server gets the number of rows coming out of that join so wrong, all the other row estimates in the join plan are massively underestimated. As well as giving you a serial plan, this means the query gets an insufficient memory grant, and the sorts and hash joins spill to tempdb.

One possible scenario that reproduces the actual vs estimated rows shown in your plan is below.

CREATE TABLE location
  (
     id       INT CONSTRAINT locationpk PRIMARY KEY,
     location VARCHAR(MAX) /*Judging by the separate filter, I think you are using max?*/
  )

/*Temporary ids these will be updated later*/
INSERT INTO location
VALUES      (101, 'Coventry'),
            (102, 'Nottingham'),
            (103, 'Derby')

CREATE TABLE testruns
  (
     location_id INT
  )

CREATE CLUSTERED INDEX IX ON testruns(location_id)

/*Add in 26244 rows of data split over three ids*/
INSERT INTO testruns
SELECT TOP (5984) 1
FROM   master..spt_values v1, master..spt_values v2
UNION ALL
SELECT TOP (5984) 2
FROM   master..spt_values v1, master..spt_values v2
UNION ALL
SELECT TOP (14276) 3
FROM   master..spt_values v1, master..spt_values v2

/*Create statistics. The location_id histograms don't intersect at all*/
UPDATE STATISTICS location(locationpk) WITH FULLSCAN;    
UPDATE STATISTICS testruns(IX) WITH FULLSCAN;

/* UPDATE location.id. Three row update is below recompile threshold*/
UPDATE location
SET    id = id - 100

Running the following queries then gives the same estimated vs. actual discrepancy:

SELECT *
FROM   testruns AS tr
WHERE  tr.location_id = (SELECT id
                         FROM   location
                         WHERE  location = 'Derby')

SELECT *
FROM   testruns AS tr
       JOIN location loc
         ON tr.location_id = loc.id
WHERE  loc.location = ( 'Derby' )
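To compare estimated and actual row counts for these queries without a graphical tool, one option is to capture the actual plan as XML. A sketch using the join query from the repro above:

```sql
-- Returns the actual execution plan as an extra XML result set.
-- In the plan XML, compare the EstimateRows attribute of each operator
-- with the ActualRows values under RunTimeInformation.
SET STATISTICS XML ON;

SELECT *
FROM   testruns AS tr
       JOIN location loc
         ON tr.location_id = loc.id
WHERE  loc.location = 'Derby';

SET STATISTICS XML OFF;
```

Running it before and after updating the statistics on location should show whether the estimate for the join moves closer to the actual row count.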

Related Solutions

MySQL – Query performance with subquery and IN clause

UPDATE 2012-01-12 14:03 EDT

I refactored it again to make sure the readings keys and boards keys are combined correctly before retrieving the data from the readings table:

SELECT 
    readings.* 
FROM 
    ( 
        SELECT A.* FROM
        (
            SELECT boxsn FROM readings 
            WHERE (time >= 1325404800)  
            AND (time < 1326317400)  
            ORDER BY `time` ASC
        ) A
        LEFT JOIN
        (
            SELECT id AS boxsn
            FROM boards
            WHERE siteId = '1'
        ) B
        USING (boxsn)
        WHERE B.boxsn IS NOT NULL
    ) readings_keys 
    LEFT JOIN readings 
    USING (boxsn) 
;

SQL Server – Why does SQL Server run a subquery for each row of the table it’s qualifying

The slow plan isn't calculating the MAX for each row in the outer query.

In fact it never explicitly calculates it at all.

It gives a plan similar to the one for this query:

WITH CTE
     AS (SELECT TOP(1) WITH TIES *
         FROM   SubqueryTest
         WHERE year IS NOT NULL
         ORDER  BY year desc)
SELECT month,
       count(*)
FROM   CTE
GROUP  BY month

Slow Plan (Estimated Row Counts)

You have a non-covering index on year asc, so it scans that backwards to get the rows for the highest year (this shows as a seek because of the implicit IS NOT NULL predicate).

Unfortunately it doesn't seem to differentiate between TOP 1 and TOP 1 WITH TIES when estimating row counts.

In this case it makes a huge difference (an estimated 2 key lookups vs. an actual 4,424,803), so you get an inappropriate plan.

Slow Plan (Actual Row Counts)

You could consider adding month into the index on year either as a key or included column to make the index covering. The benefit of adding it as a secondary key column would be that it could then feed into a stream aggregate without an additional sort (though for only 12 distinct values a hash aggregate would be fine anyway).
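The two indexing options just described could look roughly like this. The index names are hypothetical, and since the name of the existing index on year is not given, these are written as new indexes; in practice you would create only one of them, probably in place of the existing index.

```sql
-- Option 1 (hypothetical name): month as a second key column.
-- Rows for a given year come back ordered by month, which can feed
-- a stream aggregate without an extra sort.
CREATE NONCLUSTERED INDEX IX_SubqueryTest_year_month
    ON SubqueryTest ([year], [month]);

-- Option 2 (hypothetical name): month as an included column.
-- The index covers the query, but rows are not ordered by month.
CREATE NONCLUSTERED INDEX IX_SubqueryTest_year_incl_month
    ON SubqueryTest ([year]) INCLUDE ([month]);
```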

A non-covering index on such a non-selective column is pretty useless for the vast majority of queries. The index is totally ignored by the "fast" plan, which ends up doing a parallel scan of the whole table and evaluating the predicate on all 27,445,400 rows (in preference to performing the huge number of lookups).
