SQL Server – Why Query Runtime Increases Dramatically When Made into Subquery

Tags: sql-server, subquery

I'm working on a query that selects counts from 2 servers and joins those together. Here is what I have:

SELECT
    a.NAME
    , SUM(ISNULL(b.ttl,0)) AS [Total]
FROM TableFunction(@user, DEFAULT) a
-- Selecting NAME from the table function takes <1s
LEFT JOIN (
    -- Sub-Query 1
    SELECT
        COUNT(*) ttl
        , NAME
    FROM Table2 c
    WHERE active=1
    GROUP BY NAME
    -- Sub-Query 1 (< 1s, 662 rows)
    UNION ALL
    -- Sub-Query 2
    SELECT
        COUNT(*) ttl
        , jt.NAME
    FROM OtherServer.dbase.dbo.Table3 d
    INNER JOIN OtherServer.dbase.dbo.JoiningTable jt ON d.ID=jt.ID
    WHERE ISNULL(d.DELETED,0)=0
    GROUP BY jt.NAME
    -- Sub-Query 2 (7s, 5576 rows)
    -- Full Sub-Query [1 UNION ALL 2] (7s, 6238 rows)
) b ON b.NAME=a.NAME
GROUP BY a.NAME
-- Full query (3:44, 91 rows)

The execution plan suggests that the problem comes from the remote query (Sub-Query 2).

[Screenshot: execution plan stats]

How can I improve the runtime of this? Why does a 7-second sub-query take over 3 minutes when joined to a basically instant query?

UPDATE:

After commenting out random things in hopes of finding something useful, the culprit appears to be specifically the WHERE ISNULL(d.DELETED,0)=0 line. Removing that line brings the query down to 4 seconds. Why would this help at all?
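
For reference, an equivalent way to write that filter without wrapping the remote column in a function, which can affect whether the predicate gets evaluated on the linked server, is:

-- Same semantics as ISNULL(d.DELETED,0)=0, but with no function
-- around the remote column
WHERE (d.DELETED = 0 OR d.DELETED IS NULL)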

UPDATE 2:

Playing around more, it seems the fact that the first table is a table-valued function may have been important. Selecting its result into a temp table and using that temp table brings the query down to 8 seconds (sketched below the function code). The table-valued function code is:

CREATE FUNCTION TableFunction (
    --Inputs
    @user varchar(50)
    , @includeDeleted BIT = 0
)
--Function
RETURNS @tbl TABLE (ID int
  , NAME varchar(max)
  , DESCRIPTION varchar(max))
AS
BEGIN
    -- Remote rows for the given user, excluding soft-deleted rows
    -- unless @includeDeleted = 1
    INSERT INTO @tbl (ID, NAME, DESCRIPTION)
    SELECT jt.ID, jt.NAME, jt.DESCRIPTION
    FROM OtherServer.dbase.dbo.JoiningTable jt
    WHERE jt.USER = (SELECT TOP 1 u.[USER]
                     FROM OtherServer.dbase.dbo.USERS u
                     WHERE u.user_name = @user)
    AND 1 = CASE WHEN @includeDeleted = 1 THEN 1 ELSE
        CASE WHEN ISNULL(jt.Deleted, 0) <> 1 THEN 1 ELSE 0 END
    END

    -- Local rows for the user that the remote query didn't already return
    INSERT INTO @tbl (ID, NAME, DESCRIPTION)
    SELECT rt.ID, rt.name, ''
    FROM RandomTable rt
    INNER JOIN RandomJoinTable rjt ON rjt.ID = rt.ID AND rjt.user = @user
    LEFT JOIN @tbl t ON t.ID = rt.ID
    WHERE 1 = CASE WHEN @includeDeleted = 1 THEN 1 ELSE
        CASE WHEN ISNULL(rt.Deleted, 0) = 0 THEN 1 ELSE 0 END
    END
    AND t.Name IS NULL

    RETURN
END
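
The temp-table workaround described in the update, as a minimal sketch (#fn is a hypothetical name):

-- Materialize the multi-statement TVF into a real temp table, which
-- gives the optimizer actual row counts to plan against
SELECT ID, NAME, DESCRIPTION
INTO #fn
FROM TableFunction(@user, DEFAULT);

-- ...then run the original query with "FROM #fn a" in place of
-- "FROM TableFunction(@user, DEFAULT) a"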

Best Answer

Querying across servers introduces a lot of uncertainty for the query optimizer because it knows very little about the remote end. For example, if you join a local table to a remote one, the optimizer has no idea how big the remote table is (10 rows? 1 million?), what indexes exist, or how the data is organized. Should it use an index to fetch 10 rows, or a full scan for 1 million?
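
One way to take that decision away from the local optimizer is to push the whole remote aggregate to the other server with OPENQUERY, so only the summary rows cross the wire. A sketch, assuming ad hoc pass-through queries are allowed against OtherServer:

-- The aggregate runs entirely on OtherServer; only the ~5,576
-- summary rows come back to the local server
SELECT ttl, NAME
FROM OPENQUERY(OtherServer, '
    SELECT COUNT(*) AS ttl, jt.NAME
    FROM dbase.dbo.Table3 d
    INNER JOIN dbase.dbo.JoiningTable jt ON d.ID = jt.ID
    WHERE ISNULL(d.DELETED, 0) = 0
    GROUP BY jt.NAME') AS remote_counts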

Generally, you want to minimize the amount of interaction and joining that has to happen between the two servers. Is there a filter you can put on JoiningTable to reduce the number of rows? Can you structure the query as two CTEs, one local and one remote, that only pull their results together at the end?
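
A sketch of that structure using the tables from the question (CTE names are illustrative; note that SQL Server inlines CTEs rather than materializing them, so if the plan doesn't improve, a temp table in place of remote_counts is the heavier-handed version of the same idea):

WITH local_counts AS (
    -- Runs entirely on the local server
    SELECT COUNT(*) AS ttl, NAME
    FROM Table2
    WHERE active = 1
    GROUP BY NAME
), remote_counts AS (
    -- Everything here lives on the linked server
    SELECT COUNT(*) AS ttl, jt.NAME
    FROM OtherServer.dbase.dbo.Table3 d
    INNER JOIN OtherServer.dbase.dbo.JoiningTable jt ON d.ID = jt.ID
    WHERE ISNULL(d.DELETED, 0) = 0
    GROUP BY jt.NAME
)
SELECT a.NAME
    , SUM(ISNULL(b.ttl, 0)) AS [Total]
FROM TableFunction(@user, DEFAULT) a
LEFT JOIN (
    SELECT ttl, NAME FROM local_counts
    UNION ALL
    SELECT ttl, NAME FROM remote_counts
) b ON b.NAME = a.NAME
GROUP BY a.NAME;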