Sql-server – Optimize select on subquery with COALESCE(…)

optimizationsql serversubqueryt-sql

I have a large view that I use from within an application. I think I've narrowed down my performance problem, but I'm unsure how to fix it. A simplified version of the view looks like this:

SELECT ISNULL(SEId + '-' + PEId, '0-0') AS Id,
   *,
   DATEADD(minute, Duration, EventTime) AS EventEndTime
FROM (
    SELECT se.SEId, pe.PEId,
        COALESCE(pe.StaffName, se.StaffName) AS StaffName, -- << Problem!
        COALESCE(pe.EventTime, se.EventTime) AS EventTime,
        COALESCE(pe.EventType, se.EventType) AS EventType,
        COALESCE(pe.Duration, se.Duration) AS Duration,
        COALESCE(pe.Data, se.Data) AS Data,
        COALESCE(pe.Field, se.Field) AS Field,
        pe.ThisThing, se.OtherThing
    FROM PE pe FULL OUTER JOIN SE se 
      ON pe.StaffName = se.StaffName
     AND pe.Duration = se.Duration
     AND pe.EventTime = se.EventTime
    WHERE NOT(pe.ThisThing = 1 AND se.OtherThing = 0)
) Z

That probably doesn't justify the whole reason for the query structure, but maybe gives you an idea–this view joins two very poorly designed tables that I don't have control over and tries to synthesize some information out of it.

So, since this is a view used from the application, while trying to optimize I wrap it in another SELECT, like this:

SELECT * FROM (
    -- … above code …
) Q
WHERE StaffName = 'SMITH, JOHN Q'

because the application is searching for specific staff members in the result.

The problem seems to be the COALESCE(pe.StaffName, se.StaffName) AS StaffName section, and that I'm selecting from the view on StaffName. If I change that to pe.StaffName AS StaffName or se.StaffName AS StaffName, the performance problems disappear (but see updated 2 below). But that won't do because one side or the other of the FULL OUTER JOIN could be missing, so one or the other field may be NULL.

Can I refactor this replacing the COALESCE(…) with something else, which will get rewritten down into the subquery?

Other notes:

I've already added some indexes to fix performance problems with the rest of the query–without the COALESCE it is very fast.
Somewhat to my surprise, looking at the execution plan does not raise any flags, even when the wrapping subquery and WHERE statement is included. My total subquery cost in the analyzer is 0.0065736. Hmph. It takes four seconds to execute.
Changing the application to query differently ~~(e.g. returning pe.StaffName AS PEStaffName, se.StaffName AS SEStaffName and doing WHERE PEStaffName = 'X' OR SEStaffName = 'X')~~ might work, but as a last resort–I'm really hoping I can optimize the view without having to resort to touching the application.
A stored procedure would probably make more sense for this, but the application is built with Entity Framework, and I could not figure out how to get it to play nice with a SP that returns a table type (another topic entirely).

Indexes

The indexes I've added so far look something like this:

CREATE NONCLUSTERED INDEX [IX_PE_EventTime]
ON [dbo].[PE] ([EventTime])
INCLUDE ([StaffName],[Duration],[EventType],[Data],[Field],[ThisThing])

CREATE NONCLUSTERED INDEX [IX_SE_EventTime]
ON [dbo].[SE] ([EventTime])
INCLUDE ([StaffName],[Duration],[EventType],[Data],[Field],[OtherThing])

Update

Hmm…I tried simulating the stricken change above, and it didn't help. I.e, before ) Z above, I added AND (pe.StaffName = 'SMITH, JOHN Q' OR se.StaffName = 'SMITH, JOHN Q'), but the performance is the same. Now I really don't know where to start.

Update 2

@ypercube 's comment on needing the full join made me realize that my synthesized query left out a probably important component. While, yes, I need the full join, the test I did above by dropping the COALESCE and testing just one side of the join for a non-null value would make the other side of the full join irrelevant, and the optimizer was probably using this fact to speed up the query. Also, I've updated the example to show that StaffName is actually one of the join keys–which probably has significant bearing on the question. I'm also now leaning toward his suggestion that breaking this into a three-way union instead of full join may be the answer, and will simplify the abundance of COALESCEs I'm doing anyway. Trying it now.

Best Answer

This was rather longshot but since the OP says it worked, I'm adding it as an answer (feel free to correct it if you find anything wrong).

Try to break the internal query into three parts (INNER JOIN, LEFT JOIN with WHERE IS NULL check, RIGHT JOIN with IS NULL check) and then UNION ALL the three parts. This has the following advantages:

The optimizer has less transformation options available for FULL joins than for (the more common) INNER and LEFT joins.
The Z derived table can be removed (you can do that anyway) from the view definition.
The NOT(pe.ThisThing = 1 AND se.OtherThing = 0) will be needed only on the INNER join part.
Minor improvement, the use COALESCE() will be minimal if any at all (I assumed that se.SEId and pe.PEId are not nullable. If more columns are not nullable, you'll be able to remove more COALESCE() calls.)
More important, the optimizer may push down any conditions in your queries that involve these columns (now that COALESCE() is not blocking the push.)
All the above will give the optimizer more options to transform/rewrite any query that uses the view so it may find an execution plan that indexes on the underlying tables can be used.

In all, the view can be written as:

SELECT 
    se.SEId + '-' + pe.PEId AS Id,
    se.SEId, pe.PEId,
    pe.StaffName, 
    pe.EventTime,
    COALESCE(pe.EventType, se.EventType) AS EventType,
    pe.Duration,
    COALESCE(pe.Data, se.Data) AS Data,
    COALESCE(pe.Field, se.Field) AS Field,
    pe.ThisThing, se.OtherThing,
    DATEADD(minute, pe.Duration, pe.EventTime) AS EventEndTime
FROM PE pe INNER JOIN SE se 
  ON pe.StaffName = se.StaffName
 AND pe.Duration = se.Duration
 AND pe.EventTime = se.EventTime
WHERE NOT (pe.ThisThing = 1 AND se.OtherThing = 0) 

UNION ALL

SELECT 
    '0-0',
    NULL, pe.PEId,
    pe.StaffName, 
    pe.EventTime,
    pe.EventType,
    pe.Duration,
    pe.Data,
    pe.Field,
    pe.ThisThing, NULL,
    DATEADD(minute, pe.Duration, pe.EventTime) AS EventEndTime
FROM PE pe LEFT JOIN SE se 
  ON pe.StaffName = se.StaffName
 AND pe.Duration = se.Duration
 AND pe.EventTime = se.EventTime
WHERE NOT (pe.ThisThing = 1)
  AND se.StaffName IS NULL

UNION ALL

SELECT 
    '0-0',
    se.SEId, NULL,
    se.StaffName, 
    se.EventTime,
    se.EventType,
    se.Duration,
    se.Data,
    se.Field,
    NULL, se.OtherThing, 
    DATEADD(minute, se.Duration, se.EventTime) AS EventEndTime
FROM PE pe RIGHT JOIN SE se 
  ON pe.StaffName = se.StaffName
 AND pe.Duration = se.Duration
 AND pe.EventTime = se.EventTime
WHERE NOT (se.OtherThing = 0)
  AND pe.StaffName IS NULL ;

UPDATE 2012-01-12 14:03 EDT

I refactored it again to make sure the readings keys and boards keys are combined correctly before retrieving the data from the readings table:

SELECT 
    readings.* 
FROM 
    ( 
        SELECT A.* FROM
        (
            SELECT boxsn FROM readings 
            WHERE (time >= 1325404800)  
            AND (time < 1326317400)  
            ORDER BY `time` ASC
        ) A
        LEFT JOIN
        (
            SELECT id AS boxsn
            FROM boards
            WHERE siteId = '1'
        ) B
        USING (boxsn)
        WHERE B.boxsn IS NOT NULL
    ) readings_keys 
    LEFT JOIN readings 
    USING (boxsn) 
;

How to get performance benefits from a view vs subquery

The case where you can achieve performance benefit using a view (or common table expression = "inline view") instead of a subquery is if you have to repeat the same subquery several times in your query.

If instead you can replace each subquery with the same view name (or named common table expression) then the optimizer knows it's the same thing and can either cache the result or restructure the execution path so that the result only needs to be queried once.

Indexes

Update

Update 2

Best Answer

Related Solutions

Mysql – Query performance with subquery and IN clause

UPDATE 2012-01-12 14:03 EDT

How to get performance benefits from a view vs subquery

Related Question