Sql-server – Help required with query (join) optimization

optimization, performance, query-performance, sql-server

I am working with a fact table source query, and its performance is very poor. The run time went from 1:00 min to 6:30 min just by adding a function in the SELECT clause that converts the date format. The query only joins 7 tables on simple ON conditions (no crazy stuff).

Going forward I need to add a couple more tables to the join list, which will only make performance worse. I need to fine-tune the current query before I start adding to it.

Here is the query:

 SELECT [dbo].[dFK](oew.StartDate) AS StartDate, -- INTEGER DATE!
    [dbo].[dFK](oew.EndDate) AS EndDate,
    [dbo].[dFK](oew.EffectiveDate) AS EffectiveDate
FROM    OpenEnrollmentWindow oew
    INNER JOIN ProductYear py ON oew.OrganizationProductYearID = py.ID
    INNER JOIN Marketplace m ON py.MarketplaceID = m.ID
    INNER JOIN Organization o ON m.OrganizationID = o.ID
    INNER JOIN Consumer c ON c.OrganizationID = o.ID
    LEFT JOIN OpenEnrollmentWindowProduct oewp ON oew.ID = oewp.OrganizationOpenEnrollmentWindowID
    LEFT JOIN OpenEnrollmentWindowProductType oewpt ON oew.ID = oewpt.OrganizationOpenEnrollmentWindowID

Here is the definition of the function:

CREATE FUNCTION [dbo].[dFK]
(@dt as sql_variant)
RETURNS int
AS
BEGIN
    DECLARE @type varchar(128)  
    DECLARE @iDate int
    SET @type = CONVERT(varchar(128), SQL_VARIANT_PROPERTY(@dt, 'BaseType'))
    SET @iDate =
        CASE 
            WHEN @type = 'int' AND @dt >= 19000101 AND @dt <= 20451231 THEN CONVERT(int, @dt)
            WHEN @type = 'int' AND @dt < 19000101 OR @type = 'int' AND @dt > 20451231 THEN 1
            WHEN @dt IS NULL THEN 1
            WHEN (@dt < CAST('1900-01-01 00:00:00.000' AS DATETIME) OR @dt > CAST('2045-12-31 11:59:59.000' AS DATETIME)) AND @type = 'datetime' THEN 1
            WHEN (@dt < CAST('1900-01-01' AS DATE) OR @dt > CAST('2045-12-31' AS DATE)) AND @type = 'date' THEN 1
            ELSE FORMAT(CAST(@dt AS DATETIME2), 'yyyyMMdd')
        END
    RETURN @iDate
END

GO

This is used as a fact table source. The date is converted to avoid a reverse lookup against the date dimension. Let's just say it has to be converted on the server side. The query returns some 6 million rows. I understand that's quite a lot, which is why I am looking for query optimization suggestions here.

Best Answer

The problem is the use of scalar functions. These are executed once per reference per row, and the current internal implementation is such that this is very nearly as expensive as running a separate query per invocation (18 million times, for 6 million rows with three function references per row).

A quick solution is to convert the function to an in-line table-valued function. These are in-lined into the query text, much in the same way views are expanded before query optimization. So the first step is to translate the function to:

CREATE FUNCTION dbo.dFK_InLine
    (@dt as sql_variant)
RETURNS table
AS
RETURN
    SELECT
        ReturnValue =
        CASE 
            WHEN CA.datatype = 'int' 
                AND @dt >= 19000101 AND @dt <= 20451231
                THEN CONVERT(integer, @dt)
            WHEN (CA.datatype = 'int' AND @dt < 19000101) 
                OR (CA.datatype = 'int' AND @dt > 20451231)
                THEN 1
            WHEN @dt IS NULL 
            THEN 1
            WHEN (@dt < CAST('1900-01-01 00:00:00.000' AS DATETIME) 
                OR @dt > CAST('2045-12-31 11:59:59.000' AS DATETIME)) 
                AND CA.datatype = 'datetime' 
            THEN 1
            WHEN (@dt < CAST('1900-01-01' AS DATE) 
                OR @dt > CAST('2045-12-31' AS DATE)) 
                AND CA.datatype = 'date' 
            THEN 1
            ELSE FORMAT(CAST(@dt AS DATETIME2), 'yyyyMMdd')
        END
    FROM
    (
        VALUES
        (
            CONVERT(varchar(128), SQL_VARIANT_PROPERTY(@dt, 'BaseType'))
        )
    ) AS CA (datatype);

Then modify the source query to use it:

SELECT 
    SD.ReturnValue AS StartDate, -- INTEGER DATE!
    ED.ReturnValue AS EndDate,
    EFD.ReturnValue AS EffectiveDate
FROM OpenEnrollmentWindow oew
CROSS APPLY dbo.dFK_InLine(oew.StartDate) AS SD
CROSS APPLY dbo.dFK_InLine(oew.EndDate) AS ED
CROSS APPLY dbo.dFK_InLine(oew.EffectiveDate) AS EFD
INNER JOIN ProductYear py 
    ON oew.OrganizationProductYearID = py.ID
INNER JOIN Marketplace m 
    ON py.MarketplaceID = m.ID
INNER JOIN Organization o 
    ON m.OrganizationID = o.ID
INNER JOIN Consumer c 
    ON c.OrganizationID = o.ID
LEFT JOIN OpenEnrollmentWindowProduct oewp 
    ON oew.ID = oewp.OrganizationOpenEnrollmentWindowID
LEFT JOIN OpenEnrollmentWindowProductType oewpt 
    ON oew.ID = oewpt.OrganizationOpenEnrollmentWindowID;
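
Before swapping the function in, it may be worth checking that the in-line version returns the same values as the scalar one. The following is only a sketch: the table variable @samples, its column names, and the sample values are made up for illustration, and it assumes both dbo.dFK and dbo.dFK_InLine exist as defined above.

```sql
-- Hypothetical test data; @samples and its columns are illustrative names.
DECLARE @samples table
(
    label varchar(30) NOT NULL,
    val   sql_variant NULL
);

INSERT @samples (label, val)
VALUES
    ('int in range',  CONVERT(sql_variant, 20140131)),
    ('int too small', CONVERT(sql_variant, 18991231)),
    ('null',          CONVERT(sql_variant, NULL)),
    ('datetime',      CONVERT(sql_variant, CONVERT(datetime, '20140131'))),
    ('date',          CONVERT(sql_variant, CONVERT(date, '20140131')));

-- Call both versions side by side; the two result columns should agree
-- on every row.
SELECT
    s.label,
    ScalarResult = dbo.dFK(s.val),
    InlineResult = f.ReturnValue
FROM @samples AS s
CROSS APPLY dbo.dFK_InLine(s.val) AS f;
```

Any row where the two result columns differ would point to a translation mistake in the in-line function.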

All that said, this is still quite an...unusual strategy, particularly the use of sql_variant and the CASE logic. You might get better value from refactoring the design to use strong types and a more traditional model.
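
As a rough illustration of what a strongly typed version could look like, here is a sketch that assumes the three source columns are (or can be implicitly converted to) the date type. The name dbo.dFK_Date is hypothetical. It keeps the question's 1900 to 2045 range and its fallback value of 1, and uses integer arithmetic rather than FORMAT.

```sql
-- Hypothetical strongly typed variant; dbo.dFK_Date is an illustrative name.
-- Assumes callers pass date (or datetime) values rather than sql_variant.
CREATE FUNCTION dbo.dFK_Date
    (@dt date)
RETURNS table
AS
RETURN
    SELECT
        ReturnValue =
        CASE
            -- Same fallback as the original function: NULL and
            -- out-of-range dates map to 1.
            WHEN @dt IS NULL THEN 1
            WHEN @dt < CONVERT(date, '19000101')
                OR @dt > CONVERT(date, '20451231')
                THEN 1
            -- Build the yyyymmdd integer arithmetically.
            ELSE YEAR(@dt) * 10000 + MONTH(@dt) * 100 + DAY(@dt)
        END;
```

It would be used the same way as the in-line function above, for example CROSS APPLY dbo.dFK_Date(oew.StartDate) AS SD. Integer-typed inputs, which the sql_variant version also accepts, would need separate handling under this design.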

Related Solutions

Sql-server – How can row estimates be improved in order to reduce chances of spills to tempdb

I won't comment on spills, tempdb or hints, because the query seems too simple to need that much consideration. I think SQL Server's optimizer will do its job quite well if there are indexes suited to the query.

Splitting it into two queries is a good idea, as it shows which indexes will be useful. The first part:

(select convert(bigint, Value) NodeId
 from Oav.ValueArray
 where PropertyId = 3331  
   and ObjectId = 3540233
   and Sequence = 2)

needs an index on (PropertyId, ObjectId, Sequence) that includes Value. I'd make it UNIQUE to be safe. The query would throw an error at runtime anyway if more than one row were returned, so it is better to guarantee in advance, with a unique index, that this cannot happen:

CREATE UNIQUE INDEX
    PropertyId_ObjectId_Sequence_UQ
  ON Oav.ValueArray
    (PropertyId, ObjectId, Sequence) INCLUDE (Value) ;

The second part of the query:

select Value
  from Oav.ValueArray
 where ObjectId = @a               
   and PropertyId = 2840

needs an index on (PropertyId, ObjectId) that includes Value:

CREATE INDEX
    PropertyId_ObjectId_IX
  ON Oav.ValueArray
    (PropertyId, ObjectId) INCLUDE (Value) ;

If efficiency does not improve, the indexes are not used, or the row estimates are still off, then the query would need a closer look.

In that case, the conversions (required by the EAV design, which stores different datatypes in the same column) are a probable cause, and your solution of splitting the query into two parts (as @Aaron Bertrand and @Paul White comment) seems natural and the way to go. A redesign that stores each datatype in its own column might be another option.

Sql-server – Conversion of a varchar data type to a datetime data type resulted in an out-of-range value

I checked your profile and saw that you are in the UK. If your SQL Server login is set to use the dateformat dmy, then that explains your issue. Unless you use a 'T' instead of the space in the datetime string, SQL Server won't recognize it as ISO 8601 format.

Try this:

select count(*) 
  from dbo.profile 
  where [created] between convert(datetime,'2014-11-01T00:00:00.000') 
                      and convert(datetime,'2014-11-30T23:59:59.997');

Querying with dates and/or datetimes can be tricky. To make sure you are getting what you are looking for, I recommend reading:

Aaron Bertrand's Bad habits to kick : mis-handling date / range queries
Robyn Page's SQL Server DATE/TIME Workbench

Edit: to clarify, the out-of-range value in your error message comes from interpreting the month as 30 and the day as 11.
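
The behaviour can be reproduced with a short script. This is a sketch that assumes the session runs under DATEFORMAT dmy; the last statement shows a half-open date range with unseparated yyyymmdd literals, a different technique from the BETWEEN query above and the one the Aaron Bertrand article generally recommends.

```sql
-- Assumption: mimic a session whose login language implies dmy.
SET DATEFORMAT dmy;

-- With a space separator, datetime parsing depends on DATEFORMAT.
-- Under dmy the string is read as year-day-month, so this statement is
-- expected to fail with the out-of-range conversion error (month 30).
SELECT CONVERT(datetime, '2014-11-30 23:59:59.997');

-- With the 'T' separator the string is treated as ISO 8601 and is
-- interpreted the same way regardless of DATEFORMAT.
SELECT CONVERT(datetime, '2014-11-30T23:59:59.997');

-- Alternative: a half-open range avoids the end-of-day literal entirely.
-- Unseparated yyyymmdd literals are also independent of DATEFORMAT.
SELECT COUNT(*)
FROM dbo.profile
WHERE [created] >= '20141101'
  AND [created] <  '20141201';
```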
