Sql-server – Running functions in parallel

functionsperformancesql serversql-server-2012

This is for SQL Server 2012.

We have some import processes for FTP files that are picked up and read into a staging table, from there we massage/check the data before moving into production. One of the areas that are causing some issues is dates, some are valid, some are typos, some are just plain gibberish.

I have the following example table(s):

Create Table RawData
(
 InsertID int not null,
 MangledDateTime1 varchar(10) null,
 MangledDateTime2 varchar(10) null,
 MangledDateTime3 varchar(10) null
)

I also have a destination table(say in production)

Create Table FinalData
(
  PrimaryKeyID int not null, -- PK constraint here, ident
  ForeighKeyID int not null, -- points to InsertID of RawData
  ValidDateTime1 SmallDateTime null,
  ValidDateTime2 SmallDateTime null,
  ValidDateTime3 SmallDateTime null
)

I insert the following into the RawData table:

 Insert Into RawData(InsertID, MangledDateTime1, MangledDateTime2, MangledDateTime3)
 Values(1, '20001010', '20800630', '00000000') -- First is legit, second two are not
 Insert Into RawData(InsertID, MangledDateTime1, MangledDateTime2, MangledDateTime3)
 Values(1, '20800630', '20130630', '20000000') -- middle is legit, first/third are not
 Insert Into RawData(InsertID, MangledDateTime1, MangledDateTime2, MangledDateTime3)
 Values(1, '00001010', '00800630', '20130630') -- Last is legit, first two are not

I wrote a function dbo.CreateDate to address the issue(s). We try to clean the data as best we can (use NULL if we can't), then convert the data to the correct datatype (in this case smalldatetime).

Insert Into FinalData(ForeighKeyID , ValidDateTime1, ValidDateTime2, ValidDateTime3)
Select 
 InsertID
 ,dbo.CreateDate(MangledDateTime1)
 ,dbo.CreateDate(MangledDateTime2)
 ,dbo.CreateDate(MangledDateTime3)
From RawData

We're running into some performance problems with functions. I'm wondering if/how they work in parallel.

I'm assuming here that the function CreateDate is being run in parallel as each row inserts. Such that each column/value has it's "own" function and is running at the same time as it inserts.

But I could be wrong, is it running serial over each column in each row as it inserts?

CreateDate() code:

Alter Function dbo.CreateDate
(
@UnformattedString  varchar(12)
)
Returns smalldatetime
As
Begin
Declare @FormattedDate smalldatetime

If(@UnformattedString Is Not Null)
Begin
    Declare @MaxSmallDate varchar(8) = '20790606'


    -- We got gibberish
    If Len(@UnformattedString) = 1
    Begin
        return null
    End

    -- To account for date and time
    If Len(@UnformattedString) = 12
    Begin
        Select @UnformattedString = Substring(@UnformattedString, 0,9)
    End

    If @UnformattedString = '20000000'
    Begin
        Select @UnformattedSTring = @MaxSmallDate
    End

    -- Some people are sending us two digit years, won't parse right
    If Substring(@UnformattedString,0,3) = '00'
    Begin
        Select @UnformattedString = Replace(@UnformattedString, '00','20')
    End

    -- Some people are fat fingering in people born in 18??, so change to 19??
    If Substring(@UnformattedString,0,3) in ('18')
    Begin
        -- We only want to change the year '18', not day 18 
        SELECT @UnformattedString = STUFF(@UnformattedString, 
                           CHARINDEX('18', @UnformattedString), 2, '19')
    End

    -- We're getting gibberish
    If Substring(@UnformattedString,0,3) not in ('19','20') 
               And Len(@UnformattedString) != 6
    Begin
        Select @UnformattedString = Replace(@UnformattedString, 
                       Substring(@UnformattedString,0,3),'20')
    End

    -- If the 4 digit year is greater than current year, set to max date
    If Convert(int, Substring(@UnformattedString,0,5)) > Year(getdate())
    Begin
        Set @FormattedDate = CONVERT(smalldatetime,@MaxSmallDate,1)
    End
    -- If the 4 digit year is less than 100 years ago, set to max date
    Else If Year(getdate()) - Convert(int, Substring(@UnformattedString,0,5)) >= 100
    Begin
        Set @FormattedDate = CONVERT(smalldatetime,@MaxSmallDate,1)
    End
    Else -- valid date(we hope)
    Begin
        Set @FormattedDate = CONVERT(smalldatetime,@UnformattedString,1) 
    End

    
    
End

Return @FormattedDate
End
Go

Best Answer

Using T-SQL scalar functions will frequently lead to performance problems* because SQL Server makes a separate function call (using a whole new T-SQL context) for each row. In addition, parallel execution is disallowed for the whole query.

T-SQL scalar functions can also make it difficult to troubleshoot performance problems (whether those problems are caused by the function or not). The function appears as a 'black box' to the query optimizer: it is assigned a fixed low estimated cost, regardless of the actual content of the function.

See this and this for more on the pitfalls of scalar functions.

You will probably be better off using the new TRY_CONVERT function in SQL Server 2012:

SELECT
    InsertID,
    dt1 = TRY_CONVERT(smalldatetime, MangledDateTime1),
    dt2 = TRY_CONVERT(smalldatetime, MangledDateTime2),
    dt3 = TRY_CONVERT(smalldatetime, MangledDateTime3)
FROM dbo.RawData;

╔══════════╦═════════════════════╦═════════════════════╦═════════════════════╗
║ InsertID ║         dt1         ║         dt2         ║         dt3         ║
╠══════════╬═════════════════════╬═════════════════════╬═════════════════════╣
║        1 ║ 2000-10-10 00:00:00 ║ NULL                ║ NULL                ║
║        1 ║ NULL                ║ 2013-06-30 00:00:00 ║ NULL                ║
║        1 ║ NULL                ║ NULL                ║ 2013-06-30 00:00:00 ║
╚══════════╩═════════════════════╩═════════════════════╩═════════════════════╝

After the edit to the question

I see the function contains some specific logic. You could still look to use TRY_CONVERT as part of that, but you should definitely convert the scalar function to an in-line function. In-line functions (RETURNS TABLE) use a single SELECT statement and are expanded into the calling query and fully optimized in much the same way views are. It can be helpful to think of in-line functions as parameterized views.

For example, an approximate translation of the scalar function to an in-line version is:

CREATE FUNCTION dbo.CleanDate
    (@UnformattedString  varchar(12))
RETURNS TABLE
AS RETURN
SELECT Result =
    -- Successful conversion or NULL after
    -- workarounds applied in CROSS APPLY
    -- clauses below
    TRY_CONVERT(smalldatetime, ca3.string)
FROM
(
    -- Logic starts here
    SELECT        
        CASE
            WHEN @UnformattedString IS NULL
                THEN NULL
            WHEN LEN(@UnformattedString) <= 1
                THEN NULL
            WHEN LEN(@UnformattedString) = 12
                THEN LEFT(@UnformattedString, 8)
            ELSE @UnformattedString
        END
) AS Input (string)
CROSS APPLY
(
    -- Next stage using result so far
    SELECT 
        CASE 
            WHEN @UnformattedString = '20000000' 
            THEN '20790606' 
            ELSE Input.string
        END
) AS ca1 (string)
CROSS APPLY 
(
    -- Next stage using result so far
    SELECT CASE
        WHEN LEFT(ca1.string, 2) = '00' THEN '20' + RIGHT(ca1.string, 6)
        WHEN LEFT(ca1.string, 2) = '18' THEN '19' + RIGHT(ca1.string, 6)
        WHEN LEFT(ca1.string, 2) = '19' THEN ca1.string
        WHEN LEFT(ca1.string, 2) = '20' THEN ca1.string
        WHEN LEN(ca1.string) <> 6 THEN '20' + RIGHT(ca1.string, 6)
        ELSE ca1.string
    END
) AS ca2 (string)
CROSS APPLY
(
    -- Next stage using result so far
    SELECT
        CASE 
            WHEN TRY_CONVERT(integer, LEFT(ca2.string, 4)) > YEAR(GETDATE())
                THEN '20790606'
            WHEN YEAR(GETDATE()) - TRY_CONVERT(integer, LEFT(ca2.string, 4)) >= 100
                THEN '20790606'
            ELSE ca2.string
        END
) AS ca3 (string);

The function used on the sample data:

SELECT
    InsertID,
    Result1 = CD1.Result,
    Result2 = CD2.Result,
    Result3 = CD3.Result
FROM dbo.RawData AS RD
CROSS APPLY dbo.CleanDate(RD.MangledDateTime1) AS CD1
CROSS APPLY dbo.CleanDate(RD.MangledDateTime2) AS CD2
CROSS APPLY dbo.CleanDate(RD.MangledDateTime3) AS CD3;

Output:

╔══════════╦═════════════════════╦═════════════════════╦═════════════════════╗
║ InsertID ║       Result1       ║       Result2       ║       Result3       ║
╠══════════╬═════════════════════╬═════════════════════╬═════════════════════╣
║        1 ║ 2000-10-10 00:00:00 ║ 2079-06-06 00:00:00 ║ NULL                ║
║        1 ║ 2079-06-06 00:00:00 ║ 2013-06-30 00:00:00 ║ 2079-06-06 00:00:00 ║
║        1 ║ 2000-10-10 00:00:00 ║ 2079-06-06 00:00:00 ║ 2013-06-30 00:00:00 ║
╚══════════╩═════════════════════╩═════════════════════╩═════════════════════╝

*CLR scalar functions have a much faster invocation path than T-SQL scalar functions and do not prevent parallelism.