This is for SQL Server 2012.
We have some import processes for FTP files that are picked up and read into a staging table, from there we massage/check the data before moving into production. One of the areas that are causing some issues is dates, some are valid, some are typos, some are just plain gibberish.
I have the following example table(s):
Create Table RawData
(
InsertID int not null,
MangledDateTime1 varchar(10) null,
MangledDateTime2 varchar(10) null,
MangledDateTime3 varchar(10) null
)
I also have a destination table(say in production)
Create Table FinalData
(
PrimaryKeyID int not null, -- PK constraint here, ident
ForeighKeyID int not null, -- points to InsertID of RawData
ValidDateTime1 SmallDateTime null,
ValidDateTime2 SmallDateTime null,
ValidDateTime3 SmallDateTime null
)
I insert the following into the RawData table:
Insert Into RawData(InsertID, MangledDateTime1, MangledDateTime2, MangledDateTime3)
Values(1, '20001010', '20800630', '00000000') -- First is legit, second two are not
Insert Into RawData(InsertID, MangledDateTime1, MangledDateTime2, MangledDateTime3)
Values(1, '20800630', '20130630', '20000000') -- middle is legit, first/third are not
Insert Into RawData(InsertID, MangledDateTime1, MangledDateTime2, MangledDateTime3)
Values(1, '00001010', '00800630', '20130630') -- Last is legit, first two are not
I wrote a function dbo.CreateDate
to address the issue(s). We try to clean the data as best we can (use NULL
if we can't), then convert the data to the correct datatype (in this case smalldatetime
).
Insert Into FinalData(ForeighKeyID , ValidDateTime1, ValidDateTime2, ValidDateTime3)
Select
InsertID
,dbo.CreateDate(MangledDateTime1)
,dbo.CreateDate(MangledDateTime2)
,dbo.CreateDate(MangledDateTime3)
From RawData
We're running into some performance problems with functions. I'm wondering if/how they work in parallel.
I'm assuming here that the function CreateDate
is being run in parallel as each row inserts. Such that each column/value has it's "own" function and is running at the same time as it inserts.
But I could be wrong, is it running serial over each column in each row as it inserts?
CreateDate() code:
Alter Function dbo.CreateDate
(
@UnformattedString varchar(12)
)
Returns smalldatetime
As
Begin
Declare @FormattedDate smalldatetime
If(@UnformattedString Is Not Null)
Begin
Declare @MaxSmallDate varchar(8) = '20790606'
-- We got gibberish
If Len(@UnformattedString) = 1
Begin
return null
End
-- To account for date and time
If Len(@UnformattedString) = 12
Begin
Select @UnformattedString = Substring(@UnformattedString, 0,9)
End
If @UnformattedString = '20000000'
Begin
Select @UnformattedSTring = @MaxSmallDate
End
-- Some people are sending us two digit years, won't parse right
If Substring(@UnformattedString,0,3) = '00'
Begin
Select @UnformattedString = Replace(@UnformattedString, '00','20')
End
-- Some people are fat fingering in people born in 18??, so change to 19??
If Substring(@UnformattedString,0,3) in ('18')
Begin
-- We only want to change the year '18', not day 18
SELECT @UnformattedString = STUFF(@UnformattedString,
CHARINDEX('18', @UnformattedString), 2, '19')
End
-- We're getting gibberish
If Substring(@UnformattedString,0,3) not in ('19','20')
And Len(@UnformattedString) != 6
Begin
Select @UnformattedString = Replace(@UnformattedString,
Substring(@UnformattedString,0,3),'20')
End
-- If the 4 digit year is greater than current year, set to max date
If Convert(int, Substring(@UnformattedString,0,5)) > Year(getdate())
Begin
Set @FormattedDate = CONVERT(smalldatetime,@MaxSmallDate,1)
End
-- If the 4 digit year is less than 100 years ago, set to max date
Else If Year(getdate()) - Convert(int, Substring(@UnformattedString,0,5)) >= 100
Begin
Set @FormattedDate = CONVERT(smalldatetime,@MaxSmallDate,1)
End
Else -- valid date(we hope)
Begin
Set @FormattedDate = CONVERT(smalldatetime,@UnformattedString,1)
End
End
Return @FormattedDate
End
Go
Best Answer
Using T-SQL scalar functions will frequently lead to performance problems* because SQL Server makes a separate function call (using a whole new T-SQL context) for each row. In addition, parallel execution is disallowed for the whole query.
T-SQL scalar functions can also make it difficult to troubleshoot performance problems (whether those problems are caused by the function or not). The function appears as a 'black box' to the query optimizer: it is assigned a fixed low estimated cost, regardless of the actual content of the function.
See this and this for more on the pitfalls of scalar functions.
You will probably be better off using the new TRY_CONVERT function in SQL Server 2012:
After the edit to the question
I see the function contains some specific logic. You could still look to use
TRY_CONVERT
as part of that, but you should definitely convert the scalar function to an in-line function. In-line functions (RETURNS TABLE
) use a singleSELECT
statement and are expanded into the calling query and fully optimized in much the same way views are. It can be helpful to think of in-line functions as parameterized views.For example, an approximate translation of the scalar function to an in-line version is:
The function used on the sample data:
Output:
*CLR scalar functions have a much faster invocation path than T-SQL scalar functions and do not prevent parallelism.