Using T-SQL scalar functions will frequently lead to performance problems* because SQL Server makes a separate function call (using a whole new T-SQL context) for each row. In addition, parallel execution is disallowed for the whole query.
T-SQL scalar functions can also make it difficult to troubleshoot performance problems (whether those problems are caused by the function or not). The function appears as a 'black box' to the query optimizer: it is assigned a fixed low estimated cost, regardless of the actual content of the function.
See this and this for more on the pitfalls of scalar functions.
You will probably be better off using the new TRY_CONVERT function in SQL Server 2012:
SELECT
InsertID,
dt1 = TRY_CONVERT(smalldatetime, MangledDateTime1),
dt2 = TRY_CONVERT(smalldatetime, MangledDateTime2),
dt3 = TRY_CONVERT(smalldatetime, MangledDateTime3)
FROM dbo.RawData;
╔══════════╦═════════════════════╦═════════════════════╦═════════════════════╗
║ InsertID ║ dt1                 ║ dt2                 ║ dt3                 ║
╠══════════╬═════════════════════╬═════════════════════╬═════════════════════╣
║        1 ║ 2000-10-10 00:00:00 ║ NULL                ║ NULL                ║
║        1 ║ NULL                ║ 2013-06-30 00:00:00 ║ NULL                ║
║        1 ║ NULL                ║ NULL                ║ 2013-06-30 00:00:00 ║
╚══════════╩═════════════════════╩═════════════════════╩═════════════════════╝
After the edit to the question
I see the function contains some specific logic. You could still use TRY_CONVERT as part of that, but you should definitely convert the scalar function to an in-line function. In-line functions (RETURNS TABLE) use a single SELECT statement and are expanded into the calling query and fully optimized, in much the same way views are. It can be helpful to think of in-line functions as parameterized views.
For example, an approximate translation of the scalar function to an in-line version is:
CREATE FUNCTION dbo.CleanDate
(@UnformattedString varchar(12))
RETURNS TABLE
AS RETURN
SELECT Result =
-- Successful conversion or NULL after
-- workarounds applied in CROSS APPLY
-- clauses below
TRY_CONVERT(smalldatetime, ca3.string)
FROM
(
-- Logic starts here
SELECT
CASE
WHEN @UnformattedString IS NULL
THEN NULL
WHEN LEN(@UnformattedString) <= 1
THEN NULL
WHEN LEN(@UnformattedString) = 12
THEN LEFT(@UnformattedString, 8)
ELSE @UnformattedString
END
) AS Input (string)
CROSS APPLY
(
-- Next stage using result so far
SELECT
CASE
WHEN @UnformattedString = '20000000'
THEN '20790606'
ELSE Input.string
END
) AS ca1 (string)
CROSS APPLY
(
-- Next stage using result so far
SELECT CASE
WHEN LEFT(ca1.string, 2) = '00' THEN '20' + RIGHT(ca1.string, 6)
WHEN LEFT(ca1.string, 2) = '18' THEN '19' + RIGHT(ca1.string, 6)
WHEN LEFT(ca1.string, 2) = '19' THEN ca1.string
WHEN LEFT(ca1.string, 2) = '20' THEN ca1.string
WHEN LEN(ca1.string) <> 6 THEN '20' + RIGHT(ca1.string, 6)
ELSE ca1.string
END
) AS ca2 (string)
CROSS APPLY
(
-- Next stage using result so far
SELECT
CASE
WHEN TRY_CONVERT(integer, LEFT(ca2.string, 4)) > YEAR(GETDATE())
THEN '20790606'
WHEN YEAR(GETDATE()) - TRY_CONVERT(integer, LEFT(ca2.string, 4)) >= 100
THEN '20790606'
ELSE ca2.string
END
) AS ca3 (string);
The function used on the sample data:
SELECT
InsertID,
Result1 = CD1.Result,
Result2 = CD2.Result,
Result3 = CD3.Result
FROM dbo.RawData AS RD
CROSS APPLY dbo.CleanDate(RD.MangledDateTime1) AS CD1
CROSS APPLY dbo.CleanDate(RD.MangledDateTime2) AS CD2
CROSS APPLY dbo.CleanDate(RD.MangledDateTime3) AS CD3;
Output:
╔══════════╦═════════════════════╦═════════════════════╦═════════════════════╗
║ InsertID ║ Result1             ║ Result2             ║ Result3             ║
╠══════════╬═════════════════════╬═════════════════════╬═════════════════════╣
║        1 ║ 2000-10-10 00:00:00 ║ 2079-06-06 00:00:00 ║ NULL                ║
║        1 ║ 2079-06-06 00:00:00 ║ 2013-06-30 00:00:00 ║ 2079-06-06 00:00:00 ║
║        1 ║ 2000-10-10 00:00:00 ║ 2079-06-06 00:00:00 ║ 2013-06-30 00:00:00 ║
╚══════════╩═════════════════════╩═════════════════════╩═════════════════════╝
*CLR scalar functions have a much faster invocation path than T-SQL scalar functions and do not prevent parallelism.
As ypercube commented: "No, if the query is what you show, he is totally wrong. It's pretty sargable as it is."
You can verify this by:
- Create a simple test table with a [Date] column.
- Insert a large number of rows with varying dates. (The "large number" and "varying dates" are a precaution to ensure that your query is selective enough; otherwise the optimiser may choose not to use your index in any case.)
- Generate a query plan for your query, once with an index on the [Date] column and once without.
- You can also use STATISTICS IO to show the difference.
- If you have enough test data, the difference will be easily observable.
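The steps above can be sketched roughly as follows; the table name, row count, and date range are all illustrative assumptions, not a prescription:

```sql
-- Hypothetical test harness: names and row counts are illustrative only.
CREATE TABLE dbo.SargTest
(
    ID     integer IDENTITY PRIMARY KEY,
    [Date] date NOT NULL
);

-- Insert a large number of rows with varying dates.
INSERT dbo.SargTest ([Date])
SELECT TOP (100000)
    DATEADD(day, ABS(CHECKSUM(NEWID())) % 3650, '2010-01-01')
FROM sys.all_columns AS c1
CROSS JOIN sys.all_columns AS c2;

CREATE INDEX IX_SargTest_Date ON dbo.SargTest ([Date]);

-- Compare logical reads (and the plan) with and without the index.
SET STATISTICS IO ON;

SELECT COUNT_BIG(*)
FROM dbo.SargTest
WHERE [Date] >= '2013-06-01'
  AND [Date] <  '2013-06-02';
```

With the index in place you should see an index seek and far fewer logical reads than the table scan you get after dropping IX_SargTest_Date.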
Once you've got the evidence, I suggest you take it up with the DBA. However, don't get into an argument; just show the tests and data that demonstrate the index is used.
The point is that you don't want to be forced to bend over backwards to avoid non-existent issues.
Fortunately in the case you demonstrated it won't be a problem to move the functions outside the query. E.g.
DECLARE @FromDate date = 'inputdate',
        @ToDate   date = DATEADD(day, 1, @FromDate);
In fact, the above may even be more maintainable in the long run.
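A sketch of that pattern in full, with assumed table and column names:

```sql
-- Hypothetical names. The point is that the WHERE clause compares the
-- bare column against precomputed variables, so an index on a.[Date]
-- can still be used for a seek.
DECLARE @FromDate date = '2013-06-30',
        @ToDate   date = DATEADD(day, 1, @FromDate);

SELECT a.ID, a.[Date]
FROM dbo.TableA AS a
WHERE a.[Date] >= @FromDate
  AND a.[Date] <  @ToDate;
```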
However, there will come a time when you have something that cannot be trivially changed according to the DBA's wishes. Such as:
-- Granted, this probably belongs in a JOIN clause, but it is here for illustrative purposes.
-- The issue of sargability applies just as much to JOIN clauses as to WHERE clauses.
WHERE a.date >= b.date
AND   a.date < DATEADD(day, 1, b.date)
The only way to get this function out of the WHERE clause would be to precalculate another column b.NextDay. Which is exactly why you need the DBAs to understand sargability correctly. I.e. that the above:
- Would be able to leverage an index on a.date.
- But would not be able to leverage an index on b.date.
- So the most selective column/index should not have a function applied, but the other can.
- Attempting to hack a solution without a function in the WHERE clause will reduce both maintainability and performance.
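One way to precalculate that b.NextDay column, sketched here with an assumed table name, is a persisted computed column that can then be indexed (DATEADD is deterministic, so PERSISTED is allowed):

```sql
-- Hypothetical: add NextDay as a persisted computed column on table b,
-- so the join predicate can compare bare (indexable) columns on both sides.
ALTER TABLE dbo.TableB
    ADD NextDay AS DATEADD(day, 1, [Date]) PERSISTED;

-- An index on the computed column makes it seekable like any other column.
CREATE INDEX IX_TableB_NextDay ON dbo.TableB (NextDay);
```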
If you really can't get buy-in from the DBAs, perhaps the following will work and bypass their cargo-cult rules:
;WITH CTE_B AS (
    SELECT b.Date, DATEADD(day, 1, b.Date) AS NextDay
    FROM b
)
SELECT ...
FROM a, CTE_B
WHERE a.Date >= CTE_B.Date
AND   a.Date < CTE_B.NextDay
The optimiser will almost certainly optimise this in the same way as if the function were in the WHERE clause, so you shouldn't get a performance knock. But it's certainly an unnecessary reduction in maintainability.
Best Answer
You can rewrite it as an inline TVF returning a single column and row, and CROSS APPLY it, to get the benefits of inlining now (parallelism, no overhead of switching execution contexts, holistic query costing and optimisation) without having to wait for the work on inlining of scalar UDFs to be released.
So your function definition would be:
With example usage:
You don't really need the CASE expression, as the function will return NULL on NULL input anyway, and it may get better plans without it.
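Since the original function body isn't shown here, the following is only a sketch of the general pattern with hypothetical names: a single-statement inline TVF returning one row and one column, consumed via CROSS APPLY.

```sql
-- Hypothetical scalar logic wrapped as an inline TVF: RETURNS TABLE with
-- a single SELECT, so it is expanded into the calling query.
CREATE FUNCTION dbo.ComputeValue (@Input integer)
RETURNS TABLE
AS RETURN
SELECT Result = @Input * 2;  -- placeholder for the real expression
GO

-- Example usage; dbo.SomeTable and its columns are assumptions.
SELECT T.ID, CV.Result
FROM dbo.SomeTable AS T
CROSS APPLY dbo.ComputeValue(T.SomeColumn) AS CV;
```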