Sql-server – Emulate user-defined scalar function in a way which doesn’t prevent parallelism

Tags: functions, performance, query-performance, sql-server, sql-server-2016

I'm trying to see if there's a way to trick SQL Server into using a certain plan for the query.

1. Environment

Imagine you have some data which is shared between different processes. For example, suppose we have some experiment results which take up a lot of space, and for each process we know which year/month of experiment results we want to use.

if object_id('dbo.SharedData') is not null
    drop table dbo.SharedData

create table dbo.SharedData (
    experiment_year int,
    experiment_month int,
    rn int,
    calculated_number int,
    primary key (experiment_year, experiment_month, rn)
)
go

Now, for every process we have parameters saved in a table:

if object_id('dbo.Params') is not null
    drop table dbo.Params

create table dbo.Params (
    session_id int,
    experiment_year int,
    experiment_month int,
    primary key (session_id)
)
go

2. Test data

Let's add some test data:

insert into dbo.Params (session_id, experiment_year, experiment_month)
select 1, 2014, 3 union all
select 2, 2014, 4 
go

insert into dbo.SharedData (experiment_year, experiment_month, rn, calculated_number)
select
    2014, 3, row_number() over(order by v1.name), abs(Checksum(newid())) % 10
from master.dbo.spt_values as v1
    cross join master.dbo.spt_values as v2
go

insert into dbo.SharedData (experiment_year, experiment_month, rn, calculated_number)
select
    2014, 4, row_number() over(order by v1.name), abs(Checksum(newid())) % 10
from master.dbo.spt_values as v1
    cross join master.dbo.spt_values as v2
go

3. Fetching results

Now, it's very easy to get experiment results by @experiment_year/@experiment_month:

create or alter function dbo.f_GetSharedData(@experiment_year int, @experiment_month int)
returns table
as
return (
    select
        d.rn,
        d.calculated_number
    from dbo.SharedData as d
    where
        d.experiment_year = @experiment_year and
        d.experiment_month = @experiment_month
)
go

The plan is nice and parallel:

select
    calculated_number,
    count(*)
from dbo.f_GetSharedData(2014, 4)
group by
    calculated_number

query 0 plan

4. Problem

But, to make usage of the data a bit more generic, I want to have another function – dbo.f_GetSharedDataBySession(@session_id int). The straightforward way would be to create scalar functions that translate @session_id -> @experiment_year/@experiment_month:

create or alter function dbo.fn_GetExperimentYear(@session_id int)
returns int
as
begin
    return (
        select
            p.experiment_year
        from dbo.Params as p
        where
            p.session_id = @session_id
    )
end
go

create or alter function dbo.fn_GetExperimentMonth(@session_id int)
returns int
as
begin
    return (
        select
            p.experiment_month
        from dbo.Params as p
        where
            p.session_id = @session_id
    )
end
go

And now we can create our function:

create or alter function dbo.f_GetSharedDataBySession1(@session_id int)
returns table
as
return (
    select
        d.rn,
        d.calculated_number
    from dbo.f_GetSharedData(
        dbo.fn_GetExperimentYear(@session_id),
        dbo.fn_GetExperimentMonth(@session_id)
    ) as d
)
go

query 1 plan

The plan is the same except it's, of course, not parallel, because scalar functions performing data access make the whole plan serial.

So I've tried several different approaches, like using subqueries instead of scalar functions:

create or alter function dbo.f_GetSharedDataBySession2(@session_id int)
returns table
as
return (
    select
        d.rn,
        d.calculated_number
    from dbo.f_GetSharedData(
       (select p.experiment_year from dbo.Params as p where p.session_id = @session_id),
       (select p.experiment_month from dbo.Params as p where p.session_id = @session_id)
    ) as d
)
go

query 2 plan

Or using cross apply:

create or alter function dbo.f_GetSharedDataBySession3(@session_id int)
returns table
as
return (
    select
        d.rn,
        d.calculated_number
    from dbo.Params as p
        cross apply dbo.f_GetSharedData(
            p.experiment_year,
            p.experiment_month
        ) as d
    where
        p.session_id = @session_id
)
go

query 3 plan

But I can't find a way to write this query to be as good as the one using scalar functions.

Couple of thoughts:

Basically, what I'd want is to be able to somehow tell SQL Server to pre-calculate certain values and then pass them on as constants.
What could be helpful is some kind of intermediate materialization hint. I've checked a couple of variants (multi-statement TVF or CTE with top), but no plan is as good as the one with scalar functions so far.
I know about the upcoming improvement in SQL Server 2017 – Froid: Optimization of Imperative Programs in a Relational Database. I'm not sure it will help, but it would be nice to be proven wrong here.

Additional information

I am using a function (rather than selecting data directly from the tables) because it is much easier to use in many different queries, which usually have @session_id as a parameter.

I was asked to compare actual execution times. In this particular case

query 0 runs for ~500ms
query 1 runs for ~1500ms
query 2 runs for ~1500ms
query 3 runs for ~2000ms.
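
A minimal sketch of how timings like these might be collected, assuming the functions and test data defined above (session_id 2 maps to 2014/4). The aggregate simply mirrors the one shown for query 0; this is not necessarily how the numbers above were measured, and results will differ between machines:

```sql
-- Sketch only: report CPU and elapsed time per statement in the Messages output
set statistics time on;

-- query 0: literal parameters
select calculated_number, count(*) as cnt
from dbo.f_GetSharedData(2014, 4)
group by calculated_number;

-- query 3: cross apply variant; session_id 2 is taken from the test data
select calculated_number, count(*) as cnt
from dbo.f_GetSharedDataBySession3(2)
group by calculated_number;

set statistics time off;
```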

Plan #2 has an index scan instead of a seek, which is then filtered by predicates on nested loops. Plan #3 is not that bad, but still does more work and runs slower than plan #0.

Let's assume that dbo.Params changes rarely and usually has around 1-200 rows; no more than, let's say, 2000 is ever expected. It has around 10 columns now and I don't expect to add columns too often.

The number of rows in Params is not fixed, so for every @session_id there'll be a row. The number of columns there is not fixed either; that's one of the reasons I don't want to call dbo.f_GetSharedData(@experiment_year int, @experiment_month int) from everywhere, so that I can add a new column to this query internally.
I'd be glad to hear any opinions/suggestions on this, even if it has some restrictions.

Best Answer

You cannot really safely achieve exactly what you want in SQL Server today, i.e. in a single statement and with parallel execution, within the restrictions laid out in the question (as I perceive them).

So my simple answer is no. The rest of this answer is mostly a discussion of why that is, in case it is of interest.

It is possible to get a parallel plan, as noted in the question, but there are two main varieties, neither of which is suitable for your needs:

A correlated nested loops join, with a round-robin distribute streams on the top level. Given that a single row is guaranteed to come from Params for a specific session_id value, the inner side will run on a single thread, even though it is marked with the parallelism icon. This is why the apparently-parallel plan 3 does not perform as well; it is in fact serial.
The other alternative is for independent parallelism on the inner side of the nested loops join. Independent here means that threads are started up on the inner side, and not merely the same thread(s) as are executing the outer side of the nested loops join. SQL Server only supports independent inner-side nested loops parallelism when there is guaranteed to be one outer-side row and there are no correlated join parameters (plan 2).

So, we have a choice of a parallel plan that is serial (due to one thread) with the desired correlated values; or an inner-side parallel plan that has to scan because it has no parameters to seek with. (Aside: It really ought to be allowed to drive inner-side parallelism using exactly one set of correlated parameters, but it has never been implemented, probably for good reason).

A natural question then is: why do we need correlated parameters at all? Why can SQL Server not simply seek directly to the scalar values provided by e.g. a subquery?

Well, SQL Server can only 'index seek' using simple scalar references, e.g. a constant, variable, column, or expression reference (so a scalar function result can also qualify). A subquery (or other similar construction) is simply too complex (and potentially unsafe) to push into the storage engine whole. So, separate query plan operators are required. This in turn requires correlation, which means no parallelism of the sort you want.

All in all, there really is no better solution currently than methods like assigning the lookup values to variables and then using those in the function parameters in a separate statement.
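
As a rough sketch of that two-statement approach, assuming the tables and inline function from the question (the variable names and the session_id value 2 are illustrative only):

```sql
-- Hypothetical session id, taken from the question's test data
DECLARE @session_id int = 2;
DECLARE @experiment_year int, @experiment_month int;

-- Statement 1: look up the parameter values once
SELECT
    @experiment_year = P.experiment_year,
    @experiment_month = P.experiment_month
FROM dbo.Params AS P
WHERE P.session_id = @session_id;

-- Statement 2: the variables are simple scalar references,
-- so the optimizer can seek on them without any correlation
SELECT FGSD.calculated_number, COUNT_BIG(*) AS row_count
FROM dbo.f_GetSharedData(@experiment_year, @experiment_month) AS FGSD
GROUP BY FGSD.calculated_number;
```

The price is that the lookup can no longer be hidden inside a single inline function; each caller has to run the two statements itself.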

Now you may have specific local considerations that mean caching the current values of the year and month in SESSION_CONTEXT is worthwhile, i.e.:

SELECT FGSD.calculated_number, COUNT_BIG(*)
FROM dbo.f_GetSharedData
(
    CONVERT(integer, SESSION_CONTEXT(N'experiment_year')), 
    CONVERT(integer, SESSION_CONTEXT(N'experiment_month'))
) AS FGSD
GROUP BY FGSD.calculated_number;

But this falls into the category of workaround.
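
For completeness, a sketch of how the values might be placed in SESSION_CONTEXT beforehand, using sys.sp_set_session_context (available from SQL Server 2016). The key names match the query above; the session_id value and the point at which this runs (for example, at connection setup) are assumptions:

```sql
-- Hypothetical session id, taken from the question's test data
DECLARE @session_id int = 2;
DECLARE @experiment_year int, @experiment_month int;

SELECT
    @experiment_year = P.experiment_year,
    @experiment_month = P.experiment_month
FROM dbo.Params AS P
WHERE P.session_id = @session_id;

-- Cache the values for the lifetime of the connection
EXECUTE sys.sp_set_session_context
    @key = N'experiment_year', @value = @experiment_year;
EXECUTE sys.sp_set_session_context
    @key = N'experiment_month', @value = @experiment_month;
```

Note that SESSION_CONTEXT returns sql_variant, which is why the query above converts each value to integer explicitly.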

On the other hand, if aggregation performance is of primary importance, you could consider sticking with inline functions and creating a columnstore index (primary or secondary) on the table. You may find the benefits of columnstore storage, batch mode processing, and aggregate pushdown provide greater benefits than a row-mode parallel seek anyway.
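
A minimal sketch of the secondary (nonclustered) columnstore option, assuming the dbo.SharedData table from the question. The index name is hypothetical, and whether the optimizer actually chooses the index and batch mode depends on the data and the server:

```sql
-- Hypothetical index name; covers the columns used by the aggregate query
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_SharedData
ON dbo.SharedData (experiment_year, experiment_month, calculated_number);

-- Same aggregate as before; check the plan for a Columnstore Index Scan
-- running in batch mode
SELECT FGSD.calculated_number, COUNT_BIG(*) AS row_count
FROM dbo.f_GetSharedData(2014, 4) AS FGSD
GROUP BY FGSD.calculated_number;
```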

But beware of scalar T-SQL functions, especially with columnstore storage, since it is easy to end up with the function being evaluated per-row in a separate row-mode Filter. It is generally quite tricky to guarantee the number of times SQL Server will choose to evaluate scalars, and better not to try.

Related Solutions

Sql-server – Why does sql server need to convert count(*) result into int before comparing it with an int variable

The fact that you are comparing it against an integer variable is irrelevant.

The plan for COUNT always has a CONVERT_IMPLICIT(int,[ExprNNNN],0) where ExprNNNN is the label for the expression representing the result of the COUNT.

My assumption has always been that the code for COUNT just ends up calling the same code as COUNT_BIG and the cast is necessary to convert the bigint result of that back down to int.

In fact COUNT_BIG(*) isn't even distinguished in the query plan from COUNT(*). Both show up as Scalar Operator(Count(*)).

COUNT_BIG(nullable_column) does get distinguished in the execution plan from COUNT(nullable_column) but the latter still gets an implicit cast back down to int.

Some evidence that this is the case is below.

WITH 
E1(N) AS 
(
    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
)                                       -- 1*10^1 or 10 rows
, E2(N) AS (SELECT 1 FROM E1 a, E1 b)   -- 1*10^2 or 100 rows
, E4(N) AS (SELECT 1 FROM E2 a, E2 b)   -- 1*10^4 or 10,000 rows
, E8(N) AS (SELECT 1 FROM E4 a, E4 b)   -- 1*10^8 or 100,000,000 rows
, E16(N) AS (SELECT 1 FROM E8 a, E8 b)  -- 1*10^16 or 10,000,000,000,000,000 rows
, T(N) AS (SELECT TOP (2150000000) 
                  ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS N FROM E16)
SELECT COUNT(CASE WHEN N < 2150000000 THEN 1 END)
FROM T 
OPTION (MAXDOP 1)

This takes about 7 minutes to run on my desktop and returns the following

Msg 8115, Level 16, State 2, Line 1
Arithmetic overflow error converting expression to data type int.
Warning: Null value is eliminated by an aggregate or other SET operation.

This indicates that the COUNT must have continued on after an int would have overflowed (at 2147483647), and that the last row (2150000000) was processed by the COUNT operator, leading to the warning about NULL being eliminated.

By way of comparison replacing the COUNT expression with SUM(CASE WHEN N < 2150000000 THEN 1 END) returns

Msg 8115, Level 16, State 2, Line 1
Arithmetic overflow error converting expression to data type int.

with no ANSI warning about NULL. From which I conclude the overflow happened in this case during the aggregation itself before row 2,150,000,000 was reached.
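
The declared result types themselves can be checked much more cheaply. A small sketch, using sys.objects only as a convenient row source:

```sql
-- COUNT is documented to return int and COUNT_BIG to return bigint;
-- SQL_VARIANT_PROPERTY reports the base type of each expression
SELECT
    SQL_VARIANT_PROPERTY(COUNT(*), 'BaseType')     AS count_type,
    SQL_VARIANT_PROPERTY(COUNT_BIG(*), 'BaseType') AS count_big_type
FROM sys.objects;
```

This only shows the types visible to the query; it says nothing about how the aggregate is computed internally, which is what the long-running test above probes.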

Sql-server – Why would call to scalar function inside a Table Value Function be slower than outside the TVF

Scalar functions are called once-per-row, when called as part of a query.

Consider the following example.

Create a new, blank database for our tests:

USE master;
IF EXISTS (SELECT 1 FROM sys.databases d WHERE d.name = 'mv')
BEGIN
    ALTER DATABASE mv SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
    DROP DATABASE mv;
END
GO
CREATE DATABASE mv;
GO

Create a table, a multi-statement function, and a table-valued-function:

USE mv;
GO
CREATE TABLE dbo.t
(
    t_id int NOT NULL
        CONSTRAINT PK_t
        PRIMARY KEY CLUSTERED
);
GO

CREATE FUNCTION dbo.t_func
(
    @t_id int
)
RETURNS bit
WITH SCHEMABINDING
AS
BEGIN
    DECLARE @r bit;
    IF EXISTS (SELECT 1 FROM dbo.t WHERE t.t_id = @t_id)
        SET @r = 1
    ELSE
        SET @r = 0;
    RETURN @r;
END
GO

CREATE FUNCTION dbo.t_tvf
(
    @min_t_id int
    , @max_t_id int
)
RETURNS TABLE 
WITH SCHEMABINDING
AS
RETURN (
    SELECT t_id = t.t_id
        , e = dbo.t_func(dbo.t.t_id)
    FROM dbo.t
    WHERE t.t_id >= @min_t_id
        AND t.t_id <= @max_t_id
);
GO

Insert some sample data into the table:

INSERT INTO dbo.t (t_id)
SELECT ROW_NUMBER() OVER (ORDER BY c.id, c.colid)
FROM sys.syscolumns c;
GO

Create a table to store function execution stats, and populate it with a start-row showing execution counts for the multi-statement-function, t_func:

CREATE TABLE dbo.function_stats
(
    run_num int NOT NULL
    , object_name sysname NOT NULL
    , execution_count int NULL 
    , CONSTRAINT PK_function_stats
        PRIMARY KEY CLUSTERED (run_num, object_name)
);
GO
INSERT INTO dbo.function_stats (run_num, object_name, execution_count)
SELECT 1
    , o.name
    , COALESCE(fs.execution_count, 0)
FROM sys.objects o 
    LEFT JOIN sys.dm_exec_function_stats fs ON fs.object_id = o.object_id
WHERE o.name = 't_func';
GO

Run a query against the TVF:
```
SELECT t.*
FROM dbo.t_tvf(1, 2) t;
GO
```

Capture the execution stats now:

INSERT INTO dbo.function_stats (run_num, object_name, execution_count)
SELECT 2
    , o.name
    , COALESCE(fs.execution_count, 0)
FROM sys.objects o 
    LEFT JOIN sys.dm_exec_function_stats fs ON fs.object_id = o.object_id
WHERE o.name = 't_func';

The function stats results:

SELECT *
FROM dbo.function_stats fs
ORDER BY fs.run_num
    , fs.object_name;

╔═════════╦═════════════╦═════════════════╗
║ run_num ║ object_name ║ execution_count ║
╠═════════╬═════════════╬═════════════════╣
║       1 ║ t_func      ║               0 ║
║       2 ║ t_func      ║               2 ║
╚═════════╩═════════════╩═════════════════╝

As you can see, the multi-statement-function has executed twice, once per row in the source table accessed by the TVF.

I expect the multi-statement-function is being called many, many times by the TVF, giving the impression that it is running slowly, whereas in fact it is simply being called many times.
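
To see how the count scales, the same check can be repeated with a wider range. This is a sketch assuming the mv database and objects created above; the range 1 to 100 is arbitrary, and the counter in sys.dm_exec_function_stats is cumulative while the plan stays cached, so it is the increase that matters:

```sql
-- Touch more rows through the TVF
SELECT t.*
FROM dbo.t_tvf(1, 100) AS t;
GO

-- Read the cumulative execution count for the scalar function
SELECT fs.execution_count
FROM sys.dm_exec_function_stats AS fs
WHERE fs.database_id = DB_ID()
    AND fs.object_id = OBJECT_ID(N'dbo.t_func');
```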
