Sql-server – Emulate user-defined scalar function in a way which doesn’t prevent parallelism

functionsperformancequery-performancesql serversql-server-2016

I'm trying to see if there's a way to trick SQL Server to use a certain plan for the query.

1. Environment

Imagine you have some data which is shared between different processes. So, suppose we have some experiment results which take a lot of space. Then, for each process we know which year/month of experiment result we want to use.

if object_id('dbo.SharedData') is not null
    drop table SharedData

create table dbo.SharedData (
    experiment_year int,
    experiment_month int,
    rn int,
    calculated_number int,
    primary key (experiment_year, experiment_month, rn)
)
go

Now, for every process we have parameters saved in the table

if object_id('dbo.Params') is not null
    drop table dbo.Params

create table dbo.Params (
    session_id int,
    experiment_year int,
    experiment_month int,
    primary key (session_id)
)
go

2. Test data

Let's add some test data:

insert into dbo.Params (session_id, experiment_year, experiment_month)
select 1, 2014, 3 union all
select 2, 2014, 4 
go

insert into dbo.SharedData (experiment_year, experiment_month, rn, calculated_number)
select
    2014, 3, row_number() over(order by v1.name), abs(Checksum(newid())) % 10
from master.dbo.spt_values as v1
    cross join master.dbo.spt_values as v2
go

insert into dbo.SharedData (experiment_year, experiment_month, rn, calculated_number)
select
    2014, 4, row_number() over(order by v1.name), abs(Checksum(newid())) % 10
from master.dbo.spt_values as v1
    cross join master.dbo.spt_values as v2
go

3. Fetching results

Now, it's very easy to get experiment results by @experiment_year/@experiment_month:

create or alter function dbo.f_GetSharedData(@experiment_year int, @experiment_month int)
returns table
as
return (
    select
        d.rn,
        d.calculated_number
    from dbo.SharedData as d
    where
        d.experiment_year = @experiment_year and
        d.experiment_month = @experiment_month
)
go

The plan is nice and parallel:

select
    calculated_number,
    count(*)
from dbo.f_GetSharedData(2014, 4)
group by
    calculated_number

query 0 plan

enter image description here

4. Problem

But, to make usage of the data a bit more generic, I want to have another function – dbo.f_GetSharedDataBySession(@session_id int). So, straightforward way would be to create scalar functions, translating @session_id -> @experiment_year/@experiment_month:

create or alter function dbo.fn_GetExperimentYear(@session_id int)
returns int
as
begin
    return (
        select
            p.experiment_year
        from dbo.Params as p
        where
            p.session_id = @session_id
    )
end
go

create or alter function dbo.fn_GetExperimentMonth(@session_id int)
returns int
as
begin
    return (
        select
            p.experiment_month
        from dbo.Params as p
        where
            p.session_id = @session_id
    )
end
go

And now we can create our function:

create or alter function dbo.f_GetSharedDataBySession1(@session_id int)
returns table
as
return (
    select
        d.rn,
        d.calculated_number
    from dbo.f_GetSharedData(
        dbo.fn_GetExperimentYear(@session_id),
        dbo.fn_GetExperimentMonth(@session_id)
    ) as d
)
go

query 1 plan

enter image description here

The plan is the same except it's, of course, not parallel, because scalar functions performing data access make the whole plan serial.

So I've tried a several different approaches, like, using subqueries instead of scalar functions:

create or alter function dbo.f_GetSharedDataBySession2(@session_id int)
returns table
as
return (
    select
        d.rn,
        d.calculated_number
    from dbo.f_GetSharedData(
       (select p.experiment_year from dbo.Params as p where p.session_id = @session_id),
       (select p.experiment_month from dbo.Params as p where p.session_id = @session_id)
    ) as d
)
go

query 2 plan

enter image description here

Or using cross apply

create or alter function dbo.f_GetSharedDataBySession3(@session_id int)
returns table
as
return (
    select
        d.rn,
        d.calculated_number
    from dbo.Params as p
        cross apply dbo.f_GetSharedData(
            p.experiment_year,
            p.experiment_month
        ) as d
    where
        p.session_id = @session_id
)
go

query 3 plan

enter image description here

But I can't find a way to write this query to be as good as the one using scalar functions.

Couple of thoughts:

  1. Basically what I'd want is to being able to somehow tell SQL Server to pre-calculate certain values and then pass them further as constants.
  2. What could be helpful is if we had some intermediate materialization hint. I've checked a couple of variants (multi-statement TVF or cte with top), but no plan is as good as the one with scalar functions so far
  3. I know about coming improvement of SQL Server 2017 – Froid: Optimization of Imperative Programs in a Relational Database.I'm not sure it will help, though. It would've been nice to be proven wrong here, though.

Additional information

I am using a function (rather than selecting data directly from the tables) because it is much easier to use in many different queries, which usually have @session_id as a parameter.

I was asked to compare actual execution times. In this particular case

  • query 0 runs for ~500ms
  • query 1 runs for ~1500ms
  • query 2 runs for ~1500ms
  • query 3 runs for ~2000ms.

Plan #2 has an index scan instead of a seek, which is then filtered by predicates on nested loops. Plan #3 is not that bad, but still does more work and works slower that plan #0.

Let's assume that dbo.Params is changed rarely, and usually have around 1-200 rows, not more than, let's say 2000 is ever expected. It's around 10 columns now and I don't expect to add column too often.

The number of rows in Params is not fixed, so for every @session_id there'll be a row. Number of columns there is not fixed, it's one of the reasons I don't want to call dbo.f_GetSharedData(@experiment_year int, @experiment_month int) from everywhere, so I can add new column to this query internally.
I'd be glad to hear any opinions/suggestions on this, even if it has some restrictions.

Best Answer

You cannot really safely achieve exactly what you want in SQL Server today, i.e. in a single statement and with parallel execution, within the restrictions laid out in the question (as I perceive them).

So my simple answer is no. The rest of this answer is mostly a discussion of why that is, in case it is of interest.

It is possible to get a parallel plan, as noted in the question, but there are two main varieties, neither of which are suitable for your needs:

  1. A correlated nested loops join, with a round-robin distribute streams on the top level. Given that a single row is guaranteed to come from Params for a specific session_id value, the inner side will run on a single thread, even though it is marked with the parallelism icon. This is why the apparently-parallel plan 3 does not perform as well; it is in fact serial.

  2. The other alternative is for independent parallelism on the inner side of the nested loops join. Independent here means that threads are started up on the inner side, and not merely the same thread(s) as are executing the outer side of the nested loops join. SQL Server only supports independent inner-side nested loops parallelism when there is guaranteed to be one outer-side row and there are no correlated join parameters (plan 2).

So, we have a choice of a parallel plan that is serial (due to one thread) with the desired correlated values; or an inner-side parallel plan that has to scan because it has no parameters to seek with. (Aside: It really ought to be allowed to drive inner-side parallelism using exactly one set of correlated parameters, but it has never been implemented, probably for good reason).

A natural question then is: why do we need correlated parameters at all? Why can SQL Server not simply seek directly to the scalar values provided by e.g. a subquery?

Well, SQL Server can only 'index seek' using simple scalar references, e.g. a constant, variable, column, or expression reference (so a scalar function result can also qualify). A subquery (or other similar construction) is simply too complex (and potentially unsafe) to push into the storage engine whole. So, separate query plan operators are required. This is turn requires correlation, which means no parallelism of the sort you want.

All in all, there really is no better solution currently than methods like assigning the lookup values to variables and then using those in the function parameters in a separate statement.

Now you may have specific local considerations that means caching the current values of the year and month in SESSION_CONTEXT is worthwhile i.e.:

SELECT FGSD.calculated_number, COUNT_BIG(*)
FROM dbo.f_GetSharedData
(
    CONVERT(integer, SESSION_CONTEXT(N'experiment_year')), 
    CONVERT(integer, SESSION_CONTEXT(N'experiment_month'))
) AS FGSD
GROUP BY FGSD.calculated_number;

But this falls into the category of workaround.

On the other hand, if aggregation performance is of primary importance, you could consider sticking with inline functions and creating a columnstore index (primary or secondary) on the table. You may find the benefits of columnstore storage, batch mode processing, and aggregate pushdown provide greater benefits than a row-mode parallel seek anyway.

But beware of scalar T-SQL functions, especially with columnstore storage, since it is easy to end up with the function being evaluated per-row in a separate row-mode Filter. It is generally quite tricky to guarantee the number of times SQL Server will choose to evaluate scalars, and better not to try.