Sql-server – Performance gap between WHERE IN (1,2,3,4) vs IN (select * from STRING_SPLIT(‘1,2,3,4′,’,’))

execution-planperformancequery-performancesql serverstring-splitting

I seem to be having a huge performance gap between using hard coded values for a SELECT IN, vs a STRING_SPLIT. The query plans are identical except for the last phase where the index seek is getting executed multiple times for the STRING_SPLIT code. The outcome is a CPU time of about 90000 vs about 15000 (according to dm_exec_query_stats), so the difference is enormous. I have posted the 2 plans here…

  1. Hardcoded Plan
  2. String Split Plan

The interesting thing is the query plan shows the cost as being pretty much the same, but in when I check dm_exec_query_stats the cost (last_worker_time) is massively different.

Here are 2 outputs from the query plan…

0x79DEAD79D1F149CD 16199 
select * 
from fn_get_samples(1) s 
where s.sample_id in 
    (2495,2496,2497,2498,2499,2500,2501,2502,2503,2504)

0x4A073840486B252C 86689 
select * 
from fn_get_samples(1) s 
where s.sample_id in 
    (select value as id 
     from 
     STRING_SPLIT('2495,2496,2497,2498,2499,2500,2501,2502,2503,2504',','))

The function code is…

CREATE FUNCTION [dbo].[fn_get_samples]
(
    @user_id int
)
RETURNS TABLE
AS
RETURN (
    -- get samples
    select s.sample_id,language_id,native_language_id,s.source_sentence,s.markup_sentence,s.latin_sentence,
        s.translation_source_sentence,s.translation_markup_sentence,s.translation_latin_sentence,
        isnull(sample_vkl.knowledge_level_id,1) as vocab_knowledge_level_id,
        isnull(sample_gkl.grammar_knowledge_level_id,0) as grammar_knowledge_level_id,
        s.polite_level_id,
        case when isnull(tr1.leitner_deck_index,0)=0 then 0 else cast((tr1.leitner_deck_index-1)  as float)/cast((max_leitner_deck_index-1) as float) end as progress_percentage,
        case when isnull(tr2.leitner_deck_index,0)=0 then 0 else cast((tr2.leitner_deck_index-1)  as float)/cast((max_leitner_deck_index-1) as float) end as listening_progress_percentage,
        case when f.object_id is null then 0 else 1 end as is_favorite,
        case when st.object_id is null then 0 else 1 end as is_studied,
        s.has_error,
        s.is_deleted,
        f.create_datetime as favorite_datetime,
        st.create_datetime as studied_datetime,
        s.create_user_id,
        s.create_datetime,
        isnull(s.modify_user_id,s.create_user_id) as modify_user_id,
        isnull(s.modify_datetime,s.create_datetime) as modify_datetime,
        s.display_order
        from samples s
        left outer join sample_knowledge_level_votes klv on klv.sample_id=s.sample_id and klv.user_id=@user_id
        left outer join favorites f on f.user_id=@user_id and f.object_type_id=(select object_type_id from object_types ot where ot.object_type_name='Pattern Sample') and f.object_id=s.sample_id
        left outer join studied st on st.user_id=@user_id and st.object_type_id=(select object_type_id from object_types ot where ot.object_type_name='Pattern Sample') and st.object_id=s.sample_id
        left outer join leitner_tracking tr1 on tr1.user_id=@user_id and tr1.object_type_id=(select object_type_id from object_types ot where ot.object_type_name='Pattern Sample') and tr1.object_id=s.sample_id and tr1.skill_type_id=(select skill_type_id from skill_types where skill_type_name=N'Guess Pronunciation from Meaning')
        left outer join leitner_tracking tr2 on tr2.user_id=@user_id and tr2.object_type_id=(select object_type_id from object_types ot where ot.object_type_name='Pattern Sample') and tr2.object_id=s.sample_id and tr2.skill_type_id=(select skill_type_id from skill_types where skill_type_name=N'Guess Meaning from Pronunciation')
        cross join (select max(leitner_deck_index) as max_leitner_deck_index from leitner_decks) dm
        left outer join vw_sample_user_grammar_kl sample_gkl on sample_gkl.user_id=@user_id and sample_gkl.sample_id=s.sample_id
        left outer join vw_sample_avg_kl sample_vkl on sample_vkl.sample_id=s.sample_id
        where is_deleted=0
)

It appears to be something to do with the 'vw_sample_avg_kl' join. If I comment out that join and the calculated column for 'vocab_knowledge_level_id' then the two query timings become very similar. Soon as I add it back it they differ greatly. Here is the code for that view…

CREATE VIEW [dbo].[vw_sample_avg_kl]
    AS 
    select sample_id,knowledge_level_id from (
        select sample_id,knowledge_level_id,count(*) as frequency,RANK() over (partition by sample_id order by count(*) desc,knowledge_level_id) as myrank
        from sample_knowledge_level_votes
        group by sample_id,knowledge_level_id
    ) sample_kl_ranking
    where myrank=1

The id field is INT. My two queries when shown in dm_exec_query_stats look like this…

0x4A073840486B252C  41096   select * from fn_get_samples(@user_id) s    where s.sample_id in (select * from STRING_SPLIT(@sample_id_list,','))

0x79DEAD79D1F149CD  7849    select * from fn_get_samples(1) s where s.sample_id in (2495,2496,2497,2498,2499,2500,2501,2502,2503,2504)

(slightly different example data set from above hence the slightly different timings, but you can see the huge gap in performance)

Best Answer

The biggest differences between a hard-coded IN list and the values generated by STRING_SPLIT are:

  1. The optimizer automatically sorts and removes duplicates from a literal IN list at compile time. This gives the optimizer accurate information about the number and distribution of values.
  2. Literal values can be embedded in a data access operator (e.g. an index seek). The results of the string split can be used to drive data access but a driving nested loops operator will be needed.
  3. The return type of the STRING_SPLIT function's value column is varchar or nvarchar depending on the input arguments. The length of the string value column is the same as the input string. Literal IN values are coerced to a compatible type at compile time if necessary.
  4. The number of rows returned by the STRING_SPLIT function is guessed, and the distribution of values is unknown.

In short, using a literal IN list gives the optimizer much better information to work with.