I seem to be having a huge performance gap between using hard coded values for a SELECT IN
, vs a STRING_SPLIT
. The query plans are identical except for the last phase where the index seek is getting executed multiple times for the STRING_SPLIT
code. The outcome is a CPU time of about 90000 vs about 15000 (according to dm_exec_query_stats
), so the difference is enormous. I have posted the 2 plans here…
The interesting thing is the query plan shows the cost as being pretty much the same, but in when I check dm_exec_query_stats
the cost (last_worker_time
) is massively different.
Here are 2 outputs from the query plan…
0x79DEAD79D1F149CD 16199
select *
from fn_get_samples(1) s
where s.sample_id in
(2495,2496,2497,2498,2499,2500,2501,2502,2503,2504)
0x4A073840486B252C 86689
select *
from fn_get_samples(1) s
where s.sample_id in
(select value as id
from
STRING_SPLIT('2495,2496,2497,2498,2499,2500,2501,2502,2503,2504',','))
The function code is…
CREATE FUNCTION [dbo].[fn_get_samples]
(
@user_id int
)
RETURNS TABLE
AS
RETURN (
-- get samples
select s.sample_id,language_id,native_language_id,s.source_sentence,s.markup_sentence,s.latin_sentence,
s.translation_source_sentence,s.translation_markup_sentence,s.translation_latin_sentence,
isnull(sample_vkl.knowledge_level_id,1) as vocab_knowledge_level_id,
isnull(sample_gkl.grammar_knowledge_level_id,0) as grammar_knowledge_level_id,
s.polite_level_id,
case when isnull(tr1.leitner_deck_index,0)=0 then 0 else cast((tr1.leitner_deck_index-1) as float)/cast((max_leitner_deck_index-1) as float) end as progress_percentage,
case when isnull(tr2.leitner_deck_index,0)=0 then 0 else cast((tr2.leitner_deck_index-1) as float)/cast((max_leitner_deck_index-1) as float) end as listening_progress_percentage,
case when f.object_id is null then 0 else 1 end as is_favorite,
case when st.object_id is null then 0 else 1 end as is_studied,
s.has_error,
s.is_deleted,
f.create_datetime as favorite_datetime,
st.create_datetime as studied_datetime,
s.create_user_id,
s.create_datetime,
isnull(s.modify_user_id,s.create_user_id) as modify_user_id,
isnull(s.modify_datetime,s.create_datetime) as modify_datetime,
s.display_order
from samples s
left outer join sample_knowledge_level_votes klv on klv.sample_id=s.sample_id and klv.user_id=@user_id
left outer join favorites f on f.user_id=@user_id and f.object_type_id=(select object_type_id from object_types ot where ot.object_type_name='Pattern Sample') and f.object_id=s.sample_id
left outer join studied st on st.user_id=@user_id and st.object_type_id=(select object_type_id from object_types ot where ot.object_type_name='Pattern Sample') and st.object_id=s.sample_id
left outer join leitner_tracking tr1 on tr1.user_id=@user_id and tr1.object_type_id=(select object_type_id from object_types ot where ot.object_type_name='Pattern Sample') and tr1.object_id=s.sample_id and tr1.skill_type_id=(select skill_type_id from skill_types where skill_type_name=N'Guess Pronunciation from Meaning')
left outer join leitner_tracking tr2 on tr2.user_id=@user_id and tr2.object_type_id=(select object_type_id from object_types ot where ot.object_type_name='Pattern Sample') and tr2.object_id=s.sample_id and tr2.skill_type_id=(select skill_type_id from skill_types where skill_type_name=N'Guess Meaning from Pronunciation')
cross join (select max(leitner_deck_index) as max_leitner_deck_index from leitner_decks) dm
left outer join vw_sample_user_grammar_kl sample_gkl on sample_gkl.user_id=@user_id and sample_gkl.sample_id=s.sample_id
left outer join vw_sample_avg_kl sample_vkl on sample_vkl.sample_id=s.sample_id
where is_deleted=0
)
It appears to be something to do with the 'vw_sample_avg_kl' join. If I comment out that join and the calculated column for 'vocab_knowledge_level_id' then the two query timings become very similar. Soon as I add it back it they differ greatly. Here is the code for that view…
CREATE VIEW [dbo].[vw_sample_avg_kl]
AS
select sample_id,knowledge_level_id from (
select sample_id,knowledge_level_id,count(*) as frequency,RANK() over (partition by sample_id order by count(*) desc,knowledge_level_id) as myrank
from sample_knowledge_level_votes
group by sample_id,knowledge_level_id
) sample_kl_ranking
where myrank=1
The id
field is INT
. My two queries when shown in dm_exec_query_stats
look like this…
0x4A073840486B252C 41096 select * from fn_get_samples(@user_id) s where s.sample_id in (select * from STRING_SPLIT(@sample_id_list,','))
0x79DEAD79D1F149CD 7849 select * from fn_get_samples(1) s where s.sample_id in (2495,2496,2497,2498,2499,2500,2501,2502,2503,2504)
(slightly different example data set from above hence the slightly different timings, but you can see the huge gap in performance)
Best Answer
The biggest differences between a hard-coded
IN
list and the values generated bySTRING_SPLIT
are:IN
list at compile time. This gives the optimizer accurate information about the number and distribution of values.STRING_SPLIT
function'svalue
column isvarchar
ornvarchar
depending on the input arguments. The length of the stringvalue
column is the same as the input string. LiteralIN
values are coerced to a compatible type at compile time if necessary.STRING_SPLIT
function is guessed, and the distribution of values is unknown.In short, using a literal
IN
list gives the optimizer much better information to work with.