I am working on a datawarehouse. One of our staging tables that is refreshed every night has about 10 million rows. We are using a custom built ETL tool that I can't make too many changes to. The tool loads this staging table like this:
truncate stage_table;
insert into stage_table with (tablockx) (column1, column2, etc...)
exec load_stage_table @batch_id = @batch_input
The contents of load_stage_table
has some setup and a select statement. I can't share the exact code, but here is a basic example.
create table load_stage_table
(
@batch_id varchar(max) = null
)
as
-- <update batch_id in batch_table>
-- collect data
select
column1 = table1.column1,
column2 = table2.column2,
...
from table1
join table2
on table2.id = table1.table2_id
-- many more similar joins
The problem is that when I run the stored procedure as its meant to be run with our ETL tool, the run time is almost 30 minutes. However if I modify the stored procedure to have the insert statement on the inside, then it only takes 1 minute.
create table load_stage_table
(
@batch_id varchar(max) = null
)
as
-- <update batch_id in batch_table>
-- collect data
insert into stage_table with (tablockx) (column1, column2, etc...)
select
column1 = table1.column1,
column2 = table2.column2,
...
from table1
join table2
on table2.id = table1.table2_id
-- many more similar joins
After running this a few times both ways and examining the execution plans, it seems that parallelism is not used when the insert is outside the stored procedure.
Does loading the table from the return, outside of the stored procedure, prevent parallelism? Or is this an indicator that the select statement needs some query tuning?
Best Answer
INSERT...EXEC
might prohibit parallelism, but I highly doubt that is the main issue here. The problem is thatINSERT...EXEC
works differently thanINSERT...SELECT
. When inserting the results of a query (i.e. aSELECT
statement orOUTPUT
clause) into a table, the results are streamed into the table. If you are watching therow_count
andused_page_count
fromsys.dm_db_partition_stats
, you will seem them increase immediately as theINSERT...SELECT
starts.But, when inserting the results of an
EXEC
(typically a Stored Procedure but could be Dynamic SQL), the sub-process / Stored Procedure needs to fully complete before anything is inserted into the table (yes, I just tested this). If you are watching therow_count
andused_page_count
fromsys.dm_db_partition_stats
, you will seem them stay at0
(or whatever they started at) for a looooong time after theINSERT...EXEC
starts. This is not a recipe for success when the Stored Procedure returns millions of rows as those results need to be stored somewhere prior to them being inserted into the table.