Sql-server – SQL Server columnstore metadata (pushdown) filter

columnstoreperformancesql serversql server 2014

AFAIK columnstore indexes store min/max values for segments (1 million rows), so this information can be used to prune not needed data.
This feature seems to work with equality predicates, but not with more complex cases like joins or subquery filters.
In SQL Server 2014, is it possible to use columnstore metadata filtering with joins or subqueries?

Below is my test case:

--create table generate_series with 67_108_864 rows (runs about 5 mins)

set nocount on;

create table generate_series(c bigint);
insert into generate_series select 1;

declare @i int = 26;
while @i != 0
begin
    insert into generate_series WITH (TABLOCK)
    select
        (select count(*) from generate_series)
        + row_number() over (order by (select null)) as c
    from generate_series;

    set @i = @i - 1;
end;


create nonclustered columnstore index cstore on generate_series(c);


--Test queries

-- 0 seconds (very fast)
select * from generate_series where c = 100;

-- 8 seconds (slow)
select 100 as c into #tmp;
select * from generate_series where c in (select c from #tmp);

Best Answer

The performance optimization that you are describing is known as rowgroup elimination. You can get rowgroup elimination with both queries in both SQL Server 2016 and SQL Server 2014 but SQL Server 2014 has additional restrictions.

Let's test in SQL Server 2016 first. The following query finishes very quickly:

select * 
from generate_series 
where c = 100
OPTION (MAXDOP 1);

SQL Server 2016 offers additional diagnostic information about rowgroup elimination through SET STATISTICS IO ON; so let's take advantage of that:

Table 'generate_series'. Segment reads 1, segment skipped 68.

SQL Server Execution Times: CPU time = 0 ms, elapsed time = 9 ms.

This query is also fast:

select * 
from generate_series 
where c in (select c from #tmp)
OPTION (MAXDOP 8);

Statistics output:

Table 'generate_series'. Segment reads 1, segment skipped 68.

SQL Server Execution Times: CPU time = 0 ms, elapsed time = 11 ms.

This query is fast as well:

select * 
from generate_series 
where c in (select c from #tmp)
OPTION (MAXDOP 1);

Table 'generate_series'. Segment reads 1, segment skipped 68.

SQL Server Execution Times: CPU time = 0 ms, elapsed time = 5 ms.

For each query SQL Server is able to skip 68 rowgroups. It only reads data from the rowgroup that contains an ID of 100. In the plan for the final query we can see that an optimized bitmap was applied:

That bitmap allows rowgroup elimination through a join.

In SQL Server 2014 I get similar performance results for the queries except for the last one that has a join and MAXDOP of 1. That one takes 21 seconds to run. Here is the query plan:

There is no bitmap and the scan isn't running in batch mode. We send back all 67 million rows to the hash join.

In SQL Server 2014 serial plans are not eligible for batch mode. If your query is running in serial for some reason that could explain the poor performance that you experienced. Note that the query that directly filters against c = 100 also runs in row mode but it still finishes quickly.

Best Answer

Related Solutions

Sql-server – Speed up INSERTs

Sql-server – Minimal Logging Conditions in SQL

Related Question