Sql-server – Eliminate filter operator before columnstore index scan operator

columnstore, index-tuning, sql-server, sql-server-2017, t-sql

I have a large fact table with millions of rows called MyLargeFactTable,
and it's a clustered columnstore table.

There is also a composite primary key constraint on it
(customer_id, location_id, order_date columns).

I also have a temp table, #my_keys_to_filter_MyLargeFactTable,
with the very same 3 columns,
and it contains a few thousand unique combinations of these 3 key values.

The following query gives me back the desired result set

...
FROM #my_keys_to_filter_MyLargeFactTable AS t
JOIN dbo.MyLargeFactTable AS m
ON m.customer_id = t.customer_id
AND m.location_id = t.location_id
AND m.order_date = t.order_date

but I notice that the Index Scan operator on the fact table returns far more rows than it should (about a million) and feeds them into a Filter operator, which then reduces the result set to the desired few thousand rows.

The Index Scan operator reads far too many rows (and they are quite wide), which increases IO and significantly slows down the whole query.

Are my parameters not sargable?

How could I remove the Filter operator and somehow force the Index Scan operator to read only the few thousand rows?

Table definitions:

create table #my_keys_to_filter_MyLargeFactTable 
(
customer_id varchar(96) not null,
location_id varchar(96) not null,
order_date date not null,
primary key clustered (customer_id,location_id,order_date)
)

create table MyLargeFactTable
(
customer_id varchar(96) not null,
location_id varchar(96) not null,
order_date date not null,
...
lot of wide decimal typed columns, and even large varchars
...
PRIMARY KEY NONCLUSTERED  (customer_id,location_id,order_date),
INDEX cci CLUSTERED COLUMNSTORE
)

Best Answer

How could I remove the Filter operator and somehow force the Index Scan operator to read only the few thousand rows?

The Filter operator is applying the bitmap built on the join columns at the hash join.

Of the three join predicates, only order_date has a data type that is supported for bitmap pushdown to the column store scan. If you look at the Predicate on the scan you should see this as something like:

PROBE([Opt_Bitmap1005],[dbo].[MyLargeFactTable].[order_date])

The remaining join predicates are strings and so will appear at the Filter as part of the full bitmap test:

PROBE([Opt_Bitmap1005],
    [dbo].[MyLargeFactTable].[customer_id],
    [dbo].[MyLargeFactTable].[location_id],
    [dbo].[MyLargeFactTable].[order_date])

Pushing (parts of) the bitmap test down into the column store scan is an optimization that is only available for data types that can fit in 64 bits (like date in your example). Note that join bitmap pushdown is different from string predicate pushdown (e.g. pushing customer_id LIKE '%XYZ%').

There are several ways you could look to work around this limitation. Redesigning the schema such that long strings are moved to a dimension table and referenced using an integer key is one option.
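As a rough sketch of that redesign (not the asker's actual schema), the strings could live in small dimension tables while the fact table stores only integer keys. Every name below that does not appear in the question (DimCustomer, DimLocation, customer_key, location_key, MyLargeFactTableKeyed, #my_int_keys) is hypothetical, and the measure columns are left out:

```sql
-- Hypothetical dimension tables holding the long string identifiers once each.
CREATE TABLE dbo.DimCustomer
(
    customer_key int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    customer_id  varchar(96) NOT NULL UNIQUE
);

CREATE TABLE dbo.DimLocation
(
    location_key int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    location_id  varchar(96) NOT NULL UNIQUE
);

-- Hypothetical keyed version of the fact table (measure columns omitted).
CREATE TABLE dbo.MyLargeFactTableKeyed
(
    customer_key int  NOT NULL,
    location_key int  NOT NULL,
    order_date   date NOT NULL,
    INDEX cci CLUSTERED COLUMNSTORE
);

-- Resolve the string keys in the temp table to integer keys first.
SELECT c.customer_key, l.location_key, t.order_date
INTO   #my_int_keys
FROM   #my_keys_to_filter_MyLargeFactTable AS t
JOIN   dbo.DimCustomer AS c ON c.customer_id = t.customer_id
JOIN   dbo.DimLocation AS l ON l.location_id = t.location_id;

-- All three join columns now fit in 64 bits.
SELECT m.customer_key, m.location_key, m.order_date
FROM   #my_int_keys AS t
JOIN   dbo.MyLargeFactTableKeyed AS m
  ON   m.customer_key = t.customer_key
 AND   m.location_key = t.location_key
 AND   m.order_date   = t.order_date;
```

With this shape every join column is an int or a date, so the whole bitmap test becomes a candidate for pushdown into the scan. Whether the optimizer actually does so still has to be checked in the execution plan.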

Somewhat less intrusively, you might be able to add an integer checksum to the column store (sadly not as a persisted computed column) and temporary table, then add that into the join, e.g. an integer computed from CHECKSUM(customer_id, location_id, order_date) or the like.

There would still be a Filter, but the bitmap would include the checksum column, which could be pushed into the scan. This ought to significantly reduce the number of rows passed into the Filter.
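A minimal sketch of the checksum idea follows. The column name key_checksum is an assumption, as is populating it with a one-off UPDATE; on a large columnstore table the value would more realistically be computed during the load. The sketch also assumes the string columns use the same collation in both tables, so that CHECKSUM returns the same value for equal keys. GO is the batch separator used by SSMS and sqlcmd; it is needed here because the new column cannot be referenced in the same batch that adds it.

```sql
-- Add a plain int column to both tables (hypothetical name: key_checksum).
ALTER TABLE dbo.MyLargeFactTable ADD key_checksum int NULL;
ALTER TABLE #my_keys_to_filter_MyLargeFactTable ADD key_checksum int NULL;
GO

-- Populate it from the three key columns on both sides.
UPDATE dbo.MyLargeFactTable
SET    key_checksum = CHECKSUM(customer_id, location_id, order_date);

UPDATE #my_keys_to_filter_MyLargeFactTable
SET    key_checksum = CHECKSUM(customer_id, location_id, order_date);
GO

-- Join on the checksum as well as the original keys. The original predicates
-- stay in place because different keys can share a checksum.
SELECT m.customer_id, m.location_id, m.order_date
FROM   #my_keys_to_filter_MyLargeFactTable AS t
JOIN   dbo.MyLargeFactTable AS m
  ON   m.key_checksum = t.key_checksum
 AND   m.customer_id  = t.customer_id
 AND   m.location_id  = t.location_id
 AND   m.order_date   = t.order_date;
```

If this works as hoped, the Predicate on the scan should show a PROBE that includes key_checksum alongside order_date.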

Related Solutions

Sql-server – clustered and covering index ignored on delete statement. Table scan occurs

The optimizer may find a scan more appropriate based on statistics on the duplicate index instead of statistics on the PK. You didn't define the duplicate index as UNIQUE, so getting the "good" or "bad" plan could be just a matter of which index metadata is used by the optimizer to produce the plan. Very hard to tell without the actual execution plan, though.

Sql-server – Why clustered index scan

This table is very small!

It has 20 rows of which 2 match the search condition. The table definition contains three columns and two indexes (which both support uniqueness constraints).

CREATE TABLE Person.ContactType(
    ContactTypeID int IDENTITY(1,1) NOT NULL,
    Name dbo.Name NOT NULL,
    ModifiedDate datetime NOT NULL,
    CONSTRAINT PK_ContactType_ContactTypeID PRIMARY KEY CLUSTERED(ContactTypeID),
    CONSTRAINT AK_ContactType_Name UNIQUE NONCLUSTERED(Name)
)

Running

SELECT index_type_desc,
       index_depth,
       page_count,
       avg_page_space_used_in_percent,
       avg_record_size_in_bytes
FROM   sys.dm_db_index_physical_stats(db_id(), 
                                      object_id('Person.ContactType'), 
                                      NULL, 
                                      NULL, 
                                      'DETAILED')

shows that both indexes consist of only a single leaf page, with no upper-level pages.

+--------------------+-------------+------------+--------------------------------+--------------------------+
|  index_type_desc   | index_depth | page_count | avg_page_space_used_in_percent | avg_record_size_in_bytes |
+--------------------+-------------+------------+--------------------------------+--------------------------+
| CLUSTERED INDEX    |           1 |          1 | 15.9130219915987               | 62.5                     |
| NONCLUSTERED INDEX |           1 |          1 | 13.1949592290586               | 51.5                     |
+--------------------+-------------+------------+--------------------------------+--------------------------+

Rows on each index page aren't necessarily in index key order but each page has a slot array with the offset of each row on the page. This is maintained in index order.

The nonclustered index covers two out of the three columns (Name as a key column and ContactTypeID as a row locator back to the base table) but is missing ModifiedDate.

You can use index hints to force the NCI seek as below

SELECT ct.*
FROM   Person.ContactType AS ct WITH (INDEX = AK_ContactType_Name)
WHERE  ct.Name LIKE 'Own%';

But you can see that under SQL Server's cost model this plan is given a higher estimated cost than the competing CI scan (roughly double).


The single page clustered index scan would just need to read all the 20 rows on the page, evaluate the predicate against them and return them.

The single page nonclustered index range seek might be able to perform a binary search on the slot array to reduce the number of rows evaluated. However, the index does not cover the query, so for each row returned by the NCI seek it would also need a potential IO to retrieve the CI page, and it would then still need to locate the row with the missing column values on that page.
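One way to see the difference in page accesses is to run both forms of the query with SET STATISTICS IO ON and compare the logical reads reported in the Messages output. This is only a sketch: the unhinted query is inferred from the hinted one above, and no particular read counts are claimed here.

```sql
SET STATISTICS IO ON;

-- Plan chosen by the optimizer (the clustered index scan discussed above).
SELECT ct.*
FROM   Person.ContactType AS ct
WHERE  ct.Name LIKE 'Own%';

-- Forced nonclustered index seek, which must look up ModifiedDate
-- in the clustered index for every matching row.
SELECT ct.*
FROM   Person.ContactType AS ct WITH (INDEX = AK_ContactType_Name)
WHERE  ct.Name LIKE 'Own%';

SET STATISTICS IO OFF;
```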

On my machine, running 1 million iterations of the nonclustered index plan took 15.245 seconds compared to 11.113 seconds for the clustered index plan. Whilst this is far from double, the plan without the hint was measurably faster.
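The answer does not say how the timing was taken. One plausible harness, shown purely as an assumption, runs the query in a loop and assigns the columns to variables so that no result sets are sent to the client. The variable names are made up, and nvarchar(50) is assumed as the base type of dbo.Name.

```sql
SET NOCOUNT ON;

DECLARE @i        int = 0,
        @id       int,
        @name     nvarchar(50),   -- assumed base type of dbo.Name
        @modified datetime,
        @start    datetime2 = SYSDATETIME();

WHILE @i < 1000000
BEGIN
    -- All three columns are read so the nonclustered index still does not cover the query.
    SELECT @id = ct.ContactTypeID,
           @name = ct.Name,
           @modified = ct.ModifiedDate
    FROM   Person.ContactType AS ct WITH (INDEX = AK_ContactType_Name)
    WHERE  ct.Name LIKE 'Own%';

    SET @i += 1;
END;

SELECT DATEDIFF(MILLISECOND, @start, SYSDATETIME()) AS elapsed_ms;
```

Removing the WITH (INDEX = ...) hint and rerunning gives the figure for the clustered index plan. Absolute timings will differ from machine to machine.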

Even if the table were orders of magnitude larger, however, you may well still not get your expected plan with lookups.

SQL Server's costing model prefers sequential scans to random IO lookups, and the "tipping point" between it choosing a scan of a covering index or a seek and lookups of a non-covering one is often surprisingly low, as discussed in Kimberly Tripp's blog post on the subject.

It is certainly not out of the question that it would choose such a plan for a 10% selective predicate but the clustered index would likely need to be quite a lot wider than the NCI for it to do so.
