Sql-server – Getting Max for partition column in clustered columnstore index

columnstoresql serversql server 2014

  • SQL 2014 Enterprise
  • I have a datawarehouse table with ~ 800 million rows (total table size ~25 GB)
  • this table is partitioned by the import_date column (on daily basis)
  • it has a clustered columnstore index on it
  • when I execute a simple SELECT MAX(t.import_date) FROM stat.t_auftragsstatistik t it does an CLUSTERED COLUMNSTORE INDEX SCAN over ~150 million rows / 33156 pages
  • you can find the execution plan at https://www.brentozar.com/pastetheplan/?id=ryuskalug

On the other hand, when I SELECT COUNT(*) FROM stat.t_auftragsstatistik AS ta WHERE ta.import_date = '20170201' (which is the most recent date) it returns me ~375k rows and needs only 6 page read to gather this information.

Question:

  • why is the Max() so slow (~7 seconds) and reads so much data (with nonclustered index on import_date it would do a simple index scan in reverse order and stop after the first match (-> 3 reads))
  • why does it ignore the partitioning (even if I would partition on e.g. weekly basis, it would have only to scan the latest partition with at least one row)

Best Answer

To address your second question first, it looks like SQL Server may not have access to that optimization or may not always apply it as we would like it to. I created a yearly partitioned heap with 9 million rows and ran the same query as you. It would be nice if SQL Server looped backwards through the partition function and stopped when it found data in a partition. However, the query optimizer did a full table scan instead. I think that it's much more common to put an index on a partitioned column, so it would be natural to attribute a fast SELECT MAX(PARTITIONED_COLUMN) query to a partitioning scheme instead of an underlying index. If it would help I can post my code.

To address the first question, is 7 seconds really that slow to process 800 million rows? Your query runs in parallel so you should get batch mode processing. However, because you are on SQL Server 2014 you will not get aggregate pushdown. Sunil has a blog post here which works through a similar example.

The picture below shows and aggregate query that processes 10 million rows and computes a single aggregate SUM of Quantity sold from the SALES table

SELECT SUM (Quantity) FROM SALES

SQL Server 2014 scans these 10 million rows in batches (e.g. 900 rows) and sends these batches to Aggregate operator to compute the aggregate. The picture below shows 10 million rows moving from SCAN node to the Aggregate node.

In SQL Server 2016, the aggregate operator itself is pushed to the SCAN node (i.e. closer to the source of the data) and the aggregate is computed there for compressed rowgroups. The picture below shows that 0 rows moved from SCAN node to the AGGREGATE node. This is because the aggregate was computed at the SCAN node.

For the COUNT(*) query, you are explicitly filtering on a single partition. I would expect SQL Server to use partition pruning to only get the relevant rowgroups. Overall it should be faster and require less resources than the MAX() query does because that query will hit all partitions.

I suspect that you asked about the MAX() query mostly out of curiosity. The easiest way to make that query go faster is to upgrade to SQL Server 2016. For something more immediate, you could take a look at your rowgroups and make sure that they're of a good size. If you have fewer rowgroups I would expect the query to go somewhat faster. Alternatively, you could write a function to find the maximum value in a way that takes advantage of partitioning on the table.