Sql-server – How many partitions should I make for the clustered columnstore index tables? Should I partition the rowstore tables also

columnstorepartitioningsql serversql-server-2016

I have a data warehouse comprised of four clustered columnstore index tables (CCI) and nine rowstore tables. These tables are used only for analytics and the CCI data is inserted from staging tables every 15 minutes. I am looking to optimize query performance by adding partitions and sorting.

All queries of this data are predicated on an integer field with about 350 distinct values.The leftmost CCI has 100M records and 125 columns. There are three child CCIs that have that same integer field. CCI 2 has 15M records and 150 columns, CCI 3 and 4 both have about 30M records and 25 columns each.

Of these 350 distinct integers the distribution of record count in the leftmost table is as follows:

  • 5% Greater than 1M
  • 46% Greater than 100K
  • 83% Greater than 10K

Additionally, there are nine other rowstore tables that also join to the CCIs. These have trickle inserts, are children of the CCIs, and they all contain the same integer field. These rowstores have similar or smaller record volumes, < 10 columns each, two contain LOBS, and two undergo mass-updates frequently (these updates are also predicated on the ID field).

How many partitions should I make?

Should I partition the rowstore tables also?

Are there important considerations I am overlooking?

Note regarding the "sorting" I mentioned earlier:

A date field in the leftmost CCI is often a secondary predicate in these queries, therefore I am looking into re-sorting that CCI by date every four weeks or so as maintenance. I will achieve this sort by dropping the CCI, adding a clustered rowstore index on the date, dropping that index, and then re-adding the CCI with MAXDOP=1. I am also looking at sorting the child CCIs by the join key to their parent.

Best Answer

Benefits of partitioning a CCI:

  1. Query performance can be improved because a minimum level of rowgroup elimination is guaranteed, despite how the data gets loaded or modified. Most of the generic SQL Server partitioning guidance out there doesn't take this into account.

  2. Improved flexibility with maintenance operations in that you can do rebuilds at the partition level or do reorgs at the partition level (after partition switching out). You can also send different partitions to different filegroups, but I need to caution you that doing so will almost never improve performance. Filegroups are a maintenance feature. Increasing the file count can improve performance sometimes. Depending on your storage setup you almost certainly want the data relevant to your queries to be spread over multiple files to improve I/O.

  3. Partition elimination covers more scenarios than rowgroup elimination on the same column. For example, a filter of WHERE ID < 0 OR ID > 10 will not quality for rowgroup elimination but will qualify for partition elimination.

  4. Looping by partition can be helpful when performing maintenance operations that require all rows to be changed. For example, suppose you're adding a new column to a table that can be derived from existing columns in that table. Partitioning allows you to efficiently split that work into batches if desired.

Downsides of partitioning a CCI:

  1. Without maintenance the number of rows in delta rowgroups can dramatically increase. Consider an unpartitioned CCI that is loaded with parallel inserts at MAXDOP 8. At most you'll have 4194304 rows in the delta store. If the table is changed to have 50 partitions it's now possible to have 209715200 rows in the delta store.

  2. Query plans for inserts and deletes into the columnstore may contain a sort operator as a child of the DML operator. If this sort cannot get enough memory you can end up with extreme performance degradations. I recommend only modifying one partition at a time if using parallel insert.

  3. If you choose your partition function unwisely you could end up with partitions that are too small. Many people will point you to the 1048576 row limit for a rowgroup as the ideal size, but personally I consider the benefits of getting there to be overblown. You probably do want to avoid many tiny partitions if you can help it though.

  4. If you have too many partitions in your table or your database then bad things might happen. Unfortunately, this isn't very well defined and it's hard to find a credible source for what "too many partitions" actually means. I've heard of and seen issues with query compilation times. There was a recent answer here about DBCC CHECKTABLE as well.

Applying the above to your scenario: with the row counts that you have you shouldn't run into any of the really bad cases. For query performance, some folks need really fast query execution times and they need to skip as many rowgroups as possible. Others just need a minimum level of rowgroup elimination because most of the work done in the query is outside of the columnstore scans. That makes it difficult for someone on the outside to give you a recommendation for the number of partitions. For the 100 million table, anything from 4-100 could be reasonable.

You could try testing some of your queries with different numbers of rows in the partitions to see how performance changes. That can be simulated by creating copies of the tables or by creating a partition function on one table with deliberate skewness and changing what ID you filter by. If you take what results in good enough query performance and verify that you won't have any issues with loading data then you should be good.

The rowstores aren't relevant to the question, or rather, they're a totally different question. Partitioning is not the right tool to improve performance on rowstore query. I've seen performance gains on systems just by partitioning columnstore tables and leaving the rowstore tables alone.