Sql-server – How many partitions should I make for the clustered columnstore index tables? Should I partition the rowstore tables also

columnstorepartitioningsql serversql-server-2016

I have a data warehouse comprised of four clustered columnstore index tables (CCI) and nine rowstore tables. These tables are used only for analytics and the CCI data is inserted from staging tables every 15 minutes. I am looking to optimize query performance by adding partitions and sorting.

All queries of this data are predicated on an integer field with about 350 distinct values.The leftmost CCI has 100M records and 125 columns. There are three child CCIs that have that same integer field. CCI 2 has 15M records and 150 columns, CCI 3 and 4 both have about 30M records and 25 columns each.

Of these 350 distinct integers the distribution of record count in the leftmost table is as follows:

5% Greater than 1M
46% Greater than 100K
83% Greater than 10K

Additionally, there are nine other rowstore tables that also join to the CCIs. These have trickle inserts, are children of the CCIs, and they all contain the same integer field. These rowstores have similar or smaller record volumes, < 10 columns each, two contain LOBS, and two undergo mass-updates frequently (these updates are also predicated on the ID field).

How many partitions should I make?

Should I partition the rowstore tables also?

Are there important considerations I am overlooking?

Note regarding the "sorting" I mentioned earlier:

A date field in the leftmost CCI is often a secondary predicate in these queries, therefore I am looking into re-sorting that CCI by date every four weeks or so as maintenance. I will achieve this sort by dropping the CCI, adding a clustered rowstore index on the date, dropping that index, and then re-adding the CCI with MAXDOP=1. I am also looking at sorting the child CCIs by the join key to their parent.

Best Answer

Benefits of partitioning a CCI:

Query performance can be improved because a minimum level of rowgroup elimination is guaranteed, despite how the data gets loaded or modified. Most of the generic SQL Server partitioning guidance out there doesn't take this into account.
Improved flexibility with maintenance operations in that you can do rebuilds at the partition level or do reorgs at the partition level (after partition switching out). You can also send different partitions to different filegroups, but I need to caution you that doing so will almost never improve performance. Filegroups are a maintenance feature. Increasing the file count can improve performance sometimes. Depending on your storage setup you almost certainly want the data relevant to your queries to be spread over multiple files to improve I/O.
Partition elimination covers more scenarios than rowgroup elimination on the same column. For example, a filter of WHERE ID < 0 OR ID > 10 will not quality for rowgroup elimination but will qualify for partition elimination.
Looping by partition can be helpful when performing maintenance operations that require all rows to be changed. For example, suppose you're adding a new column to a table that can be derived from existing columns in that table. Partitioning allows you to efficiently split that work into batches if desired.

Downsides of partitioning a CCI:

Without maintenance the number of rows in delta rowgroups can dramatically increase. Consider an unpartitioned CCI that is loaded with parallel inserts at MAXDOP 8. At most you'll have 4194304 rows in the delta store. If the table is changed to have 50 partitions it's now possible to have 209715200 rows in the delta store.
Query plans for inserts and deletes into the columnstore may contain a sort operator as a child of the DML operator. If this sort cannot get enough memory you can end up with extreme performance degradations. I recommend only modifying one partition at a time if using parallel insert.
If you choose your partition function unwisely you could end up with partitions that are too small. Many people will point you to the 1048576 row limit for a rowgroup as the ideal size, but personally I consider the benefits of getting there to be overblown. You probably do want to avoid many tiny partitions if you can help it though.
If you have too many partitions in your table or your database then bad things might happen. Unfortunately, this isn't very well defined and it's hard to find a credible source for what "too many partitions" actually means. I've heard of and seen issues with query compilation times. There was a recent answer here about DBCC CHECKTABLE as well.

Applying the above to your scenario: with the row counts that you have you shouldn't run into any of the really bad cases. For query performance, some folks need really fast query execution times and they need to skip as many rowgroups as possible. Others just need a minimum level of rowgroup elimination because most of the work done in the query is outside of the columnstore scans. That makes it difficult for someone on the outside to give you a recommendation for the number of partitions. For the 100 million table, anything from 4-100 could be reasonable.

You could try testing some of your queries with different numbers of rows in the partitions to see how performance changes. That can be simulated by creating copies of the tables or by creating a partition function on one table with deliberate skewness and changing what ID you filter by. If you take what results in good enough query performance and verify that you won't have any issues with loading data then you should be good.

The rowstores aren't relevant to the question, or rather, they're a totally different question. Partitioning is not the right tool to improve performance on rowstore query. I've seen performance gains on systems just by partitioning columnstore tables and leaving the rowstore tables alone.

Related Solutions

How to sort the data fetched from horizontal partitioned tables

It sounds like your architecture has gone past partitioning and is sharded. If you have system software or an application framework in place which can handle sharding that is where this requirement should be implemented.

In order to get the correct sort sequence you will have to have the results from all involved shards in the same place at the same time. There is no way to get some results from one shard and then get more results from another shard and be able to guarantee any sort sequence the user may choose.

One way (as BriteSponge suggested) would be to define a view which spans all the shards. Let the DBMS deal with the complexities of marshalling the various shards' results sets. This would be my preferred solution if it can be managed. It separates the application from the DB implementation details.

If you have to write this yourself there is some hope in the fact you have paging in place. This limits the maximum number or rows you will show in the UI and, therefore, the maximum number of rows that have to be returned from each shard. From the date range you can determine which shards are required. To each shard you submit a query of the form

select <columns>
from <table>
where date between <start> and <end>
order by <user selection>
fetch first <number> rows only

The <number> will be the maximum number of rows you want to show in a single page in the UI. While it is unlikely they will all come from a single shard it is possible so that's what the limit must be.

If your version of Oracle does not support the final limiting clause the following will still work, but will be much less efficient.

Perform a multi-way merge on the resultsets. Stop when you reach the limit for one page in the UI. Discard the remainder of the resultsets.

When the user pages forward, re-submit the SELECT using standard paging techniques.

When the user chooses a different sort sequence amend the SQL to match and resubmit to each shard.

My preference would be to implement this server side in a stored procedure. This will keep application and storage as separate as possible. Also, rather than a multi-way merge, the various resultsets can all go into a single working table which is then sorted and the correct number of rows returned to the application.

If you do implement it client-side, rather than discarding the unused portion of the resultsets they could be cached. This which may reduce the load for subsequent pages at the cost of additional complexity.

Sql-server – Code creating clustered columnstore index while maintaining row order

A clustered columnstore index is fundamentally different from a clustered rowstore index. You may have noticed there is no key column specification for a clustered columnstore index. That's right: a clustered columnstore index is an index with no keys - all columns are 'included'.

The most intuitive description I have heard for a clustered columnstore index is to think of it as a column-oriented heap table (where the 'RID' is rowgroup_id, row_number).

If you need indexes to support direct ordering and/or point/small range selections, you can create updateable rowstore b-tree indexes on top of clustered columnstore in SQL Server 2016.

In many cases this is simply not necessary, since columnstore access and batch mode sorting is so fast. Many of the things people 'know' about rowstore performance need to be relearned for columnstore. Scans and hashes are good :)

That said, of course columnstore has a structure to its row groups (and metadata about min/max values in each segment), which can be useful in queries that can benefit from row group/segment elimination.

One important technique in this area is to first create a clustered rowstore index with the desired ordering, then create the clustered columnstore index using the WITH (DROP_EXISTING = ON, MAXDOP = 1) option. In your example:

CREATE [UNIQUE] CLUSTERED INDEX idx 
ON dbo.tab1_cstore (id, time)
WITH (MAXDOP = 1);

CREATE CLUSTERED COLUMNSTORE INDEX idx 
ON dbo.tab1_cstore
WITH (DROP_EXISTING = ON, MAXDOP = 1);

Care is needed to maintain the benefits of row group/segment elimination over time. Also, while columnstore is already implicitly partitioned by row group, but you can explicitly partition it as well.

I'm not 100% sure what you're looking to test, but it is true that the 'order' of values within a segment is determined by the compression algorithm. My point about creating the columnstore index with DROP_EXISTING is about the ordering of data flowing into the segment creation process, so that segments overall will be ordered in a particular way. Within the segment, all bets are off.

Best Answer

Related Solutions

How to sort the data fetched from horizontal partitioned tables

Sql-server – Code creating clustered columnstore index while maintaining row order

Related Question