Sql-server – Code creating clustered columnstore index while maintaining row order

columnstoresql serversql-server-2016

I want to convert a rowstore table to a columnstore table by creating a clustered columnstore index. There are three columns in the table: id,time, and value.

The table is ordered by id and time before creating columnstore index; however, after creating columnstore index, the row order is messed up. I thought it might be due to the parallelism and added the maxdop = 1 option, but that didn't fix the problem. Can anyone help me with this?

Here is the code creating tables and indexs:

-- creating rowstore table
drop table if exists tab1_rstore
select id, time, value
into tab1_rstore
from tab0
order by id_loan, period
option(maxdop 1)

-- creating clustered index on rowstore table
create clustered index idx on tab1_rstore (id,time)

-- creating columnstore table 
select * 
into tab1_cstore
from tab1_rstore
    option(maxdop 1)

-- comparing the first two rows from these two tables
    select top 2 *
from tab1_rstore

    select top 2 *
from tab1_cstore

The screenshot of the query results:

Top two rows before c-index

-- creating clustered columnstore index
create clustered columnstore index idx on tab1_cstore 
with (maxdop = 1) 

-- comparing the top two rows again
select top 2 *
from tab1_rstore

select top 2 *
from tab1_cstore

The screenshot of the query results with columnstore index:

enter image description here

My understanding is that the order of rows is determined by the compression algorithm and there is nothing we can do about it, see the limitation and restriction in the document here with the following quote:

Cannot include the ASC or DESC keywords for sorting the index.
Columnstore indexes are ordered according to the compression
algorithms. Sorting would eliminate many of the performance benefits.

I'm using SQL Server 2016 Developer Edition on Windows 10 64bit.

Best Answer

A clustered columnstore index is fundamentally different from a clustered rowstore index. You may have noticed there is no key column specification for a clustered columnstore index. That's right: a clustered columnstore index is an index with no keys - all columns are 'included'.

The most intuitive description I have heard for a clustered columnstore index is to think of it as a column-oriented heap table (where the 'RID' is rowgroup_id, row_number).

If you need indexes to support direct ordering and/or point/small range selections, you can create updateable rowstore b-tree indexes on top of clustered columnstore in SQL Server 2016.

In many cases this is simply not necessary, since columnstore access and batch mode sorting is so fast. Many of the things people 'know' about rowstore performance need to be relearned for columnstore. Scans and hashes are good :)

That said, of course columnstore has a structure to its row groups (and metadata about min/max values in each segment), which can be useful in queries that can benefit from row group/segment elimination.

One important technique in this area is to first create a clustered rowstore index with the desired ordering, then create the clustered columnstore index using the WITH (DROP_EXISTING = ON, MAXDOP = 1) option. In your example:

CREATE [UNIQUE] CLUSTERED INDEX idx 
ON dbo.tab1_cstore (id, time)
WITH (MAXDOP = 1);

CREATE CLUSTERED COLUMNSTORE INDEX idx 
ON dbo.tab1_cstore
WITH (DROP_EXISTING = ON, MAXDOP = 1);

Care is needed to maintain the benefits of row group/segment elimination over time. Also, while columnstore is already implicitly partitioned by row group, but you can explicitly partition it as well.

I'm not 100% sure what you're looking to test, but it is true that the 'order' of values within a segment is determined by the compression algorithm. My point about creating the columnstore index with DROP_EXISTING is about the ordering of data flowing into the segment creation process, so that segments overall will be ordered in a particular way. Within the segment, all bets are off.