SQL Server – the correct and performant pattern for loading (and querying) a large clustered columnstore table

Tags: columnstore, etl, sql-server

I am doing some tests with clustered columnstore tables – both partitioned and not.
In my research, I've seen flavors of this pattern:

-- step 1: make new table
SELECT TOP 0 * INTO [Fact].[Order_CCI] FROM [Fact].[Order];

-- step 2: rowstore by date

CREATE CLUSTERED INDEX [CCI_Order] ON [Fact].[Order_CCI]
(
  [Order Date Key]
) WITH (SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF) 
ON [ps_OrderDate]([Order Date Key])

-- step 3: insert data

INSERT INTO [Fact].[Order_CCI] WITH(TABLOCK)
SELECT
  [Order Key]
  ,[City Key]
  ,[Customer Key]
  ,[Stock Item Key]
  ,[Order Date Key]
  ....

FROM [Fact].[Order]

-- step 4: replace rowstore with columnstore

CREATE CLUSTERED COLUMNSTORE INDEX [CCI_Order]
ON [Fact].[Order_CCI]
WITH (DROP_EXISTING = ON, COMPRESSION_DELAY = 0)

It uses a rowstore clustered index to sort the data by the desired
column before converting the table to a CCI.
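
As an aside, one way to sanity-check the result of step 4 is the rowgroup physical-stats DMV, which shows whether the rowgroups came out fully packed and why any were trimmed. This is just a sketch and reuses the [Fact].[Order_CCI] name from the example above:

-- Sketch: inspect rowgroup state, size, and trim reason after the conversion.
SELECT rg.partition_number,
       rg.row_group_id,
       rg.state_desc,
       rg.total_rows,
       rg.trim_reason_desc
FROM sys.dm_db_column_store_row_group_physical_stats AS rg
WHERE rg.object_id = OBJECT_ID('[Fact].[Order_CCI]')
ORDER BY rg.partition_number, rg.row_group_id;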

But in other articles or posts, I see an explicit ORDER BY in the INSERT statement:

INSERT INTO [Fact].[Order_CCI] WITH(TABLOCK)
SELECT
  [Order Key]
  ,[City Key]
  ,[Customer Key]
  ,[Stock Item Key]
  ,[Order Date Key]
  ....

FROM [Fact].[Order]
ORDER BY [Order Date Key]

I have these questions:

Is this the most performant code pattern for a large (500M-row) table
where partitioning is employed? In contrast, this author
[https://janizajcbi.com/2018/09/14/row-store-to-column-store-story/]
populates the partitions in a loop (a rough sketch of that idea follows).
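
For context, here is a minimal, hypothetical sketch of that partition-at-a-time idea. It is illustrative only, not the linked author's exact script; the partition function name pf_OrderDate (assumed to sit behind ps_OrderDate) and the reuse of [Fact].[Order] and [Fact].[Order_CCI] are my assumptions:

-- Illustrative sketch: load one partition's worth of rows per pass.
-- Assumes [Fact].[Order_CCI] already has its partitioned index in place
-- and that pf_OrderDate is the partition function behind ps_OrderDate.
DECLARE @PartitionNumber int = 1;
DECLARE @MaxPartition int =
    (SELECT fanout FROM sys.partition_functions WHERE name = 'pf_OrderDate');

WHILE @PartitionNumber <= @MaxPartition
BEGIN
    INSERT INTO [Fact].[Order_CCI] WITH (TABLOCK)
    SELECT o.*
    FROM [Fact].[Order] AS o
    WHERE $PARTITION.pf_OrderDate(o.[Order Date Key]) = @PartitionNumber;

    SET @PartitionNumber += 1;
END;

Whether a loop like this beats one big insert will depend on memory grants, DOP, and partition count, so it is worth testing both.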

Is the ORDER BY needed? I ran this pattern on a small (30M-row) table both with and
without the ORDER BY, and in both cases the CCI metadata shows the segments to be
date-aligned/sorted. I would think that the ORDER BY is not needed.

But I just ran the same pattern on the large table, and the segments are not
date-aligned/sorted. Note, though, that it was a much larger table and was also partitioned.

Below is the SQL pattern that DID NOT partition but DID result in sorted segments:

CREATE TABLE dbo.RiskExposure (
 ...
);
GO

INSERT INTO dbo.RiskExposure WITH (TABLOCK)
(
  ...
)
SELECT
...
FROM dbo.RiskExposureStage
WHERE Close_of_Bus_ > '2020-08-31'
GO

CREATE CLUSTERED INDEX CCI_RiskExposure
  ON dbo.RiskExposure ([Close_of_Bus_]);
GO

CREATE CLUSTERED COLUMNSTORE INDEX CCI_RiskExposure
  ON dbo.RiskExposure WITH (MAXDOP = 1, DROP_EXISTING = ON);

Below is more complete SQL for the test I ran that DID partition but DID NOT result in sorted segments:

CREATE TABLE dbo.RiskExposure (
  ...
);
GO


INSERT INTO dbo.RiskExposure WITH (TABLOCK)
(
  ...
)
SELECT
  ...
FROM dbo.RiskExposureStage
WHERE Close_of_Bus_ > '2020-08-31' -- test with a subset
GO

CREATE CLUSTERED INDEX CCI_RiskExposure
    ON dbo.RiskExposure ( [Close_of_Bus_] )
        WITH (DATA_COMPRESSION = PAGE)
            ON ps_QuarterlyPartitionScheme3m( [Close_of_Bus_] ); 

CREATE CLUSTERED COLUMNSTORE INDEX CCI_RiskExposure
    ON dbo.RiskExposure
        WITH (DROP_EXISTING = ON)
            ON ps_QuarterlyPartitionScheme3m( [Close_of_Bus_] ); 

GO

ALTER INDEX CCI_RiskExposure
    ON dbo.RiskExposure
    REORGANIZE WITH (COMPRESS_ALL_ROW_GROUPS = ON);

-- get stats

SELECT
    partitions.partition_number,
    column_store_segments.segment_id,
    column_store_segments.min_data_id,
    column_store_segments.max_data_id,
    column_store_segments.row_count
FROM sys.column_store_segments
  INNER JOIN sys.partitions
    ON column_store_segments.hobt_id = partitions.hobt_id
  INNER JOIN sys.indexes
    ON indexes.index_id = partitions.index_id
      AND indexes.object_id = partitions.object_id
  INNER JOIN sys.tables
    ON tables.object_id = indexes.object_id
  INNER JOIN sys.columns
    ON tables.object_id = columns.object_id
      AND column_store_segments.column_id = columns.column_id
WHERE tables.name = 'RiskExposure'
  AND columns.name = 'Close_of_Bus_'
ORDER BY tables.name, columns.name, partitions.partition_number, column_store_segments.segment_id;

Here's a subset of the stats for the test where the table was not partitioned.
Note that the rowgroups are packed (except for #35) and ordered.

partition_number    segment_id  min_data_id max_data_id row_count
1   0   737668  737673  1048576
1   1   737673  737677  1048576
1   2   737677  737680  1048576
1   3   737680  737684  1048576
1   4   737684  737686  1048576
1   5   737686  737689  1048576
1   6   737689  737691  1048576
1   7   737691  737693  1048576
1   8   737693  737695  1048576
....
1   37  737786  737789  1048576
1   38  737789  737794  1048576
1   39  737794  737799  742086
1   40  737799  737799  34421

Here are the stats for the test where partitioning was employed. Note
that the rowgroups are not ordered; I'm showing just one middle
partition for brevity.

partition_number    segment_id  min_data_id max_data_id row_count

17  0   737698  737721  1048576
17  1   737698  737708  1048576
17  2   737698  737716  1048576
17  3   737698  737711  1048576
17  4   737712  737728  1048576
17  5   737716  737717  324458
17  6   737717  737744  1048576
17  7   737721  737723  1048576
17  8   737708  737768  1048576
17  9   737711  737788  1048576
17  10  737728  737734  1048576
17  11  737744  737747  1048576
17  12  737788  737789  643250
17  13  737723  737726  1048576
17  14  737768  737772  1048576
17  15  737734  737739  1048576
17  16  737747  737750  1048576
17  17  737726  737728  1048576
17  18  737772  737776  1048576
17  19  737728  737728  193276
17  20  737739  737742  1048576
17  21  737750  737752  1048576
17  22  737742  737743  314208
17  23  737776  737779  1048576
17  24  737752  737755  1048576
17  25  737779  737781  1048576
17  26  737755  737757  1048576
17  27  737781  737784  1048576
17  28  737757  737760  1048576
17  29  737784  737786  920474
17  30  737760  737764  686860
17  31  737716  737789  967708
17  32  737728  737743  507484

After re-running the partitioned test with MAXDOP = 1 added to the CREATE CLUSTERED COLUMNSTORE INDEX statement, I am getting sorted segments just like in the non-partitioned test. So the error was mine: I left out that critical setting in the partitioned test.

Now I am contemplating the correct logic for inserting new data going forward. I often see this suggested (a rough sketch follows below):

1. Create a staging table.
2. Load the staging table.
3. Add a clustered index to the staging table.
4. INSERT INTO <Final Table> SELECT <column list> FROM <Staging Table>.
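
For what it's worth, here is a minimal sketch of that staging pattern, with assumed names (dbo.RiskExposure_Load as the staging table) and the Close_of_Bus_ sort column from the tests above; it is a sketch of the idea, not tested code:

-- Illustrative sketch only; the staging-table name and load step are assumptions.
-- 1. Create a staging table with the same shape as the final CCI table.
SELECT TOP 0 * INTO dbo.RiskExposure_Load FROM dbo.RiskExposure;

-- 2. Load the staging table with the new rows (bulk insert, SSIS, etc.).

-- 3. Add a clustered rowstore index so the staged rows are physically sorted.
CREATE CLUSTERED INDEX IX_RiskExposure_Load
    ON dbo.RiskExposure_Load (Close_of_Bus_);

-- 4. Insert into the final table. With TABLOCK and batches of 102,400+ rows
--    the insert can go directly into compressed rowgroups, but the insert
--    itself still does not guarantee that the segments end up ordered.
INSERT INTO dbo.RiskExposure WITH (TABLOCK)
SELECT *
FROM dbo.RiskExposure_Load;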

Best Answer

The ORDER BY on INSERT is not guaranteed to load the data in order. In particular, the INSERT ... SELECT can be parallelized, and the resulting table will not be loaded in any particular order.

Even single-threaded, the ORDER BY is ignored by INSERT ... SELECT. For example:

drop table if exists test 
go
select *
into test 
from FactInternetSales where 1=0

create clustered columnstore index cci_test on test

insert into test --with (tablock)
select s.* from FactInternetSales s
order by ProductKey 

The insert doesn't have a sort operator:

(execution plan screenshot: no Sort operator appears)

This is documented here:

When used with a SELECT...INTO statement to insert rows from another source, the ORDER BY clause does not guarantee the rows are inserted in the specified order.

SELECT - ORDER BY Clause (Transact-SQL)

And if the table is big enough, you can check the column segments to see that they are not in fact ordered by ProductKey.

select object_name(p.object_id) table_name, s.segment_id, c.name column_name,  s.min_data_id, s.max_data_id
from sys.column_store_segments s
join sys.partitions p
  on s.hobt_id = p.hobt_id 
left join sys.columns c
  on c.object_id = p.object_id 
  and c.column_id = s.column_id
where c.name = 'ProductKey'