I am doing some test with clustered columnstore tables – both partitioned and not.
In my research, I've seen flavors of this pattern:
-- step 1: make new table
SELECT TOP 0 * INTO [Fact].[Order_CCI] FROM [Fact].[Order];
-- step 2: rowstore by date
CREATE CLUSTERED INDEX [CCI_Order] ON [Fact].[Order_CCI]
(
[Order Date Key]
) WITH (SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF)
ON [ps_OrderDate]([Order Date Key])
-- step 3: insert data
INSERT INTO [Fact].[Order_CCI] WITH(TABLOCK)
SELECT
[Order Key]
,[City Key]
,[Customer Key]
,[Stock Item Key]
,[Order Date Key]
....
FROM [Fact].[Order]
-- step 4: replace rowstore with columnstore
CREATE CLUSTERED COLUMNSTORE INDEX [CCI_Order]
ON [Fact].[Order_CCI]
WITH (DROP_EXISTING = ON, COMPRESSION_DELAY = 0)
It uses a rowstore index to sort the data by the desired
column before switching it to a CCI.
But in other articles or posts, I see the presence of an explicit "ORDER BY"
in the insert statement:
INSERT INTO [Fact].[Order_CCI] WITH(TABLOCK)
SELECT
[Order Key]
,[City Key]
,[Customer Key]
,[Stock Item Key]
,[Order Date Key]
....
FROM [Fact].[Order]
ORDER BY [Order Date Key]
I have these questions.
Is this the most performant code pattern for a large (500m row) table
where partitioning is employed? In contrast, this author
[https://janizajcbi.com/2018/09/14/row-store-to-column-store-story/]
populates the partitions in a loop.
Is the ORDER BY needed? I ran this pattern on a small (30m row) table both with and
without the ORDER BY and in both cases, the CCI metadata shows segments to be
date aligned/sorted. I would think that the ORDER BY is not needed.
But I just ran the same on the large table, and the segments are not
date aligned/sorted. But again note that it was a much larger table and was also partitioned.
Below is the SQL pattern that DID NOT partition but DID result in sorted segments
CREATE TABLE dbo.RiskExposure (
...
);
GO
INSERT INTO dbo.RiskExposure WITH (TABLOCK)
(
...
)
SELECT
...
FROM dbo.RiskExposureStage
WHERE Close_of_Bus_ > '2020-08-31'
GO
CREATE CLUSTERED INDEX CCI_RiskExposure
ON dbo.RiskExposure ([Close_of_Bus_]);
GO
CREATE CLUSTERED COLUMNSTORE INDEX CCI_RiskExposure
ON dbo.Risk WITH (MAXDOP = 1, DROP_EXISTING = ON);
Below is a more complete SQL for the test which I ran that DID partition but DID NOT result in sorted segments.
CREATE TABLE dbo.RiskExposure (
...
);
GO
INSERT INTO dbo.RiskExposure WITH (TABLOCK)
(
...
)
SELECT
...
FROM dbo.RiskExposureStage
WHERE Close_of_Bus_ > '2020-08-31' -- test with a subset
GO
CREATE CLUSTERED INDEX CCI_RiskExposure
ON dbo.RiskExposure ( [Close_of_Bus_] )
WITH (DATA_COMPRESSION = PAGE)
ON ps_QuarterlyPartitionScheme3m( [Close_of_Bus_] );
CREATE CLUSTERED COLUMNSTORE INDEX CCI_RiskExposure
ON dbo.RiskExposure
WITH (DROP_EXISTING = ON)
ON ps_QuarterlyPartitionScheme3m( [Close_of_Bus_] );
GO
ALTER INDEX CCI_RiskExposure
ON dbo.RiskExposure
REORGANIZE WITH (COMPRESS_ALL_ROW_GROUPS = ON);
-- get stats
SELECT
partitions.partition_number,
column_store_segments.segment_id,
column_store_segments.min_data_id,
column_store_segments.max_data_id,
column_store_segments.row_count
FROM sys.column_store_segments
INNER JOIN sys.partitions
ON column_store_segments.hobt_id = partitions.hobt_id
INNER JOIN sys.indexes
ON indexes.index_id = partitions.index_id
AND indexes.object_id = partitions.object_id
INNER JOIN sys.tables
ON tables.object_id = indexes.object_id
INNER JOIN sys.columns
ON tables.object_id = columns.object_id
AND column_store_segments.column_id =
columns.column_id
WHERE tables.name = 'RiskExposure'
AND columns.name = 'Close_of_bus_'
ORDER BY tables.name, columns.name, partitions.partition_number, column_store_segments.segment_id;
Here's a subset of the stats for the test where the table was not partitioned.
Note that the rowgroups are packed (except for #35) and ordered.
partition_number segment_id min_data_id max_data_id row_count
1 0 737668 737673 1048576
1 1 737673 737677 1048576
1 2 737677 737680 1048576
1 3 737680 737684 1048576
1 4 737684 737686 1048576
1 5 737686 737689 1048576
1 6 737689 737691 1048576
1 7 737691 737693 1048576
1 8 737693 737695 1048576
....
1 37 737786 737789 1048576
1 38 737789 737794 1048576
1 39 737794 737799 742086
1 40 737799 737799 34421
Here are the stats for the test where partitions were employed. Note
that the rowgroups are not ordered. Note I'm showing just the middle
partition for brevity.
partition_number segment_id min_data_id max_data_id row_count
17 0 737698 737721 1048576
17 1 737698 737708 1048576
17 2 737698 737716 1048576
17 3 737698 737711 1048576
17 4 737712 737728 1048576
17 5 737716 737717 324458
17 6 737717 737744 1048576
17 7 737721 737723 1048576
17 8 737708 737768 1048576
17 9 737711 737788 1048576
17 10 737728 737734 1048576
17 11 737744 737747 1048576
17 12 737788 737789 643250
17 13 737723 737726 1048576
17 14 737768 737772 1048576
17 15 737734 737739 1048576
17 16 737747 737750 1048576
17 17 737726 737728 1048576
17 18 737772 737776 1048576
17 19 737728 737728 193276
17 20 737739 737742 1048576
17 21 737750 737752 1048576
17 22 737742 737743 314208
17 23 737776 737779 1048576
17 24 737752 737755 1048576
17 25 737779 737781 1048576
17 26 737755 737757 1048576
17 27 737781 737784 1048576
17 28 737757 737760 1048576
17 29 737784 737786 920474
17 30 737760 737764 686860
17 31 737716 737789 967708
17 32 737728 737743 507484
After re-running the query that does partitioning with the addition of MAXDOP=1
, I am getting sorted segments just like in the non-partitioned test. So my error in leaving out that critical setting for the partitioned test. Now I am contemplating what is the correct logic for inserting new data. I often see this suggested: 1. create Staging Table 2. load Staging Table 3. add clustered index to Staging Table 4. INSERT INTO <Final Table> SELECT <list columns> FROM <Staging Table>
Best Answer
The ORDER BY on INSERT is not guaranteed to load the data in order. In particular the INSERT . . . SELECT can be parallelized, and the resulting table will not be loaded in any particular order.
Even single-threaded, the ORDER BY is ignored by INSERT ... SELECT. e g
The insert doesn't have a sort operator:
This is documented here:
SELECT - ORDER BY Clause (Transact-SQL)
And if the table is big enough, you can check the column segments to see that they are not in fact ordered by ProdctKey.