SQL Server – Rebuilding indexes for a partitioned table with 300 partitions

cardinality-estimates, index-maintenance, partitioning, sql-server, sql-server-2016

Scenario

The partitioned table is empty, and I am loading data for one partition, which holds 180k rows. I disabled the indexes, loaded the data, and rebuilt the indexes after the load completed.

Issue

While examining the query plan for the index rebuild, I can see that 'Estimated Number of Rows' is 180k, but 'Actual Number of Rows' is 300 partitions * 180,000 rows = 54 million rows, even though I am loading data for only one partition.

[execution plan screenshots]

Can you shed some light on this behavior and how to overcome this issue?

Microsoft SQL Server 2016 (SP2) (KB4052908) – 13.0.5026.0 (X64) Mar 18 2018 09:11:49 Copyright (c) Microsoft Corporation Enterprise Edition (64-bit) on Windows Server 2012 R2 Standard 6.3 (Build 9600: )

Best Answer

While examining the query plan of rebuilt indexes, I can see the 'Estimated number of rows' is 180k but 'Actual number of rows' is 300 partitions * 180,000 rows = 54 Million rows, even though I am loading data for only one partition.

Your maths are a little off here. The image provided shows an estimated 1,853,250 rows (not 180k) and a total of 674,582,272 rows (not 54 million) over 364 iterations (roughly 364 * 1,853,250).

Still, the question remains: Why would SQL Server read 674 million rows to rebuild a nonclustered index when the whole table only holds 1,853,250 rows?

The execution plan shown is a colocated join. It performs the following operations:

  1. The Constant Scan holds the partition numbers for each partition of the table.
  2. For each partition:
    1. Fully scan the table (Clustered Index Scan)
    2. Sort the rows into nonclustered index order (Sort)
    3. Filter the rows for the current partition (Filter)
    4. Insert the rows into the nonclustered index (Index Insert)

This is obviously a terribly inefficient way to go about things, but the optimizer is not crazy: the general idea is sound. The problem is that the Filter is applied too late. Normally, the 'current partition' predicate would be pushed down into the Clustered Index Scan, where it would appear as a seek that selects only the current partition.

How It Should Work

The usual parallel plan generated looks like this:

Normal parallel plan

Notice the lack of a Filter operator. Its predicate has been pushed into the scan:

Scan with seek

Yes, this is a scan with a seek. The idea is to fetch rows from only the current partition on each iteration of the loop, so each iteration builds one partition of the nonclustered index.
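To make the per-partition idea concrete, each loop iteration is logically equivalent to something like the following query (a sketch only, using the demo objects created later in this answer; `@p` stands for the current partition number supplied by the Constant Scan):

```sql
-- Logical shape of one loop iteration in the efficient plan:
-- fetch only the current partition's rows, sorted into index key order.
DECLARE @p int = 2;  -- current partition number for this iteration

SELECT T.c2, T.c1
FROM dbo.Test AS T
WHERE $PARTITION.PF(T.c1) = @p   -- pushed into the scan as a partition seek
ORDER BY T.c2;
```

In the inefficient plan from the question, the equivalent of the `WHERE` clause runs as a separate Filter after the Sort, so every iteration scans and sorts the entire table.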

Your case

I managed to reproduce the execution plan shown in the question by using a float or bigint data type as the partitioning column. Some aspect of the cardinality estimation for the Filter (probably a guess) prevents the optimizer from pushing the 'current partition' predicate past the Sort and into the Clustered Index Scan.

The following demo creates a 300-partition table with a disabled nonclustered index, loads one million rows into a single partition, then rebuilds the nonclustered index. It uses the bigint data type:

DROP TABLE IF EXISTS dbo.Test;

IF EXISTS (SELECT * FROM sys.partition_schemes AS PS WHERE PS.[name] = N'PS')
    DROP PARTITION SCHEME PS;

IF EXISTS (SELECT * FROM sys.partition_functions AS PF WHERE PF.[name] = N'PF')
    DROP PARTITION FUNCTION PF;
GO
CREATE PARTITION FUNCTION PF (bigint)
AS RANGE RIGHT 
FOR VALUES 
(
    0,
    1000000,2000000,3000000,4000000,5000000,6000000,7000000,8000000,9000000,10000000,
    11000000,12000000,13000000,14000000,15000000,16000000,17000000,18000000,19000000,
    20000000,21000000,22000000,23000000,24000000,25000000,26000000,27000000,28000000,
    29000000,30000000,31000000,32000000,33000000,34000000,35000000,36000000,37000000,
    38000000,39000000,40000000,41000000,42000000,43000000,44000000,45000000,46000000,
    47000000,48000000,49000000,50000000,51000000,52000000,53000000,54000000,55000000,
    56000000,57000000,58000000,59000000,60000000,61000000,62000000,63000000,64000000,
    65000000,66000000,67000000,68000000,69000000,70000000,71000000,72000000,73000000,
    74000000,75000000,76000000,77000000,78000000,79000000,80000000,81000000,82000000,
    83000000,84000000,85000000,86000000,87000000,88000000,89000000,90000000,91000000,
    92000000,93000000,94000000,95000000,96000000,97000000,98000000,99000000,100000000,
    101000000,102000000,103000000,104000000,105000000,106000000,107000000,108000000,109000000,
    110000000,111000000,112000000,113000000,114000000,115000000,116000000,117000000,118000000,119000000,
    120000000,121000000,122000000,123000000,124000000,125000000,126000000,127000000,128000000,129000000,
    130000000,131000000,132000000,133000000,134000000,135000000,136000000,137000000,138000000,139000000,
    140000000,141000000,142000000,143000000,144000000,145000000,146000000,147000000,148000000,149000000,
    150000000,151000000,152000000,153000000,154000000,155000000,156000000,157000000,158000000,159000000,
    160000000,161000000,162000000,163000000,164000000,165000000,166000000,167000000,168000000,169000000,
    170000000,171000000,172000000,173000000,174000000,175000000,176000000,177000000,178000000,179000000,
    180000000,181000000,182000000,183000000,184000000,185000000,186000000,187000000,188000000,189000000,
    190000000,191000000,192000000,193000000,194000000,195000000,196000000,197000000,198000000,199000000,
    200000000,201000000,202000000,203000000,204000000,205000000,206000000,207000000,208000000,209000000,
    210000000,211000000,212000000,213000000,214000000,215000000,216000000,217000000,218000000,219000000,
    220000000,221000000,222000000,223000000,224000000,225000000,226000000,227000000,228000000,229000000,
    230000000,231000000,232000000,233000000,234000000,235000000,236000000,237000000,238000000,239000000,
    240000000,241000000,242000000,243000000,244000000,245000000,246000000,247000000,248000000,249000000,
    250000000,251000000,252000000,253000000,254000000,255000000,256000000,257000000,258000000,259000000,
    260000000,261000000,262000000,263000000,264000000,265000000,266000000,267000000,268000000,269000000,
    270000000,271000000,272000000,273000000,274000000,275000000,276000000,277000000,278000000,279000000,
    280000000,281000000,282000000,283000000,284000000,285000000,286000000,287000000,288000000,289000000,
    290000000,291000000,292000000,293000000,294000000,295000000,296000000,297000000,298000000,299000000,
    300000000
);
CREATE PARTITION SCHEME PS
AS PARTITION PF
ALL TO ([PRIMARY]);
GO
CREATE TABLE dbo.Test
(
    c1 bigint NOT NULL,
    c2 bigint NOT NULL,

    CONSTRAINT [PK dbo.Test (c1)]
    PRIMARY KEY CLUSTERED (c1)
    ON PS (c1),

    INDEX [IX dbo.Test (c2)]
    NONCLUSTERED (c2)
    ON PS (c1)
)
ON PS (c1);
GO
ALTER INDEX [IX dbo.Test (c2)]
ON dbo.Test 
DISABLE;
GO
INSERT dbo.Test WITH (TABLOCKX)
(
    c1,
    c2
)
SELECT
    CONVERT(bigint, SV1.number * 1000 + SV2.number),
    CONVERT(bigint, (SV1.number * 1000 + SV2.number) * 2)
FROM master.dbo.spt_values AS SV1
CROSS JOIN master.dbo.spt_values AS SV2
WHERE
    SV1.[type] = N'P'
    AND SV2.[type] = N'P'
    AND SV1.number >= 0
    AND SV1.number < 1000
    AND SV2.number >= 0
    AND SV2.number < 1000;
GO
ALTER INDEX [IX dbo.Test (c2)]
ON dbo.Test
REBUILD;
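To confirm that only one partition actually holds rows before or after the rebuild, you can inspect the catalog (a simple check against the demo objects above):

```sql
-- Per-partition row counts for the clustered index of dbo.Test
SELECT P.partition_number, P.[rows]
FROM sys.partitions AS P
WHERE P.[object_id] = OBJECT_ID(N'dbo.Test')
  AND P.index_id = 1        -- clustered index
  AND P.[rows] > 0;         -- only partition 2 should appear
```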

The index rebuild plan takes around 90 seconds to run:

bigint data type plan

By contrast, when the partitioning column is an integer data type, the rebuild takes two seconds (with the efficient scan-with-a-seek plan previously shown).

This might be because the partition id is natively an integer (so no conversion is necessary in the filter predicate), but the details are unclear at this point.
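One way to see this is that the `$PARTITION` function returns int regardless of the partitioning column's type, so comparing the partition id with a bigint expression involves a conversion (a small illustration; the conversion explanation above remains speculative):

```sql
-- $PARTITION returns int even for a bigint partition function.
SELECT $PARTITION.PF(CONVERT(bigint, 1500000)) AS partition_number;
-- With the RANGE RIGHT function above, 1500000 falls in partition 3
-- (boundary 1000000 inclusive up to 2000000 exclusive).
```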

A workaround

The above only reproduces for me if I use the default (modern) cardinality estimation model. It is not possible to specify a hint directly on an index rebuild statement, so the following uses documented trace flag 9481 to temporarily use the legacy (original) cardinality estimation model instead:

ALTER INDEX [IX dbo.Test (c2)]
ON dbo.Test 
DISABLE;

DBCC TRACEON (9481);

ALTER INDEX [IX dbo.Test (c2)]
ON dbo.Test
REBUILD;

DBCC TRACEOFF (9481);

This produces the optimal execution plan and completes in around two seconds.
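If session-level trace flags are awkward in your environment, a database-wide alternative on SQL Server 2016 is the legacy cardinality estimation database-scoped configuration. Note this affects all queries in the database for as long as it is enabled, not just the rebuild:

```sql
-- Switch the whole database to the legacy CE model (use with care)
ALTER DATABASE SCOPED CONFIGURATION
SET LEGACY_CARDINALITY_ESTIMATION = ON;

-- ...rebuild the index here...

ALTER DATABASE SCOPED CONFIGURATION
SET LEGACY_CARDINALITY_ESTIMATION = OFF;
```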

The real solution

There is a better way to add or load data into a single partition of a table. The idea is to SWITCH the existing partition out into a standalone table, load the data into that, then SWITCH the partition back in:

DROP TABLE IF EXISTS dbo.TestSwitchP2;
GO
-- Table with constraints limiting values to the target partition
CREATE TABLE dbo.TestSwitchP2
(
    -- Column types must match dbo.Test exactly for SWITCH to succeed
    c1 bigint NOT NULL PRIMARY KEY,
    c2 bigint NOT NULL,

    CHECK (c1 >= 0 AND c1 < 1000000)
)
ON [PRIMARY];
GO
-- Switch existing rows into the working table
ALTER TABLE dbo.Test
SWITCH PARTITION 2
TO dbo.TestSwitchP2;
GO
-- Add new rows (assumes these key values are not already present
-- in the switched-out partition, or the PRIMARY KEY will reject them)
INSERT dbo.TestSwitchP2 WITH (TABLOCKX)
(
    c1,
    c2
)
SELECT
    CONVERT(bigint, SV1.number * 1000 + SV2.number),
    CONVERT(bigint, (SV1.number * 1000 + SV2.number) * 2)
FROM master.dbo.spt_values AS SV1
CROSS JOIN master.dbo.spt_values AS SV2
WHERE
    SV1.[type] = N'P'
    AND SV2.[type] = N'P'
    AND SV1.number >= 0
    AND SV1.number < 1000
    AND SV2.number >= 0
    AND SV2.number < 1000;
GO
-- Create a compatible nonclustered index
CREATE INDEX i ON dbo.TestSwitchP2 (c2);
GO
-- Switch the partition back in to the main table
ALTER TABLE dbo.TestSwitchP2
SWITCH TO dbo.Test 
PARTITION 2;

Partition switching is a metadata-only operation, and typically completes instantly.

The implementation above loads the one million new rows in less than two seconds. Note there is no need to disable the nonclustered index on the main table.
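To verify the switch worked as expected, a quick per-partition count against the main table should show all the rows sitting in partition 2 (assuming the demo objects above):

```sql
-- Rows should appear only in partition 2 after the switch
SELECT $PARTITION.PF(T.c1) AS partition_number,
       COUNT_BIG(*) AS row_count
FROM dbo.Test AS T
GROUP BY $PARTITION.PF(T.c1);
```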