Sql-server – How to keep smaller transaction logs files while rebuilding index of very large table

bulk-insertcolumnstoresql serversql-server-2016transaction-log

I need all the help that I can get to get this index rebuild successful. Especially an expert advise on transaction log management.

Background

The target database is a DW database hosting 3 big tables in clustered columnstore index (CCIX). The biggest table called Analog. It holds ~37 billion rows and ~600 GB of high compressed data. Based on our initial estimate using smaller table, a 150 GB CCIX table could expand to 5 TB, so we provisioned additional 29 TB for this rebuild.

SQL Server Enterprise 2016
Recovery mode = Simple

Why do we have to do this?

In the first version of our ETL service, the data are not consolidated and sorted first before they are loaded to CCIX. This results to non-optimal use of CCIX as many row segments are not fully filled before they are compressed. An optimal segment must have compressed 1,0485,76 rows and must be sorted in Time so time-based queries can get the most row-group elimination in sql server and less segments to process. We’re learn more CCIX as we go and as more documentatons are available.

Excerpt of my alignment is here
https://drive.google.com/file/d/0BzGLNskaj70UQUtZYW9CZF9iUUk/view?usp=sharing

The new ETL guarantees that data are consolidated and sorted in time before they are loaded to CCIX. So future loads will be sorted, we are concern here with existing data loaded during First Load.

For full case description, please read here.
https://drive.google.com/open?id=0BzGLNskaj70URmZURlVDWVNYd2M

This is our second attempt and I can see the log file is still piling up despite the new script having partition-based index rebuild. We need to keep this in check.

My questions are:

I already executed the 2nd attempt (in progress), is it possible to check which partitions are already processed and sorted? I have OFFLINE = ON in the rebuild script.
How can we manage the size of transaction log while a partion-based index rebuild is in progress? We need to keep this in check and regularly truncated every partition if possible.
The index rebuild failed initially due to insufficient space in log file. We added a new log file and use a separate disk. But how we make make sure this wont happen again in this second run? Will the partition-based rebuild sufficient to guarantee we can rebuild successfully?

Appreciate your help.

Rebuild scripts by partition (2nd attempt) – IN PROGRESS

Elapsed: 17 hrs, 34 mins

--create a new clustered row index (CRI)
CREATE CLUSTERED INDEX [Analog_ColumnStoreIndex] 
ON [dbo].[Analog]
(
    [LogTime] ASC,
    [CTDIID] ASC,
    [WindFarmId] ASC,
    [StationId] ASC
) 
WITH (
    DROP_EXISTING = ON,
    PAD_INDEX = OFF, 
    STATISTICS_NORECOMPUTE = OFF, 
    SORT_IN_TEMPDB = OFF,
    ALLOW_ROW_LOCKS = ON, 
    ALLOW_PAGE_LOCKS = ON,
    ONLINE = OFF
) 
ON [AnalogMonthlyPScheme]([LogTime])
GO

Best Answer

I would process this one partition at a time. At a high level:

Create a new empty target table with a partitioned clustered rowstore index
Create a working table with an unpartitioned clustered columnstore index
SWITCH partition 1 from the source table to the working table
Convert the working table to a rowstore clustered index using DROP_EXISTING
Add a CHECK constraint to the working table that covers the range of values in partition 1
SWITCH the working table into partition 1 of the target table
Drop the empty working table
Repeat from step 2 for the next partition
Once all partitions are processed, drop the original table, and rename the new

This makes maximum use of minimally-logged operations, minimizes workspace, and allows the transaction log to be cleared between partitions if necessary.

Demo

Partitioning

CREATE PARTITION FUNCTION [AnalogMonthlyP] (datetime2(0))
AS RANGE RIGHT FOR VALUES(
N'2014-09-01T00:00:00.000', N'2014-10-01T00:00:00.000', N'2014-11-01T00:00:00.000',
N'2014-12-01T00:00:00.000', N'2015-01-01T00:00:00.000', N'2015-02-01T00:00:00.000',
N'2015-03-01T00:00:00.000', N'2015-04-01T00:00:00.000', N'2015-05-01T00:00:00.000',
N'2015-06-01T00:00:00.000', N'2015-07-01T00:00:00.000', N'2015-08-01T00:00:00.000',
N'2015-09-01T00:00:00.000', N'2015-10-01T00:00:00.000', N'2015-11-01T00:00:00.000',
N'2015-12-01T00:00:00.000', N'2016-01-01T00:00:00.000', N'2016-02-01T00:00:00.000',
N'2016-03-01T00:00:00.000', N'2016-04-01T00:00:00.000', N'2016-05-01T00:00:00.000',
N'2016-06-01T00:00:00.000', N'2016-07-01T00:00:00.000', N'2016-08-01T00:00:00.000',
N'2016-09-01T00:00:00.000', N'2016-10-01T00:00:00.000', N'2016-11-01T00:00:00.000',
N'2016-12-01T00:00:00.000', N'2017-01-01T00:00:00.000', N'2017-02-01T00:00:00.000',
N'2017-03-01T00:00:00.000', N'2017-04-01T00:00:00.000', N'2017-05-01T00:00:00.000',
N'2017-06-01T00:00:00.000', N'2017-07-01T00:00:00.000', N'2017-08-01T00:00:00.000',
N'2017-09-01T00:00:00.000', N'2017-10-01T00:00:00.000', N'2017-11-01T00:00:00.000',
N'2017-12-01T00:00:00.000', N'2018-01-01T00:00:00.000', N'2018-02-01T00:00:00.000',
N'2018-03-01T00:00:00.000', N'2018-04-01T00:00:00.000', N'2018-05-01T00:00:00.000',
N'2018-06-01T00:00:00.000', N'2018-07-01T00:00:00.000', N'2018-08-01T00:00:00.000',
N'2018-09-01T00:00:00.000', N'2018-10-01T00:00:00.000', N'2018-11-01T00:00:00.000',
N'2018-12-01T00:00:00.000');

CREATE PARTITION SCHEME [AnalogMonthlyPScheme]
AS PARTITION [AnalogMonthlyP]
ALL TO ([PRIMARY]);

Numbers table setup (if required)

WITH Ten(N) AS 
(
    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
)   
SELECT TOP (10000000) 
    n = IDENTITY(int, 1, 1)
INTO   dbo.Numbers
FROM   Ten T10,
       Ten T100,
       Ten T1000,
       Ten T10000,
       Ten T100000,
       Ten T1000000,
       Ten T10000000;

ALTER TABLE dbo.Numbers
ADD CONSTRAINT PK_dbo_Numbers_n
PRIMARY KEY CLUSTERED (n)
WITH (SORT_IN_TEMPDB = ON, MAXDOP = 1, FILLFACTOR = 100);

Source table and data

-- Source table currently partitioned clustered columnstore
CREATE TABLE [dbo].[Analog]
(
    LogTime datetime2(0) NOT NULL,
    CTDID integer NOT NULL,
    WindFarmId integer NOT NULL,
    StationId integer NOT NULL,

    INDEX [Analog_ColumnStoreIndex]
        CLUSTERED COLUMNSTORE
)
ON [AnalogMonthlyPScheme](LogTime);

-- Some test data
INSERT dbo.Analog WITH (TABLOCKX)
    (LogTime, CTDID, WindFarmId, StationId)
SELECT
    DATEADD(SECOND, N.n, '20140801'),
    CHECKSUM(NEWID()),
    CHECKSUM(NEWID()),
    CHECKSUM(NEWID())
FROM dbo.Numbers AS N -- ten million rows
OPTION (MAXDOP 1);

Target table and work table

-- The final table we want - partitioned rowstore clustered index
CREATE TABLE [dbo].[AnalogNew]
(
    LogTime datetime2(0) NOT NULL,
    CTDID integer NOT NULL,
    WindFarmId integer NOT NULL,
    StationId integer NOT NULL,

    INDEX CCIX_Analog
        CLUSTERED (LogTime, CTDID, WindFarmId, StationId)
)
ON [AnalogMonthlyPScheme](LogTime);

-- Working table
CREATE TABLE dbo.AnalogSwitch
(
    LogTime datetime2(0) NOT NULL,
    CTDID integer NOT NULL,
    WindFarmId integer NOT NULL,
    StationId integer NOT NULL,

    INDEX CCI CLUSTERED COLUMNSTORE
);

Partition 1

-- Process the first partition
ALTER TABLE dbo.Analog 
SWITCH PARTITION 1 
TO dbo.AnalogSwitch;

-- Optional, may help parallel plan
CREATE STATISTICS s ON dbo.AnalogSwitch (LogTime);

-- Convert to rowstore clustered index
CREATE CLUSTERED INDEX CCI
ON dbo.AnalogSwitch (LogTime, CTDID, WindFarmId, StationId)
WITH (DROP_EXISTING = ON);

-- Specify data range
ALTER TABLE dbo.AnalogSwitch
    ADD CONSTRAINT CK 
    CHECK (LogTime >= '20140801' AND LogTime < '20140901');

-- Move to target table
ALTER TABLE dbo.AnalogSwitch 
SWITCH TO dbo.AnalogNew 
PARTITION 1;

-- Recreate switch table
DROP TABLE dbo.AnalogSwitch;

CREATE TABLE dbo.AnalogSwitch
(
    LogTime datetime2(0) NOT NULL,
    CTDID integer NOT NULL,
    WindFarmId integer NOT NULL,
    StationId integer NOT NULL,

    INDEX CCI CLUSTERED COLUMNSTORE
);

Partition 2

-- Process the second partition
ALTER TABLE dbo.Analog 
SWITCH PARTITION 2
TO dbo.AnalogSwitch;

-- Optional, may help parallel plan
CREATE STATISTICS s ON dbo.AnalogSwitch (LogTime);

-- Convert to rowstore clustered index
CREATE CLUSTERED INDEX CCI
ON dbo.AnalogSwitch (LogTime, CTDID, WindFarmId, StationId)
WITH (DROP_EXISTING = ON);

-- Specify data range
ALTER TABLE dbo.AnalogSwitch
    ADD CONSTRAINT CK 
    CHECK (LogTime >= '20140901' AND LogTime < '20141001');

-- Move to target table
ALTER TABLE dbo.AnalogSwitch 
SWITCH TO dbo.AnalogNew 
PARTITION 2;

-- Recreate switch table
DROP TABLE dbo.AnalogSwitch;

CREATE TABLE dbo.AnalogSwitch
(
    LogTime datetime2(0) NOT NULL,
    CTDID integer NOT NULL,
    WindFarmId integer NOT NULL,
    StationId integer NOT NULL,

    INDEX CCI CLUSTERED COLUMNSTORE
);

...and so on. This is fairly straightforward to script.

Best Answer

Demo

Partitioning

Numbers table setup (if required)

Source table and data

Target table and work table

Partition 1

Partition 2

Related Solutions

Sql-server – SQL Server — optimization techniques for rebuilding very large table

Sql-server – Why 2 allocation units during online index rebuild to different file group

Related Question