SQL Server – Index fragmentation while continuously processing

Tags: fragmentation, sql-server, sql-server-2005

SQL Server 2005

I need to continuously process about 350M records in a 900M-record table. The index backing the query I use to select records becomes badly fragmented as I process, and I have to stop processing to rebuild the index. Pseudo data model & query …

/**************************************/
CREATE TABLE [Table] 
(
    [PrimaryKeyId] [INT] IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
    [ForeignKeyId] [INT] NOT NULL,
    /* more columns ... */
    [DataType] [CHAR](1) NOT NULL,
    [DataStatus] [DATETIME] NULL,
    [ProcessDate] [DATETIME] NOT NULL,
    [ProcessThreadId] VARCHAR (100) NULL
);

CREATE NONCLUSTERED INDEX [Idx] ON [Table] 
(
    [DataType],
    [DataStatus],
    [ProcessDate],
    [ProcessThreadId]
);
/**************************************/

/**************************************/
WITH cte AS (
    SELECT TOP (@BatchSize) [PrimaryKeyId], [ProcessThreadId]
    FROM [Table] WITH ( ROWLOCK, UPDLOCK, READPAST )
    WHERE [DataType] = 'X'
    AND [DataStatus] IS NULL
    AND [ProcessDate] < DATEADD(m, -2, GETDATE()) -- older than 2 months
    AND [ProcessThreadId] IS NULL
)
UPDATE cte
SET [ProcessThreadId] = @ProcessThreadId;

SELECT * FROM [Table] WITH ( NOLOCK )
WHERE [ProcessThreadId] = @ProcessThreadId;
/**************************************/

Data content…
Although the column [DataType] is typed as CHAR(1), about 35% of all records equal 'X', with the remainder equaling 'A'.
Of the records where [DataType] equals 'X', about 10% have a NOT NULL [DataStatus] value.

The [ProcessDate] and [ProcessThreadId] columns will be updated for every record processed.
The [DataType] column is updated ('X' is changed to 'A') about 10% of the time.
The [DataStatus] column is updated less than 1% of the time.

For now my solution is to select the primary keys of all the records to process into a separate processing table. I delete the keys as I process them, so that as the index fragments I'm dealing with fewer records.
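A minimal sketch of this workaround (the table name and the final DELETE predicate are illustrative, not the real implementation):

/**************************************/
CREATE TABLE [ProcessingQueue]
(
    [PrimaryKeyId] [INT] NOT NULL PRIMARY KEY CLUSTERED
);

-- Load the keys of every record eligible for processing
INSERT INTO [ProcessingQueue] ([PrimaryKeyId])
SELECT [PrimaryKeyId]
FROM [Table]
WHERE [DataType] = 'X'
  AND [DataStatus] IS NULL
  AND [ProcessDate] < DATEADD(m, -2, GETDATE())
  AND [ProcessThreadId] IS NULL;

-- As each batch finishes, remove its keys so the remaining
-- working set keeps shrinking
DELETE FROM [ProcessingQueue]
WHERE [PrimaryKeyId] IN (/* keys just processed */);
/**************************************/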

However, this doesn't fit the workflow I want: these data should be processed continuously, without manual intervention and without significant downtime. I do anticipate quarterly downtime for housekeeping chores. But right now, without the separate processing table, I cannot get through even half of the data set before the fragmentation becomes bad enough to necessitate stopping and rebuilding the index.

Any recommendations for indexing or a different data model? Is there a pattern I need to research?
I have full control of the data model and the process software so nothing is off the table.

Best Answer

You are using a table as a queue, and your UPDATE is the dequeue operation. But the clustered index on the table is a poor choice for a queue. Using a table as a queue imposes quite stringent requirements on the table design: the clustered index must match the dequeue order, in this case likely ([DataType], [DataStatus], [ProcessDate]). You can implement the primary key as a nonclustered constraint, and drop the nonclustered index Idx, since the clustered key takes over its role.

Another important piece of the puzzle is to keep the row size constant during processing. You have declared ProcessThreadId as a VARCHAR(100), which means the row grows and shrinks as it is being 'processed', because the field value changes from NULL to non-NULL. This grow-and-shrink pattern causes page splits and fragmentation. I can't imagine a thread ID that needs to be VARCHAR(100); use a fixed-length type, perhaps an INT.
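If the table already exists, the column type can be changed in place. A sketch (this assumes any existing VARCHAR values are convertible to INT; the nonclustered index covers the column, so it has to go first):

/**************************************/
-- The existing nonclustered index includes [ProcessThreadId],
-- so it must be dropped before the column type can change
DROP INDEX [Idx] ON [Table];

-- Switch to a fixed-length type so updating the value
-- no longer grows the row
ALTER TABLE [Table] ALTER COLUMN [ProcessThreadId] INT NULL;
/**************************************/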

As a side note, you do not need to dequeue in two steps (an UPDATE followed by a SELECT). You can use the OUTPUT clause instead:

/**************************************/
CREATE TABLE [Table] 
(
    [PrimaryKeyId] [INT] IDENTITY(1,1) NOT NULL PRIMARY KEY NONCLUSTERED,
    [ForeignKeyId] [INT] NOT NULL,
    /* more columns ... */
    [DataType] [CHAR](1) NOT NULL,
    [DataStatus] [DATETIME] NULL,
    [ProcessDate] [DATETIME] NOT NULL,
    [ProcessThreadId] INT NULL
);

CREATE CLUSTERED INDEX [Cdx] ON [Table] 
(
    [DataType],
    [DataStatus],
    [ProcessDate]
);
/**************************************/

DECLARE @BatchSize INT, @ProcessThreadId INT;

/**************************************/
WITH cte AS (
    SELECT TOP (@BatchSize) [PrimaryKeyId], [ProcessThreadId] /* ... more columns */
    FROM [Table] WITH ( ROWLOCK, UPDLOCK, READPAST )
    WHERE [DataType] = 'X'
    AND [DataStatus] IS NULL
    AND [ProcessDate] < DATEADD(m, -2, GETDATE()) -- older than 2 months
    AND [ProcessThreadId] IS NULL
)
UPDATE cte
SET [ProcessThreadId] = @ProcessThreadId
OUTPUT DELETED.[PrimaryKeyId] /* ... more columns */ ;
/**************************************/

In addition, I would consider moving successfully processed items into a separate archive table. You want your queue table to hover near zero size; you do not want it to grow by retaining 'history' from old entries that are no longer needed. Alternatively, you could partition by [ProcessDate] (i.e. one current, active partition that acts as the queue and stores entries with a NULL ProcessDate, and another partition for everything non-NULL; or multiple partitions for the non-NULL data, if you want efficient deletes via switch-out for data that has passed the mandated retention period). If things get hot, you can additionally partition by [DataType] if it has enough selectivity, but that design would be really complicated, as it requires partitioning on a persisted computed column (a composite column that glues together [DataType] and [ProcessDate]).
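A sketch of the archive move, assuming a hypothetical [ArchiveTable] with the same columns as the queue table (the name and the "finished" predicate are illustrative):

/**************************************/
-- Move finished rows out of the queue table in one atomic
-- statement: DELETE removes them, OUTPUT INTO archives them
DELETE FROM [Table]
OUTPUT DELETED.[PrimaryKeyId],
       DELETED.[ForeignKeyId],
       DELETED.[DataType],
       DELETED.[DataStatus],
       DELETED.[ProcessDate],
       DELETED.[ProcessThreadId]
INTO [ArchiveTable] ([PrimaryKeyId], [ForeignKeyId], [DataType],
                     [DataStatus], [ProcessDate], [ProcessThreadId])
WHERE [ProcessThreadId] = @ProcessThreadId; -- rows this thread finished
/**************************************/

Because the DELETE and the archival happen in one statement, a crash cannot leave a row both processed and still sitting in the queue table.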