SQL Server – transaction log growing out of control during maintenance

maintenance, sql-server, sql-server-2014, transaction-log

I have been asked to clean up leading and trailing spaces in several columns of a table. The table was imported from a flat file and has many mistyped rows.

The table has over 150 million rows.

So far I have attempted an UPDATE statement that trims the columns with LTRIM(RTRIM(columnname)), and I have also tried creating a temp table (heap) and doing an INSERT INTO ... SELECT using the same LTRIM(RTRIM(columnname)) syntax.
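Roughly, the single-statement attempt looked like this (table and column names are simplified placeholders, not my real schema):

UPDATE dbo.ImportedTable
SET Col1 = LTRIM(RTRIM(Col1))
    , Col2 = LTRIM(RTRIM(Col2));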

In every case the transaction log grows out of control to the point of running out of disk space.

I understand how transaction logs work fairly well. However, in the context of performing bulk maintenance like this, I am stumped.

Currently the database is in the bulk-logged recovery model. I am running transaction log backups every 30 seconds, and the transaction log continues to grow faster than I can back it up. I have researched how I might break this job up into batches, but cannot seem to formulate a query that does it effectively. I have an identity field I could key off of; all the other columns are varchar columns. Right now I'm thinking the answer is simply to get more disk space for the server, but I was hoping there was a better solution.
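For completeness, the log backup job runs a statement along these lines (database name and path are placeholders):

BACKUP LOG [MyDatabase]
TO DISK = N'X:\LogBackups\MyDatabase_log.trn';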

I am using Microsoft SQL Server 2014 Enterprise Edition.

The flat file is long gone; this is a table that has grown over the course of 10 years and many versions of SQL Server. Only smaller incremental batch updates are performed now, and I do believe new inserts are already cleansed before insertion. So I am fixing old stuff 🙁

Best Answer

You should be able to do batch updates to this table without causing the log to grow uncontrollably.

Here is a simple testbed to show how I'd approach this:

USE tempdb;

SET NOCOUNT ON;

IF OBJECT_ID(N'dbo.Table1', N'U') IS NOT NULL DROP TABLE dbo.Table1;
CREATE TABLE dbo.Table1
(
    id int NOT NULL
        PRIMARY KEY CLUSTERED
        IDENTITY(1,1)
    , COL1 nvarchar(150) NOT NULL
    , COL2 nvarchar(150) NOT NULL
) ON [PRIMARY];

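/* Generate dirty test data: cross join sys.syscolumns with
   itself and pad every name with leading and trailing spaces */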
INSERT INTO dbo.Table1 (COL1, COL2)
SELECT N'     ' + sc1.name + N'     '
    , N'     ' + sc2.name + N'     '
FROM sys.syscolumns sc1
    CROSS JOIN sys.syscolumns sc2;

GO

This actually removes the leading and trailing spaces, 4,000 rows per batch:

DECLARE @LogFreePercent DECIMAL(5,2);

/* this temp table will hold all the ID values for
   rows we need to update
*/
IF OBJECT_ID(N'tempdb..#ids', N'U') IS NOT NULL 
DROP TABLE #ids;
CREATE TABLE #ids
(
    id int NOT NULL 
) ON [PRIMARY];
CREATE CLUSTERED INDEX #ids_cx ON #ids(id);

/* this temp table will hold all the ID values for
   rows we've already updated
*/
IF OBJECT_ID(N'tempdb..#fixed_ids', N'U') IS NOT NULL 
DROP TABLE #fixed_ids;
CREATE TABLE #fixed_ids
(
    id int NOT NULL 
) ON [PRIMARY];
CREATE CLUSTERED INDEX #fixed_ids_cx ON #fixed_ids(id);

/* Insert the ID values for rows with leading or trailing
   spaces.  This is not efficient, but should do the trick,
   and ensures we only update rows that actually have leading
   or trailing spaces.
*/
INSERT INTO #ids (id)
SELECT id
FROM dbo.Table1 WITH (NOLOCK)
WHERE COL1 LIKE ' %'
    OR COL1 LIKE '% '
    OR COL2 LIKE ' %'
    OR COL2 LIKE '% ';

/* Run a loop while we still have IDs that need processing
*/
WHILE EXISTS (SELECT TOP(1) 1 FROM #ids)
BEGIN
    /* Use a CTE to select 4000 rows to update from the
       temp table, outputting the IDs we've fixed into
       the #fixed_ids table.
    */
    ;WITH src AS (
        SELECT TOP(4000) Table1.*
        FROM dbo.Table1
            INNER JOIN #ids i ON i.id = Table1.id
    )
    UPDATE src
    SET COL1 = LTRIM(RTRIM(COL1))
        , COL2 = LTRIM(RTRIM(COL2))
    OUTPUT inserted.id
    INTO #fixed_ids;

    /* Show some status details
    */
    PRINT (CONVERT(nvarchar(30), GETDATE(), 120) + N' - updated ' 
        + CONVERT(nvarchar(20), @@ROWCOUNT) + ' rows.')
    SELECT @LogFreePercent = 100 - su.used_log_space_in_percent
    FROM sys.dm_db_log_space_usage su;
    PRINT (N'Log free percent: ' + CONVERT(nvarchar(10), @LogFreePercent));

    /* remove the processed rows from the temp tables
    */
    DELETE FROM #ids 
    WHERE EXISTS (SELECT 1 FROM #fixed_ids fi WHERE fi.id = #ids.id);
    DELETE FROM #fixed_ids;

    /* checkpoint the database so log records can be truncated.
       This works here because tempdb uses the SIMPLE recovery
       model; under BULK_LOGGED, truncation happens via log
       backups instead.
    */
    CHECKPOINT 1;
END
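In the question's database, which is in the bulk-logged recovery model, log space is freed by the log backups rather than by CHECKPOINT. A small tweak worth trying there is to pause inside the loop (just before the END) so the 30-second log backups can catch up; the 5-second delay here is illustrative, not tuned:

    /* Give the scheduled log backups time to run before
       starting the next batch (delay value is illustrative) */
    WAITFOR DELAY '00:00:05';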

Run this afterwards to confirm no rows still have leading or trailing spaces (it should return zero rows; the pipes make any remaining spaces visible):

SELECT [1] = '|' + t1.COL1 + '|'
    , [2] = '|' + t1.COL2 + '|'
FROM dbo.Table1 t1
WHERE t1.COL1 LIKE ' %'
    OR t1.COL1 LIKE '% '
    OR t1.COL2 LIKE ' %'
    OR t1.COL2 LIKE '% ';

You may notice I've got the dreaded WITH (NOLOCK) in the query that reads from the source table, dbo.Table1. This is to help prevent our process from blocking other ongoing transactions. Since we're only using it to build the list of IDs we want to operate on, this is an acceptable use of READ UNCOMMITTED.
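If you'd rather avoid the table hint, a minimal equivalent is to set the isolation level at the session level around the ID-collection step (this sketch assumes nothing beyond the tables already defined above):

SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;

INSERT INTO #ids (id)
SELECT id
FROM dbo.Table1
WHERE COL1 LIKE ' %'
    OR COL1 LIKE '% '
    OR COL2 LIKE ' %'
    OR COL2 LIKE '% ';

/* restore the default isolation level */
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;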

In my slow dev database, this processed 564,000 rows in just under three minutes, and the log never dipped below 97% free.
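If you want to watch log consumption from a second session while the loop runs, the same DMV the script polls works stand-alone, as does the older DBCC command:

SELECT su.used_log_space_in_percent
    , 100 - su.used_log_space_in_percent AS free_percent
FROM sys.dm_db_log_space_usage su;

DBCC SQLPERF(LOGSPACE);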