SQL Server – Updating a table with more than 850 million rows of data

locking, sql-server, sql-server-2008-r2

I have been tasked with writing an update query to update a table with more than 850 million rows of data. Here are the table structures:

Source tables:

    CREATE TABLE [dbo].[SourceTable1](
    [ProdClassID] [varchar](10) NOT NULL,
    [PriceListDate] [varchar](8) NOT NULL,
    [PriceListVersion] [smallint] NOT NULL,
    [MarketID] [varchar](10) NOT NULL,
    [ModelID] [varchar](20) NOT NULL,
    [VariantId] [varchar](20) NOT NULL,
    [VariantType] [tinyint] NULL,
    [Visibility] [tinyint] NULL,
 CONSTRAINT [PK_SourceTable1] PRIMARY KEY CLUSTERED 
(
    [VariantId] ASC,
    [ModelID] ASC,
    [MarketID] ASC,
    [ProdClassID] ASC,
    [PriceListDate] ASC,
    [PriceListVersion] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, 
IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, 
ALLOW_PAGE_LOCKS  = ON, FILLFACTOR = 90)
    )

CREATE TABLE [dbo].[SourceTable2](
    [Id] [uniqueidentifier] NOT NULL,
    [ProdClassID] [varchar](10) NULL,
    [PriceListDate] [varchar](8) NULL,
    [PriceListVersion] [smallint] NULL,
    [MarketID] [varchar](10) NULL,
    [ModelID] [varchar](20) NULL,
 CONSTRAINT [PK_SourceTable2] PRIMARY KEY CLUSTERED 
(
    [Id] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, 
IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, 
ALLOW_PAGE_LOCKS  = ON, FILLFACTOR = 91) ON [PRIMARY]
    ) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]

SourceTable1 contains 52 million rows of data and SourceTable2 contains 400,000 rows of data.

Here is the TargetTable structure

CREATE TABLE [dbo].[TargetTable](
    [ChassisSpecificationId] [uniqueidentifier] NOT NULL,
    [VariantId] [varchar](20) NOT NULL,
    [VariantType] [tinyint] NULL,
    [Visibility] [tinyint] NULL,
 CONSTRAINT [PK_TargetTable] PRIMARY KEY CLUSTERED 
(
    [ChassisSpecificationId] ASC,
    [VariantId] ASC
    )WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, 
ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON, FILLFACTOR = 71) ON [PRIMARY]
    ) ON [PRIMARY]

The relationships between these tables are as follows:

  • SourceTable1.VariantID is related to TargetTable.VariantID
  • SourceTable2.ID is related to TargetTable.ChassisSpecificationId

The update requirement is as follows:

  1. Get the values for VariantType and Visibility from SourceTable1 for each VariantID, taking the row with the maximum value in the PriceListVersion column.
  2. Get the value of the ID column from SourceTable2 where the values of ModelID, ProdClassID, PriceListDate and MarketID match with that of SourceTable1.
  3. Update TargetTable with the values for VariantType and Visibility where
    ChassisSpecificationId matches SourceTable2.ID and VariantId matches SourceTable1.VariantID.
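
For reference, the three steps above can also be expressed as a single set-based statement. This is only a sketch of the intent, not a statement I would run as-is against 850 million rows; it uses ROW_NUMBER() to pick the row with the highest PriceListVersion per variant, which avoids the duplicates a GROUP BY over VariantType and Visibility can produce:

```sql
-- Sketch only: the three requirement steps as one set-based UPDATE.
-- ROW_NUMBER() picks the latest PriceListVersion per (spec, variant) pair.
;WITH LatestVariant AS
(
    SELECT  S1.VariantId,
            S1.VariantType,
            S1.Visibility,
            S2.Id,
            ROW_NUMBER() OVER (PARTITION BY S2.Id, S1.VariantId
                               ORDER BY S1.PriceListVersion DESC) AS rn
    FROM    dbo.SourceTable1 S1
    INNER JOIN dbo.SourceTable2 S2
            ON  S2.ModelID       = S1.ModelID
            AND S2.ProdClassID   = S1.ProdClassID
            AND S2.PriceListDate = S1.PriceListDate
            AND S2.MarketID      = S1.MarketID
)
UPDATE  T
SET     T.VariantType = LV.VariantType,
        T.Visibility  = LV.Visibility
FROM    dbo.TargetTable T
INNER JOIN LatestVariant LV
        ON  T.ChassisSpecificationId = LV.Id
        AND T.VariantId              = LV.VariantId
WHERE   LV.rn = 1
        AND T.VariantType IS NULL
        AND T.Visibility IS NULL;
```

On a table this size, a single statement like this would hold locks for a long time, which is why I am looping instead.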

The challenge is to do this update on live production, with minimum locking. Here is the query I have put together.

-- Drop the temp table if it already exists in this session
-- (OBJECT_ID is session-scoped; a LIKE '#CSpec%' check against
-- tempdb.sys.tables can match another session's temp table)
IF OBJECT_ID('tempdb..#CSpec') IS NOT NULL
BEGIN
    DROP TABLE #CSpec;
END;

-- Create Temp table to assign sequence numbers
CREATE TABLE #CSpec
(
    RowID int,
    ID uniqueidentifier,
    PriceListDate varchar(8),
    ProdClassID varchar(10),
    ModelID varchar(20),
    MarketID varchar(10)
);

-- Populate temp table 
INSERT INTO #CSpec
SELECT ROW_NUMBER() OVER (ORDER BY MarketID) RowID,
       CS.id, 
       CS.pricelistdate, 
       CS.prodclassid, 
       CS.modelid, 
       CS.marketid 
FROM   dbo.SourceTable2 CS 
WHERE CS.MarketID IS NOT NULL;

-- Declare variables to hold values used for updates
DECLARE @min            int, 
        @max            int,
        @ID             uniqueidentifier,
        @PriceListDate  varchar(8),
        @ProdClassID    varchar(10),
        @ModelID        varchar(20),
        @MarketID       varchar(10);
-- Set minimum and maximum values for looping
SET @min = 1;
SET @max = (SELECT MAX(RowID) From #CSpec);

-- Populate other variables in a loop
WHILE @min <= @max
BEGIN
    SELECT 
        @ID = ID,
        @PriceListDate = PriceListDate,
        @ProdClassID = ProdClassID,
        @ModelID = ModelID,
        @MarketID = MarketID
    FROM #CSpec
    WHERE RowID = @min;  

-- Use CTE to get relevant values from SourceTable1 
    ;WITH Variant_CTE AS
    (
    SELECT  V.variantid, 
            V.varianttype, 
            V.visibility,
            MAX(V.PriceListVersion) LatestPriceVersion
    FROM    SourceTable1 V 
    WHERE       V.ModelID = @ModelID
            AND V.ProdClassID = @ProdClassID
            AND V.PriceListDate = @PriceListDate
            AND V.MarketID = @MarketID
    GROUP BY
            V.variantid, 
            V.varianttype, 
            V.visibility
    )

-- Update the TargetTable with the values obtained in the CTE
    UPDATE      SV 
        SET     SV.VariantType = VC.VariantType, 
                SV.Visibility = VC.Visibility
    FROM        TargetTable SV 
    INNER JOIN  Variant_CTE VC
    ON          SV.VariantId = VC.VariantId
    WHERE       SV.ChassisSpecificationId = @ID
                AND SV.VariantType IS NULL
                AND SV.Visibility IS NULL;

    -- Increment the value of loop variable
    SET @min = @min+1;
END
-- Clean up
DROP TABLE #CSpec

It takes about 30 seconds when I limit the loop to 10 iterations by hardcoding the value of the @max variable. However, when I increase the limit to 50 iterations, it takes almost 4 minutes to complete. I am concerned that 400,000 iterations will run for multiple days on production. That might still be acceptable, as long as the TargetTable does not get locked, preventing users from accessing it.

All inputs are welcome.

Thanks,
Raj

Best Answer

To speed things up, you could try

  • Add a primary key on #CSpec.RowID so you don't scan it every iteration
  • Change the CTE to a temp table with a suitable PK (see the next point too)
  • Add an index on SourceTable1 to match the CTE's WHERE clause: currently the PK will be scanned, meaning all 52 million SourceTable1 rows are read on every iteration
  • SourceTable2.MarketID also has no index, but I wouldn't worry about that because it is scanned only once (as I understand it)
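
A sketch of the first and third suggestions as DDL; the index names are illustrative, and the INCLUDE list is my assumption of what would make the index covering for the CTE:

```sql
-- 1. Primary key on the temp table's RowID, so each loop iteration
--    does a seek instead of a scan.
ALTER TABLE #CSpec
ADD CONSTRAINT PK_CSpec PRIMARY KEY CLUSTERED (RowID);

-- 3. Nonclustered index on SourceTable1 matching the CTE's WHERE clause,
--    with the selected columns included so the query never touches
--    the clustered index.
CREATE NONCLUSTERED INDEX IX_SourceTable1_Lookup
ON dbo.SourceTable1 (ModelID, ProdClassID, PriceListDate, MarketID)
INCLUDE (VariantId, VariantType, Visibility, PriceListVersion);
```

Creating the nonclustered index on 52 million rows is itself expensive, so build it before the update run, not during it.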

The query plans here should show a lot of scans because you have poor indexes for the operations you are doing.

Target table indexing appears OK

Another observation: uniqueidentifier and varchar are bad choices for clustered index keys (your PKs here): too wide, not ever-increasing, and varchar keys carry the overhead of collation comparisons, at least

Edit, another observation (thanks to @Marian)

Your clustered index key is generally wide. Every non-clustered index carries the clustered key in each of its rows, which means huge NC indexes too

You could probably achieve the same result by reordering the clustered PK.
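
For example (a sketch only, assuming the lookup pattern in the question is the dominant access path), SourceTable1's clustered PK could be rebuilt so the loop's WHERE-clause columns lead the key:

```sql
-- Sketch: reorder SourceTable1's clustered PK so the columns the loop
-- filters on come first. Rebuilding the clustered index on 52 million
-- rows is expensive and belongs in a maintenance window.
ALTER TABLE dbo.SourceTable1 DROP CONSTRAINT PK_SourceTable1;

ALTER TABLE dbo.SourceTable1
ADD CONSTRAINT PK_SourceTable1 PRIMARY KEY CLUSTERED
(
    ModelID,
    ProdClassID,
    PriceListDate,
    MarketID,
    VariantId,
    PriceListVersion
) WITH (FILLFACTOR = 90);
```

With this ordering the per-iteration lookup becomes a range seek on the clustered index, so no extra nonclustered index is needed for it.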