SQL Server – Updating Rows in Large Tables Efficiently

Tags: performance, query-performance, sql-server, update

I want to update one large table that has 83,423,460 rows and growing.

The query below takes 8 minutes to execute:

 UPDATE FPP_Invoice_Revenue
 SET    Till_Prev_Inv_Amt = Till_Prev_Inv_Amt_In_USD / 0.0285714286,
        Cur_Inv_Amt = Cur_Inv_Amt_In_USD / 0.0285714286,
        YTD_Inv_Amt = YTD_Inv_Amt_In_USD / 0.0285714286
 WHERE  SOW_Number = '20014378'

The table has one clustered index. I thought of disabling that index before the update and rebuilding it afterwards, but that did not work out either, because the rebuild itself takes a long time.

I have read that this can be made faster by dividing the update into smaller batches, but how can I split the above query that way?
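Something like the following is roughly what I imagine by "small parts", though I am not sure it is correct. The batch size of 50,000 is an arbitrary placeholder, and the extra predicate that skips rows a previous batch already converted is my own assumption; it would not handle rows where the USD amounts are NULL:

DECLARE @rows int = 1;

WHILE @rows > 0
BEGIN
    -- Update at most 50,000 rows per iteration so each transaction stays small.
    UPDATE TOP (50000) FPP_Invoice_Revenue
    SET    Till_Prev_Inv_Amt = Till_Prev_Inv_Amt_In_USD / 0.0285714286,
           Cur_Inv_Amt = Cur_Inv_Amt_In_USD / 0.0285714286,
           YTD_Inv_Amt = YTD_Inv_Amt_In_USD / 0.0285714286
    WHERE  SOW_Number = '20014378'
           -- Skip rows that already hold the converted value (assumed marker for "done").
           AND Till_Prev_Inv_Amt <> Till_Prev_Inv_Amt_In_USD / 0.0285714286;

    SET @rows = @@ROWCOUNT;
END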

DDL:

CREATE TABLE [dbo].[FPP_Invoice_Revenue](
    [Project_Code] [varchar](10) NOT NULL,
    [Project_Desc] [varchar](50) NULL,
    [SOW_Number] [varchar](10) NOT NULL,
    [SOW_Desc] [varchar](50) NULL,
    [Invoice_No] [varchar](50) NOT NULL,
    [Inv_Month] [int] NOT NULL,
    [Inv_Year] [int] NOT NULL,
    [Billing_Date] [smalldatetime] NULL,
    [Doc_Currency] [varchar](10) NULL,
    [Vertical] [varchar](255) NULL,
    [Till_Prev_Inv_Amt] [numeric](24, 10) NULL,
    [Cur_Inv_Amt] [numeric](24, 10) NULL,
    [YTD_Inv_Amt] [numeric](24, 10) NULL,
    [Till_Prev_Inv_Amt_In_USD] [numeric](24, 10) NULL,
    [Cur_Inv_Amt_In_USD] [numeric](24, 10) NULL,
    [YTD_Inv_Amt_In_USD] [numeric](24, 10) NULL,
    CONSTRAINT [PK_FPP_Invoice_Revenue] PRIMARY KEY CLUSTERED
    (
        [Project_Code] ASC,
        [SOW_Number] ASC,
        [Invoice_No] ASC,
        [Inv_Month] ASC,
        [Inv_Year] ASC
    ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF,
        ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90) ON [PRIMARY]
) ON [PRIMARY]
GO

SET ANSI_PADDING OFF
GO

Execution Plan:

[execution plan screenshot]

Best Answer

Adding an index on SOW_Number will allow SQL Server to very quickly identify the rows that need to be updated, without requiring a scan of the entire table (assuming a relatively small number of rows match the WHERE clause, making the index highly selective).

CREATE NONCLUSTERED INDEX IX_SOW_Number 
ON dbo.FPP_Invoice_Revenue(SOW_Number)
INCLUDE (
        Till_Prev_Inv_Amt_In_USD
        , Cur_Inv_Amt_In_USD
        , YTD_Inv_Amt_In_USD
    );

Even though you have SOW_Number defined as part of the clustered index, it is not the leading column of that index, which means SQL Server must scan the entire index looking for rows that satisfy the WHERE clause. Adding the three columns listed in the INCLUDE clause lets SQL Server avoid a lookup into the clustered index for those values, which it would otherwise need to do, as they are used in the calculations inside the SET clause.

I wouldn't normally suggest adding an index like this without evaluating the workload carefully; however, I'm assuming you don't have any indexes other than the clustered index shown in your script. In that case, adding a single index will most likely make a very big difference to the performance of your update without dramatically affecting the rest of your workload.

I set up a test bed on my laptop so I could compare the performance of the update with the suggested index against having no index on SOW_Number.

First, I created a new database TestDB for this test (I'm using SQL Server on Linux RC3):

USE master;
IF EXISTS (SELECT 1 FROM sys.databases d WHERE d.name = 'TestDB')
DROP DATABASE TestDB;
GO
CREATE DATABASE TestDB
ON (NAME = 'TestDB_Primary', FILENAME = '/var/opt/mssql/data/TestDB_Primary.mdf', SIZE = 5GB, MAXSIZE = 10GB, FILEGROWTH = 1GB)
LOG ON (NAME = 'TestDB_Log', FILENAME = '/var/opt/mssql/log/TestDB_Log.ldf', SIZE = 5GB, MAXSIZE = 10GB, FILEGROWTH = 1GB);
GO

USE TestDB;
CREATE TABLE dbo.FPP_Invoice_Revenue(
         Project_Code varchar(10) NOT NULL
       , Project_Desc varchar(50) NULL
       , SOW_Number varchar(10) NOT NULL
       , SOW_Desc varchar(50) NULL
       , Invoice_No varchar(50) NOT NULL
       , Inv_Month int NOT NULL
       , Inv_Year int NOT NULL
       , Billing_Date smalldatetime NULL
       , Doc_Currency varchar(10) NULL
       , Vertical varchar(255) NULL
       , Till_Prev_Inv_Amt numeric(24, 10) NULL
       , Cur_Inv_Amt numeric(24, 10) NULL
       , YTD_Inv_Amt numeric(24, 10) NULL
       , Till_Prev_Inv_Amt_In_USD numeric(24, 10) NULL
       , Cur_Inv_Amt_In_USD numeric(24, 10) NULL
       , YTD_Inv_Amt_In_USD numeric(24, 10) NULL
CONSTRAINT PK_FPP_Invoice_Revenue PRIMARY KEY CLUSTERED 
(
    Project_Code ASC
    , SOW_Number ASC
    , Invoice_No ASC
    , Inv_Month ASC
    , Inv_Year ASC
) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF,       
    ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90) ON [PRIMARY]
) ON [PRIMARY];
GO

Next, I inserted 10,000,000 rows to have a sufficiently large table. I know this is a bit smaller than your table, but it provides enough data to make reasonable extrapolations:

TRUNCATE TABLE dbo.FPP_Invoice_Revenue;
;WITH v AS (
       SELECT v.num
       FROM (VALUES (0), (1), (2), (3), (4), (5), (6), (7), (8), (9))v(num)
)
INSERT INTO dbo.FPP_Invoice_Revenue
SELECT Project_Code = CONVERT(varchar(10), ROW_NUMBER() OVER (ORDER BY v1.num, v2.num))
       , Project_Desc = CONVERT(varchar(50), CRYPT_GEN_RANDOM(50))
       , SOW_Number = v5.num /* only 10 distinct values */
       , SOW_Desc = CONVERT(varchar(50), CRYPT_GEN_RANDOM(50))
       , Invoice_No = CONVERT(varchar(50), CRYPT_GEN_RANDOM(50))
       , Inv_Month = ROW_NUMBER() OVER (ORDER BY v1.num, v2.num) % 12
       , Inv_Year = 2009 + ROW_NUMBER() OVER (ORDER BY v1.num, v2.num) % 8
       , Billing_Date = DATEFROMPARTS(2009 + ROW_NUMBER() OVER (ORDER BY v1.num, v2.num) % 8, 1, (ROW_NUMBER() OVER (ORDER BY v1.num, v2.num) % 12) + 1)
       , Doc_Currency = CONVERT(varchar(10), CRYPT_GEN_RANDOM(10))
       , Vertical = CONVERT(varchar(255), CRYPT_GEN_RANDOM(255))
       , Till_Prev_Inv_Amt = CONVERT(numeric(24, 10), 1)
       , Cur_Inv_Amt = CONVERT(numeric(24, 10), 2)
       , YTD_Inv_Amt = CONVERT(numeric(24, 10), 3)
       , Till_Prev_Inv_Amt_In_USD = CONVERT(numeric(24, 10), 4)
       , Cur_Inv_Amt_In_USD = CONVERT(numeric(24, 10), 5)
       , YTD_Inv_Amt_In_USD = CONVERT(numeric(24, 10), 6)
FROM v v1 --10 rows
       CROSS JOIN v v2 --100 rows
       CROSS JOIN v v3 --1000 rows
       CROSS JOIN v v4 --10000 rows
       CROSS JOIN v v5 --100000 rows
       CROSS JOIN v v6 --1000000 rows
       CROSS JOIN v v7;--10000000 rows

Next, I created the non-clustered index suggested above (with the INCLUDE list adjusted to the columns this test update reads):

CREATE NONCLUSTERED INDEX IX_SOW_Number
ON dbo.FPP_Invoice_Revenue(SOW_Number)
INCLUDE (Till_Prev_Inv_Amt, Cur_Inv_Amt, YTD_Inv_Amt);

This took about 1 minute 20 seconds to create on my laptop.

Now, the query:

UPDATE dbo.FPP_Invoice_Revenue
SET Till_Prev_Inv_Amt_In_USD = Till_Prev_Inv_Amt / 0.0285
    , Cur_Inv_Amt_In_USD = Cur_Inv_Amt / 0.0285
    , YTD_Inv_Amt_In_USD = YTD_Inv_Amt / 0.0285
WHERE SOW_Number = '6'; 

The test data is evenly distributed across 10 SOW_Number values, so WHERE SOW_Number = '6' means we'll update 1,000,000 rows:

(1000000 row(s) affected)
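If you want to confirm that distribution, a quick aggregate over the test table shows it (with this data each of the 10 values returns exactly 1,000,000 rows):

-- Row counts per SOW_Number in the test table.
SELECT SOW_Number, COUNT(*) AS row_count
FROM dbo.FPP_Invoice_Revenue
GROUP BY SOW_Number
ORDER BY SOW_Number;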

The actual execution plan for the update query:

[actual execution plan screenshot]

The execution statistics for this:

SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 0 ms.
Table 'FPP_Invoice_Revenue'. Scan count 1, logical reads 6567261, physical reads 8920, read-ahead reads 209603, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

(1000000 row(s) affected)

SQL Server Execution Times:
CPU time = 9376 ms, elapsed time = 13689 ms.

As you can see from the above, the query took only around 14 seconds to update 1,000,000 rows. If your SOW_Number predicate is more selective, the time required will be proportionally less.
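For anyone reproducing this, the timing and I/O figures above are the output you get from enabling session statistics before running the update:

SET STATISTICS IO ON;
SET STATISTICS TIME ON;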

As you've identified in your question, dropping the clustered index does not improve the performance of the update. Remember that all of the data in a table with a clustered index is in fact stored in the clustered index itself. Dropping and re-creating it would result in all 83 million rows being written to the transaction log while the data is moved out of and back into the clustered index; that is a lot of extra I/O that is not necessary for the update to succeed.
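If you want to see that log impact for yourself, you can snapshot log usage before and after such an operation and compare the numbers; something along these lines works:

-- Transaction log size and usage for the current database;
-- run before and after the index drop/rebuild and compare.
SELECT total_log_size_in_bytes / 1048576.0 AS total_log_mb,
       used_log_space_in_bytes / 1048576.0 AS used_log_mb,
       used_log_space_in_percent
FROM sys.dm_db_log_space_usage;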

You have SOW_Number defined as varchar(10); if the content of that column is always exactly 10 characters long, consider changing it to char(10), which avoids the two bytes of variable-length overhead per value. Consider the same for your other varchar columns. Also, does Vertical really need to allow 255 characters? And presumably Inv_Month and Inv_Year never need to hold values anywhere near 2,147,483,647, so you could likely convert them to smallint and save 4 bytes per row. Keeping in mind that the clustered index key columns are duplicated in every non-clustered index, the space savings from right-sizing the data types could be substantial.
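If you do go down that route, the shape of the change would be roughly the sketch below. This is illustrative only: it assumes the existing values fit the narrower types, and because SOW_Number, Inv_Month and Inv_Year are part of the clustered primary key, the key (and any non-clustered index referencing those columns, such as IX_SOW_Number above) has to be dropped and recreated around the column changes, which is itself an expensive operation you would schedule in a maintenance window on a table this size.

-- Illustrative sketch only; verify that all existing data fits the narrower types first.
-- Drop IX_SOW_Number first if you created it, and recreate it afterwards.
ALTER TABLE dbo.FPP_Invoice_Revenue DROP CONSTRAINT PK_FPP_Invoice_Revenue;

ALTER TABLE dbo.FPP_Invoice_Revenue ALTER COLUMN SOW_Number char(10) NOT NULL;
ALTER TABLE dbo.FPP_Invoice_Revenue ALTER COLUMN Inv_Month smallint NOT NULL;
ALTER TABLE dbo.FPP_Invoice_Revenue ALTER COLUMN Inv_Year smallint NOT NULL;

ALTER TABLE dbo.FPP_Invoice_Revenue ADD CONSTRAINT PK_FPP_Invoice_Revenue
    PRIMARY KEY CLUSTERED (Project_Code, SOW_Number, Invoice_No, Inv_Month, Inv_Year)
    WITH (FILLFACTOR = 90);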