SQL Server Performance – Slow SQL Script for String Manipulation

performancequery-performancesql server

I have table containing 700,000 rows containing a string id field with values such as rec-232276-dup-0 and rec-354240-org. The rec- is constant but the other parts of the id can change.

I want to split this string up so that I have just the integer part and a boolean value depending if the next part is equal to dup or org (1 for dup, 0 for org) which I will insert back into the table.

I wrote the following loop to do this and functionally it works fine but when I run it over the full 700,000 rows it takes an inordinately long time (+12 hours and counting).

What have I done wrong to cause it to take this amount of time? Is it the string manipulation that causes this? What can I do to improve this?

Thanks for your help.

My script is as follows:

select id
into #ControlTable 
from [dbo].[original_test_dataset]

declare @TableID varchar(20)

while exists (select * from #ControlTable)
begin

   select @TableID = (select top 1 id
                    from #ControlTable
                    order by id asc)

   declare @duplicate bit
   declare @id_only varchar(10)

   --1. Find id only
   -- Trim off rec-
   set @id_only = REPLACE(@TableID,'rec-','')
   -- Find position of first - and then take everything before it
   set @id_only = LEFT(@id_only,CHARINDEX('-',@id_only,0))
   set @id_only = REPLACE(@id_only,'-','')
   UPDATE original_test_dataset set id_only = @id_only WHERE id = @TableID;

   --2. Find if duplicate
   IF(PATINDEX('%dup%',@TableID) = 0)   
        BEGIN
           -- No duplicate so original file
           UPDATE original_test_dataset set duplicate = 0 WHERE id = @TableID;
        END
   ELSE
       BEGIN
           -- Duplicate
           UPDATE original_test_dataset set duplicate = 1 WHERE id = @TableID;
       END

   delete #ControlTable
   where id = @TableID
END

drop table #ControlTable

Added table definition as requested:

CREATE TABLE [dbo].[original_test_dataset](
    [id] [varchar](50) NULL,
    [ FirstName] [varchar](50) NULL,
    [ LastName] [varchar](50) NULL,
    [ Phone1] [varchar](50) NULL,
    [ Phone2] [varchar](50) NULL,
    [ Phone3] [varchar](50) NULL,
    [ No] [varchar](50) NULL,
    [ Road] [varchar](50) NULL,
    [ Village] [varchar](50) NULL,
    [ Town] [varchar](50) NULL,
    [ PC] [varchar](50) NULL,
    [ County] [varchar](50) NULL,
    [ DOB] [varchar](50) NULL,
    [id_only] [varchar](10) NULL,
    [duplicate] [bit] NULL
) ON [PRIMARY]

GO

/****** Object:  Index [PK_ORIGINAL_TEST_DATASET_ID]    Script Date: 03/09/2014  07:47:19 ******/
CREATE CLUSTERED INDEX [PK_ORIGINAL_TEST_DATASET_ID] ON [dbo].[original_test_dataset]
(
    [id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON   [PRIMARY]
GO

Best Answer

You have a performance question so I recommend you read How to analyse SQL Server performance to begin with.

As about your script: you're doing it as slowly as possible. Row by painfully slow row. Not even a cursor slow, is worse than a cursor slow. What you have here is a single update:

update original_test_dataset
set duplicate= case when PATINDEX('%dup%', id)=0 then 0 else 1 end,
 id_only = REPLACE((LEFT(REPLACE(id,'rec-',''), CHARINDEX('-',REPLACE(id,'rec-',''),0)),'-','');

That's it, nothing else.

What you have in the original script is slow because:

  • it operates on row by row as opposed to sets. SQL is designed to manipulate sets, not rows.
  • it perform a lot of writes in standalone commits resulting in huge flush commit waits. Batch commit should be used.
  • it performs multiple unindexed end-to-end scans on #ControlTable (every operation on #ControlTable is an unindexed end-to-end scan, it should had an index on id, yes #temp tables do need indexes)
  • it does two updates per row instead of one update of two fields
  • I can't know whether the access to original_test_dataset is in any way optimized, in lack of an index on id is likely is not.

declare @TableID varchar(20) vs. [id] [varchar](50) NULL: you're silently truncating data. Basically, your script corrupts the table.