SQL Server Performance – Slow SQL Script for String Manipulation

performancequery-performancesql server

I have table containing 700,000 rows containing a string id field with values such as rec-232276-dup-0 and rec-354240-org. The rec- is constant but the other parts of the id can change.

I want to split this string up so that I have just the integer part and a boolean value depending if the next part is equal to dup or org (1 for dup, 0 for org) which I will insert back into the table.

I wrote the following loop to do this and functionally it works fine but when I run it over the full 700,000 rows it takes an inordinately long time (+12 hours and counting).

What have I done wrong to cause it to take this amount of time? Is it the string manipulation that causes this? What can I do to improve this?

Thanks for your help.

My script is as follows:

select id
into #ControlTable 
from [dbo].[original_test_dataset]

declare @TableID varchar(20)

while exists (select * from #ControlTable)
begin

   select @TableID = (select top 1 id
                    from #ControlTable
                    order by id asc)

   declare @duplicate bit
   declare @id_only varchar(10)

   --1. Find id only
   -- Trim off rec-
   set @id_only = REPLACE(@TableID,'rec-','')
   -- Find position of first - and then take everything before it
   set @id_only = LEFT(@id_only,CHARINDEX('-',@id_only,0))
   set @id_only = REPLACE(@id_only,'-','')
   UPDATE original_test_dataset set id_only = @id_only WHERE id = @TableID;

   --2. Find if duplicate
   IF(PATINDEX('%dup%',@TableID) = 0)   
        BEGIN
           -- No duplicate so original file
           UPDATE original_test_dataset set duplicate = 0 WHERE id = @TableID;
        END
   ELSE
       BEGIN
           -- Duplicate
           UPDATE original_test_dataset set duplicate = 1 WHERE id = @TableID;
       END

   delete #ControlTable
   where id = @TableID
END

drop table #ControlTable

Added table definition as requested:

CREATE TABLE [dbo].[original_test_dataset](
    [id] [varchar](50) NULL,
    [ FirstName] [varchar](50) NULL,
    [ LastName] [varchar](50) NULL,
    [ Phone1] [varchar](50) NULL,
    [ Phone2] [varchar](50) NULL,
    [ Phone3] [varchar](50) NULL,
    [ No] [varchar](50) NULL,
    [ Road] [varchar](50) NULL,
    [ Village] [varchar](50) NULL,
    [ Town] [varchar](50) NULL,
    [ PC] [varchar](50) NULL,
    [ County] [varchar](50) NULL,
    [ DOB] [varchar](50) NULL,
    [id_only] [varchar](10) NULL,
    [duplicate] [bit] NULL
) ON [PRIMARY]

GO

/****** Object:  Index [PK_ORIGINAL_TEST_DATASET_ID]    Script Date: 03/09/2014  07:47:19 ******/
CREATE CLUSTERED INDEX [PK_ORIGINAL_TEST_DATASET_ID] ON [dbo].[original_test_dataset]
(
    [id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON   [PRIMARY]
GO

Best Answer

You have a performance question so I recommend you read How to analyse SQL Server performance to begin with.

As about your script: you're doing it as slowly as possible. Row by painfully slow row. Not even a cursor slow, is worse than a cursor slow. What you have here is a single update:

update original_test_dataset
set duplicate= case when PATINDEX('%dup%', id)=0 then 0 else 1 end,
 id_only = REPLACE((LEFT(REPLACE(id,'rec-',''), CHARINDEX('-',REPLACE(id,'rec-',''),0)),'-','');

That's it, nothing else.

What you have in the original script is slow because:

it operates on row by row as opposed to sets. SQL is designed to manipulate sets, not rows.
it perform a lot of writes in standalone commits resulting in huge flush commit waits. Batch commit should be used.
it performs multiple unindexed end-to-end scans on #ControlTable (every operation on #ControlTable is an unindexed end-to-end scan, it should had an index on id, yes #temp tables do need indexes)
it does two updates per row instead of one update of two fields
I can't know whether the access to original_test_dataset is in any way optimized, in lack of an index on id is likely is not.

declare @TableID varchar(20) vs. [id] [varchar](50) NULL: you're silently truncating data. Basically, your script corrupts the table.

Related Solutions

SQL Server Hierarchical Order – Parent-Child Tree Hierarchical ORDER BY

OK, enough brain cells are dead.

SQL Fiddle

WITH cte AS
(
  SELECT 
    [ICFilterID], 
    [ParentID],
    [FilterDesc],
    [Active],
    CAST(0 AS varbinary(max)) AS Level
  FROM [dbo].[ICFilters]
  WHERE [ParentID] = 0
  UNION ALL
  SELECT 
    i.[ICFilterID], 
    i.[ParentID],
    i.[FilterDesc],
    i.[Active],  
    Level + CAST(i.[ICFilterID] AS varbinary(max)) AS Level
  FROM [dbo].[ICFilters] i
  INNER JOIN cte c
    ON c.[ICFilterID] = i.[ParentID]
)

SELECT 
  [ICFilterID], 
  [ParentID],
  [FilterDesc],
  [Active]
FROM cte
ORDER BY [Level];

Sql-server – Query to return fields of distinct values per key

Without knowing anything about the source data, perhaps this would do what you want?

USE Test;
GO
CREATE TABLE GENDER
(
    ORG INT NOT NULL
    , GENDER VARCHAR(1) NOT NULL
);

CREATE TABLE AGE
(
    ORG INT NOT NULL
    , AGE TINYINT
);

CREATE TABLE STATES
(
    ORG INT NOT NULL
    , STATENAME VARCHAR(255)
);

INSERT INTO Gender (ORG, GENDER) VALUES (1, 'M');
INSERT INTO Gender (ORG, GENDER) VALUES (1, 'F');
INSERT INTO Gender (ORG, GENDER) VALUES (2, 'M');
INSERT INTO Gender (ORG, GENDER) VALUES (2, 'F');
INSERT INTO Gender (ORG, GENDER) VALUES (3, 'M');
INSERT INTO Gender (ORG, GENDER) VALUES (3, 'F');

INSERT INTO AGE (ORG, AGE) VALUES (1,27);
INSERT INTO AGE (ORG, AGE) VALUES (1,28);
INSERT INTO AGE (ORG, AGE) VALUES (1,29);
INSERT INTO AGE (ORG, AGE) VALUES (1,30);
INSERT INTO AGE (ORG, AGE) VALUES (2,37);
INSERT INTO AGE (ORG, AGE) VALUES (2,38);
INSERT INTO AGE (ORG, AGE) VALUES (2,39);
INSERT INTO AGE (ORG, AGE) VALUES (2,40);
INSERT INTO AGE (ORG, AGE) VALUES (3, 2);

INSERT INTO STATES (ORG, STATENAME) VALUES (1,'FL');
INSERT INTO STATES (ORG, STATENAME) VALUES (1,'GA');
INSERT INTO STATES (ORG, STATENAME) VALUES (1,'MN');
INSERT INTO STATES (ORG, STATENAME) VALUES (1,'NM');
INSERT INTO STATES (ORG, STATENAME) VALUES (2,'FL');
INSERT INTO STATES (ORG, STATENAME) VALUES (2,'MN');
INSERT INTO STATES (ORG, STATENAME) VALUES (2,'NM');
INSERT INTO STATES (ORG, STATENAME) VALUES (3,'FL');
INSERT INTO STATES (ORG, STATENAME) VALUES (3,'GA');
INSERT INTO STATES (ORG, STATENAME) VALUES (3,'NM');

CREATE TABLE FACTS
(
    ORG INT NOT NULL
    , GENDER VARCHAR(1) NULL
    , AGE INT NULL
    , STATENAME VARCHAR(255) NULL
);

INSERT INTO FACTS (ORG, GENDER, AGE, STATENAME)
SELECT ORG, GENDER, NULL, NULL
FROM GENDER
GROUP BY ORG, GENDER
UNION ALL
SELECT ORG, NULL, AGE, NULL
FROM AGE
GROUP BY ORG, AGE
UNION ALL
SELECT ORG, NULL, NULL, STATENAME
FROM STATES;

SELECT *
FROM FACTS
ORDER BY ORG;

The results:

enter image description here

This will create a FACTS table that has all the data from several source tables. As @ypercube and @jon-seigel said, this really doesn't make much sense; perhaps we are missing something compelling about your setup.

If this is not what you were expecting, please provide the source tables, and any other pertinent details.

Best Answer

Related Solutions

SQL Server Hierarchical Order – Parent-Child Tree Hierarchical ORDER BY

Sql-server – Query to return fields of distinct values per key

Related Question