Sql-server – Patterns for Deleting large amounts of data

sql-server-2008-r2

I'm looking at how to delete a large amount of data from a handful of tables that lack the key I need to easily isolate the rows to delete. The situation is as follows:

I have an ID, say a RequestID. I've decided that all data entries for a particular RequestID are invalid, so I want to purge them from my tables.

Table A is a dimensional table that has my RequestID and it also has all the associated URL IDs for any particular RequestID.

Table B also contains dimensional data but does not have RequestID, so I have to use Table A to look up which records in Table B are valid delete candidates.

These tables are anywhere from 1 million to a billion rows, so the deletes have to be batched to work properly.

My thought was to do something like this but it doesn't seem very performant:

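-- Loop while any TableB row still matches the request being purged
WHILE EXISTS (SELECT 1 FROM TableB b JOIN TableA a ON a.URLID = b.URLID WHERE a.RequestID = <some_value>)
BEGIN
    -- TOP needs parentheses in a DELETE; the alias b is the delete target
    DELETE TOP (50000) b
    FROM TableB b
    JOIN TableA a ON b.URLID = a.URLID
    WHERE a.RequestID = <some_value>
END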

I'm not sure how I could delete the data more efficiently.

Edit: Sorry, I forgot to include RequestID in the delete code example.

Best Answer

Your original query was missing the RequestID filter. Hopefully you have indexes on URLID and RequestID.

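-- Prime @@ROWCOUNT with 1 so the loop body runs at least once
SELECT 1;
-- Each pass deletes up to 50,000 rows; the loop ends when a pass deletes nothing
WHILE (@@ROWCOUNT > 0)
BEGIN
  DELETE TOP (50000) b
  FROM TableB b
  JOIN TableA a
    ON b.URLID = a.URLID
   AND a.RequestID = @RequestID;
END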

Disabling foreign keys can help, but make sure you do not end up violating any of them.
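As a rough illustration of the indexing advice above, the following sketch creates one supporting index on each table. The table and column names come from the question; the index names are hypothetical, and whether these exact indexes suit your workload is an assumption you should check against the actual execution plan.

```sql
-- Hypothetical index names; TableA, TableB, RequestID and URLID come from the question.

-- Lets the delete find every URLID for one RequestID without scanning TableA.
CREATE NONCLUSTERED INDEX IX_TableA_RequestID_URLID
    ON TableA (RequestID, URLID);

-- Lets each batch seek into TableB by URLID instead of scanning it.
CREATE NONCLUSTERED INDEX IX_TableB_URLID
    ON TableB (URLID);
```

Building indexes on tables of this size is itself expensive, so it is probably worth checking first whether equivalent indexes already exist.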

Related Solutions

Sql-server – SQL Server 2008 R2 : multiple files and filegroup

1: yes.

2: It depends ;) Query execution is still driven by the optimizer. It won't parallelize small result sets.

3: What good would that do, given that you are still reading one backup file? ;)

4: Not running the files on one underlying SAN, and thus improving your IO budget?

The most brutal setup like that I have ever seen had nearly 30 files, all on separate SAN volumes (close to 200 hard disks in total). It was done that way because every LUN had a driver queue limit of 255 outstanding requests, a load the SAN (with a 32 gigabyte cache) barely noticed ;) That thing was pulling in nearly 1.5 gigabytes per second over multiple fiber connections.
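For reference, spreading a filegroup over several files on separate volumes might look roughly like the sketch below. The original question is not shown here, so the database name, filegroup name, logical file names, drive paths and sizes are all hypothetical placeholders.

```sql
-- All names, paths and sizes below are hypothetical placeholders.
ALTER DATABASE MyDatabase
    ADD FILEGROUP FG_Data;

-- One file per volume, so each file gets its own LUN and request queue.
ALTER DATABASE MyDatabase
    ADD FILE
        (NAME = N'MyDatabase_Data1',
         FILENAME = N'E:\SQLData\MyDatabase_Data1.ndf',
         SIZE = 10GB, FILEGROWTH = 1GB),
        (NAME = N'MyDatabase_Data2',
         FILENAME = N'F:\SQLData\MyDatabase_Data2.ndf',
         SIZE = 10GB, FILEGROWTH = 1GB)
    TO FILEGROUP FG_Data;
```

Giving the files the same size and growth settings helps SQL Server's proportional fill spread writes evenly across them.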

Sql-server – Is this a good strategy for importing a large amount of data and decomposing as an ETL

If you are confident in the integrity of the data being imported, it may be a good idea to disable all the constraints in your database before beginning your inserts and then re-enable them afterwards.

See this helpful Stack Overflow answer from a while back: Can foreign key constraints be temporarily disabled using T-SQL?

This will save you the headache of having to layer the inserts in an order that respects the existing constraints of the database you are loading into.
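A minimal sketch of that disable and re-enable pattern for a single table is shown below. The table name dbo.TargetTable is hypothetical, and the statements would need to be repeated for each table you load into.

```sql
-- dbo.TargetTable is a hypothetical name for one of the tables being loaded.

-- Disable all FOREIGN KEY and CHECK constraints on the table.
-- (PRIMARY KEY, UNIQUE and DEFAULT constraints cannot be disabled this way.)
ALTER TABLE dbo.TargetTable NOCHECK CONSTRAINT ALL;

-- ... run the bulk inserts here ...

-- Re-enable the constraints. WITH CHECK validates the existing rows,
-- so the statement fails if the load introduced any violations.
ALTER TABLE dbo.TargetTable WITH CHECK CHECK CONSTRAINT ALL;
```

Re-enabling with WITH CHECK takes longer than a plain CHECK CONSTRAINT ALL, but it leaves the constraints trusted, which the optimizer can take advantage of.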

As for the inserts themselves, I would avoid cursors. Not only are they slow, they also take up a large amount of memory and hold locks on the database. If you are cursoring through a very large number of rows, you also risk growing the database logs very quickly. If the server is only an average one, space may eventually become a concern. Consider a more set-based approach for the additional inserts your process needs.

For example, if you can, do this:

insert into t1 (col1)
SELECT col1 FROM t2

instead of this:

...
insert into t1 (col1) values ('foo');
insert into t1 (col1) values ('bar');
insert into t1 (col1) values (...);
...

Related Question