SQL Server – Efficiently remove 90% of a table's data

delete, performance, sql-server, sql-server-2008-r2

I am working on a process to remove 90% of a table's data because only 10% is needed for testing.

The best way I have found involves storing 10% of the table's rows in a temporary table.

Current method

SELECT TOP 10 PERCENT *
INTO #temp_some_table
FROM some_table WITH (NOLOCK)
ORDER BY some_column DESC;

TRUNCATE TABLE some_table;

INSERT INTO some_table
SELECT *
FROM #temp_some_table;

DROP TABLE #temp_some_table;

This method is filling up tempdb and causing the disk to fill up as well.

Questions

Is there a more efficient way to delete 90% of a table's data,
e.g. DELETE TOP 90 PERCENT FROM some_table?
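(For what it's worth, DELETE does in fact accept a TOP (n) PERCENT clause, although it does not allow an ORDER BY, so which 90% gets removed would be arbitrary rather than the oldest rows:)

```sql
-- Valid T-SQL syntax, but no ORDER BY is allowed here,
-- so the 90% that is deleted is arbitrary.
DELETE TOP (90) PERCENT
FROM dbo.some_table;
```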

or

Is there a way to insert 10% of some_table's data into a temporary table using batches? Something like this:

DECLARE @r INT = 1; -- must be initialized, or WHILE @r > 0 never runs

WHILE @r > 0
BEGIN

    BEGIN TRANSACTION;

    -- Note: as written this keeps re-selecting the same top 10 percent;
    -- already-copied rows would need to be excluded for the loop to terminate.
    INSERT INTO [dbo].[##temp_cds_Basket]
    SELECT TOP 10 PERCENT *
    FROM [dbo].[cds_basket] s;

    SET @r = @@ROWCOUNT;
    PRINT @r;

    COMMIT TRANSACTION;

END

Possible solution

How about this?

SET NOCOUNT ON;

DECLARE @r INT;
DECLARE @TenPercentDate DATETIME;

;WITH cte (some_column) AS
(
    SELECT TOP 10 PERCENT some_column
    FROM some_table WITH (NOLOCK)
    ORDER BY some_column DESC
)
SELECT @TenPercentDate = MIN(some_column)
FROM cte;

SELECT @TenPercentDate;

SET @r = 1;

WHILE @r > 0
BEGIN
    BEGIN TRANSACTION;

    DELETE TOP (10000)
    FROM some_table
    WHERE some_column < @TenPercentDate;

    SET @r = @@ROWCOUNT;
    PRINT @r;

    COMMIT TRANSACTION;

    --CHECKPOINT;    -- if using the simple recovery model

END

--ROLLBACK

Best Answer

I have three different ideas that I mentioned in a comment above. Here is a little elaboration on at least one of them (which you are stuck on due to self-diagnosed tunnel vision).

  1. Well, you could calculate the number of rows that make up 10% beforehand, and then compare that in your batch.

I was thinking about this but we don't really need to do the compare - we can just figure out what datetime value is at the edge of what we want to keep, and delete chunks of older rows until there are none left. For example:

SET NOCOUNT ON;

DECLARE 
  @rc INT = 1, 
  @cutoff DATETIME, 
  @batchsize INT = 10000;

;WITH x(dt) AS 
(
 SELECT TOP (10) PERCENT datetime_column
  FROM dbo.mytable 
  ORDER BY datetime_column DESC
)
SELECT TOP (1) @cutoff = dt -- earliest row we want to keep
 FROM x
 ORDER BY dt;

WHILE @rc > 0
BEGIN
  DELETE TOP (@batchsize) dbo.mytable
    WHERE datetime_column < @cutoff;
  SET @rc = @@ROWCOUNT;
END

You can see this post for some additional ways to enhance this.
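One common enhancement to batched deletes in general (my own suggestion, not necessarily what that post covers) is to pause between batches so log backups and concurrent workloads can keep up:

```sql
-- Variant of the loop above with a throttle between batches.
-- The one-second delay is an arbitrary placeholder; tune to taste.
WHILE @rc > 0
BEGIN
  DELETE TOP (@batchsize) dbo.mytable
    WHERE datetime_column < @cutoff;
  SET @rc = @@ROWCOUNT;
  WAITFOR DELAY '00:00:01';
END
```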

  2. You could consider drop / select into instead of truncate / insert, and also think about using bulk-logged recovery during this operation.

I think this would be a valid option and should cause less pain than the above.
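A rough sketch of that approach (database, table, and column names below are placeholders; note that BULK_LOGGED affects point-in-time log restores, so coordinate with your backup strategy):

```sql
-- Sketch only: minimally logged SELECT ... INTO under BULK_LOGGED recovery.
-- MyDb, dbo.some_table, and some_column are placeholder names.
ALTER DATABASE MyDb SET RECOVERY BULK_LOGGED;

SELECT TOP (10) PERCENT *
INTO dbo.some_table_keep       -- new table; SELECT INTO is minimally logged here
FROM dbo.some_table
ORDER BY some_column DESC;     -- newest 10%

DROP TABLE dbo.some_table;
EXEC sys.sp_rename N'dbo.some_table_keep', N'some_table';

ALTER DATABASE MyDb SET RECOVERY FULL;   -- then take a log backup
```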

  3. You could also just perform one data movement operation:
SELECT TOP 10 PERCENT cols
  INTO dbo.newtable
  FROM dbo.oldtable
  ORDER BY datetime_column DESC; -- newest 10%, consistent with the batches above

DROP TABLE dbo.oldtable;

-- Note: @newname must not be schema-qualified, or the object ends up
-- literally named [dbo.oldtable].
EXEC sys.sp_rename N'dbo.newtable', N'oldtable', N'OBJECT';

This one is also less painful, but feel free to inject some debugging in there to be sure that you got the data you wanted before you perform the drop.

Note that in both of the latter cases there may be constraints, schema-bound views, etc. that prevent you from simply dropping the table. (An inbound foreign key is obviously not an issue, since you were able to truncate, but outbound foreign keys may still need to be dealt with.) There will also be ramifications for existing plans, though deleting 90% of the data should trigger a statistics update and invalidate plans anyway, so really no different. And as @Kenneth points out below, you will need to re-establish permissions on the new table, so you might want to be sure you can script those out in advance.
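Since those grants are harder to see once the old table is gone, one way to capture them beforehand is a query against the standard catalog views (dbo.oldtable is the placeholder name from above):

```sql
-- List explicit permissions granted on the old table so they can be
-- re-granted on the replacement after the rename.
SELECT dp.state_desc, dp.permission_name, pr.name AS grantee
FROM sys.database_permissions AS dp
JOIN sys.database_principals AS pr
  ON pr.principal_id = dp.grantee_principal_id
WHERE dp.class = 1 -- object or column
  AND dp.major_id = OBJECT_ID(N'dbo.oldtable');
```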