Sql-server – Delete rows from a large table with joins and aggregating (resulting in 1 % less rows in the big table)

sql serversql-server-2008-r2

I got a bit stuck on performance, so I thought about asking a bit of help. I have a working query, but unfortunatelly it feels rather slow and I know it's not as performant as it could be made (easily?). The situation is to delete rows from a table which contain around one million rows, by doing a join to another table likewise having around a million rows worth of data (these tables have about the same amount of rows). The result after removing rows from SomeList is that there's about 1 % less rows

The table structure is as follows

CREATE TABLE SomeList
(
    Id INT IDENTITY(1,1) PRIMARY KEY,
    TimeData DateTime NOT NULL
);

CREATE TABLE SomeListAuxData
(
    Id INT IDENTITY(1,1) PRIMARY KEY,
    CountData INT NOT NULL,
    SomeListId INT FOREIGN KEY REFERENCES SomeList(Id) ON DELETE CASCADE
);

And the query to delete as follows

DELETE
FROM SomeList 
WHERE TimeData < @someTime AND Id = @someListId AND
(
    SELECT COUNT(*)
    FROM SomeListAuxData
    WHERE SomeList.Id = SomeListAuxData.SomeListId
) > 0
AND
(
    SELECT MAX(CountData)
    FROM SomeListAuxData
    WHERE SomeList.Id = SomeListAuxData.SomeListId
) < @someValue;

Specifically this seem wasteful since I'm doing two subqueries, but I'm not sure how to go about the COUNT(*) and MAX(CountData) parts if I'd try joining, for instance.

This query is made in a program code where the someListIds are looped with same the @someValue (e.g. 2000) and @someTime (e.g. '2013-11-07 09:00:00.000') dates, so that too is rather slow. The query could be done for all the SomeList rows at once.

<edit: Also, I just learned that the code calling does have a list of pairs of type (@someListId, @someDate), which means the date isn't a constant. The input comes from the user interfaces and typically there may be even hundreds of such pairs.

Also, maybe scheduling a job at night may do the job. Though I'll need to checks Craig's advice which seem to be valid too. 🙂

<edit: I clarified what I meant by the "1 %" in my question. The point to make was that I don't think the data warrants collecting the non-removable rows to a #temptable, truncating SomeList and then moving the data from #temptable" back to SomeList. Craig's good point answered (partially) a different question, nevertheless it may be applicable. Sorry for my sloppy writing.

<edit 2013-11-08: The join conditions corrected as suggeted in comments.

Best Answer

You can create an inline function which does return only these rows which are to be deleted:

CREATE FUNCTION [uf_deletable_SomeLists]
(
    @someTime DATETIME,
    @someValue INT
)
RETURNS TABLE
AS
RETURN
(
    SELECT sl.[Id]
    FROM [SomeList] AS sl
    INNER JOIN [SomeListAuxData] AS sla
        ON sl.[Id] = sla.[SomeListId]
        AND sl.[TimeData] < @someTime
    GROUP BY sl.[Id]
    HAVING MAX(sla.[CountData]) < @someValue
)

If you would call the function on the following table set up

SELECT * FROM [SomeList]

Id          TimeData
----------- -----------------------
2           2013-11-06 16:55:50.280
3           2013-11-06 16:55:56.667
4           2013-11-06 16:56:24.217

SELECT * FROM [SomeListAuxData]

Id          CountData   SomeListId
----------- ----------- -----------
2           1           2
4           5           2
3           5           4
5           10          4

the result would look like this

SELECT * FROM [uf_deletable_SomeLists]('20140101', 10)

Id
-----------
2

Now you can combine the inline function with a delete statement using a join

DELETE [SomeList]
FROM [SomeList] AS sl
INNER JOIN [uf_deletable_SomeLists](@someTime, @someValue) AS d
    ON sl.[Id] = d.[Id]
    AND sl.[Id] = @someListId

If you want you can remove the @someListId, that will use the whole SomeList table. Or you add a second join with a temporary table containing SomeList Id's which you may want to delete.

You can include the percentage statement mentioned by Craig Efrein as well...

DELETE TOP(1) PERCENT [SomeList]
FROM [SomeList] AS sl (...)

To boost the select you could consider using an index. For my machine the execution plan was a bit better if I used the following index.

CREATE NONCLUSTERED INDEX [ix_SomeListAuxData_SomeListId] ON [SomeListAuxData]
    ( [SomeListId] ASC )
INCLUDE ( [CountData] )

UPDATE:
You can always take the query from the function and use it directly as a nested one for the delete statement. But creating the function gives you the option to reuse the query for example to check beforehand which would be deleted if you would call the delete. Incidentally the inline function is basically just a server-side stored parameterised query (I would have used a view if @someTime and @someValue would have been constants). The benefits of inline functions have been discussed in other posts or pages, i.e. see here.

Related Solutions

SQL Server – Methods to Speed Up a Huge DELETE FROM Table with No Clauses

What you can do is batch deletes like this:

SELECT 'Starting' --sets @@ROWCOUNT
WHILE @@ROWCOUNT <> 0
    DELETE TOP (xxx) MyTable

Where xxx is, say, 50000

A modification of this, if you want to remove a very high percentage of rows...

SELECT col1, col2, ... INTO #Holdingtable
           FROM MyTable WHERE ..some condition..

SELECT 'Starting' --sets @@ROWCOUNT
WHILE @@ROWCOUNT <> 0
    DELETE TOP (xxx) MyTable WHERE ...

INSERT MyTable (col1, col2, ...)
           SELECT col1, col2, ... FROM #Holdingtable

Sql-server – Indexing – Uniqueidentifier Foreign Key or Intermediary mapping table

Ok, I am making a lot of assumptions (INT instead of VARCHAR(50) being one of them) with this answer, so feel free to correct me if needed. The problem with option B is that it introduces a new join to relate Users to Alerts without any real added benefit. If joining on the UserID, it is best to index the UserID, so you can utilize seeks for your joins.

For Option A, UserID will be the clustering key (index key for the clustered index) on the Users table. UserID will be a nonclustered index key on Alerts table. This will cost 16 bytes per Alert.

For Option B, UserID will be the clustering key on the Users table. UserId will probably be the clustering key in UserMap too, to make joining more efficient. UserKey (assuming this is an INT) would then be a nonclustered index key on the Alerts table. This will cost 4 bytes per Alert. And 20 bytes per UserMap.

Looking at the big picture, one relationship, for Option A, costs 16 bytes of storage, and involves 1 join operation. Whereas, one relationship, for Option B, costs 24 bytes of storage, and involves 2 join operations.

Furthermore, there are a possibility of 340,282,366,920,938,000,000,000,000,000,000,000,000 uniqueidentifiers and only 4,294,967,296 INTs. Implementing a uniqueidentifier to INT map for a this type of relationship could cause unexpected results when you start reusing INTs.

The only reason for creating this type map table, is if you plan on creating a Many to Many relationship between Users and Alerts.

Taking all of this into consideration, I would recommend Option A.

I hope this helps,

Matt

Best Answer

Related Solutions

SQL Server – Methods to Speed Up a Huge DELETE FROM Table with No Clauses

Sql-server – Indexing – Uniqueidentifier Foreign Key or Intermediary mapping table

Related Question