Sql-server – Delete rows from a large table with joins and aggregating (resulting in 1 % less rows in the big table)

sql serversql-server-2008-r2

I got a bit stuck on performance, so I thought about asking a bit of help. I have a working query, but unfortunatelly it feels rather slow and I know it's not as performant as it could be made (easily?). The situation is to delete rows from a table which contain around one million rows, by doing a join to another table likewise having around a million rows worth of data (these tables have about the same amount of rows). The result after removing rows from SomeList is that there's about 1 % less rows

The table structure is as follows

CREATE TABLE SomeList
(
    Id INT IDENTITY(1,1) PRIMARY KEY,
    TimeData DateTime NOT NULL
);

CREATE TABLE SomeListAuxData
(
    Id INT IDENTITY(1,1) PRIMARY KEY,
    CountData INT NOT NULL,
    SomeListId INT FOREIGN KEY REFERENCES SomeList(Id) ON DELETE CASCADE
);

And the query to delete as follows

DELETE
FROM SomeList 
WHERE TimeData < @someTime AND Id = @someListId AND
(
    SELECT COUNT(*)
    FROM SomeListAuxData
    WHERE SomeList.Id = SomeListAuxData.SomeListId
) > 0
AND
(
    SELECT MAX(CountData)
    FROM SomeListAuxData
    WHERE SomeList.Id = SomeListAuxData.SomeListId
) < @someValue;

Specifically this seem wasteful since I'm doing two subqueries, but I'm not sure how to go about the COUNT(*) and MAX(CountData) parts if I'd try joining, for instance.

This query is made in a program code where the someListIds are looped with same the @someValue (e.g. 2000) and @someTime (e.g. '2013-11-07 09:00:00.000') dates, so that too is rather slow. The query could be done for all the SomeList rows at once.

<edit: Also, I just learned that the code calling does have a list of pairs of type (@someListId, @someDate), which means the date isn't a constant. The input comes from the user interfaces and typically there may be even hundreds of such pairs.

Also, maybe scheduling a job at night may do the job. Though I'll need to checks Craig's advice which seem to be valid too. 🙂

<edit: I clarified what I meant by the "1 %" in my question. The point to make was that I don't think the data warrants collecting the non-removable rows to a #temptable, truncating SomeList and then moving the data from #temptable" back to SomeList. Craig's good point answered (partially) a different question, nevertheless it may be applicable. Sorry for my sloppy writing.

<edit 2013-11-08: The join conditions corrected as suggeted in comments.

Best Answer

You can create an inline function which does return only these rows which are to be deleted:

CREATE FUNCTION [uf_deletable_SomeLists]
(
    @someTime DATETIME,
    @someValue INT
)
RETURNS TABLE
AS
RETURN
(
    SELECT sl.[Id]
    FROM [SomeList] AS sl
    INNER JOIN [SomeListAuxData] AS sla
        ON sl.[Id] = sla.[SomeListId]
        AND sl.[TimeData] < @someTime
    GROUP BY sl.[Id]
    HAVING MAX(sla.[CountData]) < @someValue
)

If you would call the function on the following table set up

SELECT * FROM [SomeList]

Id          TimeData
----------- -----------------------
2           2013-11-06 16:55:50.280
3           2013-11-06 16:55:56.667
4           2013-11-06 16:56:24.217

SELECT * FROM [SomeListAuxData]

Id          CountData   SomeListId
----------- ----------- -----------
2           1           2
4           5           2
3           5           4
5           10          4

the result would look like this

SELECT * FROM [uf_deletable_SomeLists]('20140101', 10)

Id
-----------
2

Now you can combine the inline function with a delete statement using a join

DELETE [SomeList]
FROM [SomeList] AS sl
INNER JOIN [uf_deletable_SomeLists](@someTime, @someValue) AS d
    ON sl.[Id] = d.[Id]
    AND sl.[Id] = @someListId

If you want you can remove the @someListId, that will use the whole SomeList table. Or you add a second join with a temporary table containing SomeList Id's which you may want to delete.

You can include the percentage statement mentioned by Craig Efrein as well...

DELETE TOP(1) PERCENT [SomeList]
FROM [SomeList] AS sl (...)

To boost the select you could consider using an index. For my machine the execution plan was a bit better if I used the following index.

CREATE NONCLUSTERED INDEX [ix_SomeListAuxData_SomeListId] ON [SomeListAuxData]
    ( [SomeListId] ASC )
INCLUDE ( [CountData] )


UPDATE:
You can always take the query from the function and use it directly as a nested one for the delete statement. But creating the function gives you the option to reuse the query for example to check beforehand which would be deleted if you would call the delete. Incidentally the inline function is basically just a server-side stored parameterised query (I would have used a view if @someTime and @someValue would have been constants). The benefits of inline functions have been discussed in other posts or pages, i.e. see here.