Mysql – Retrieving most recent X rows for each given user from a table

mariadbMySQLperformanceperformance-tuning

We've got a seemingly simple query that's running extremely slowly, but I can't quite figure out the best solution to the problem. The scenario is as follows.

We have an inbox table, which has the columns (with a few left out to keep things simple – the database structure is extremely old and not particularly well normalized):

  • id int(11) NOT NULL AUTO_INCREMENT PRIMARY KEY
  • recipient varchar(55) KEY
  • sender varchar(55) KEY
  • subject varchar(255) DEFAULT NULL
  • message mediumtext NOT NULL

This table has a number of rows, normally in the magnitude of 200,000+. The table uses the InnoDB engine.

One of our background tasks runs every night to try and manage the inbox to keep the table size down by archiving messages. It attempts to keep only the X most recent message for each recipient (where X is configurable – it is usually 30). Currently the system runs a query to get the IDs of the messages to keep, then deletes the messages where ID isn't in this list. The current query (which runs extremely slowly) to get the IDs to keep is:

SELECT  x.id
    FROM  inbox x
    JOIN  inbox y ON y.recipient = x.recipient
                 AND  y.id >= x.id
    GROUP BY (x.recipient), x.id
    HAVING  COUNT(*) <= 30; 

This query runs extremely slowly and bogs the whole system down, with MySQL using very large amounts of CPU time. I enabled slow query logging, and got the below output:

# Time: 160208 10:51:17
# User@Host: root[root] @  [127.0.0.1]
# Thread_id: 29  Schema: test QC_hit: No
# Query_time: 3568.641446  Lock_time: 0.000718  Rows_sent: 1312  Rows_examined: 585499843
SET timestamp=1454928677;
Select  x.id
    FROM  inbox x
    JOIN  inbox y ON y.recipient = x.recipient
                 AND  y.id >= x.id
    GROUP BY (x.recipient), x.id
    HAVING  COUNT(*) <= 512; 

The whole query seems like a bit of a mess, but I can't currently think of the best way to go about doing what I want. It's easy enough to get the single most recent message for each recipient, but I can't think of the best way of expanding that.

I've put together a quick SQL Fiddle with the database structure and a small subset of data, found here: http://sqlfiddle.com/#!9/04df7/1

I'm currently running MariaDB 10.0.23, which is the current most recent MariaDB 10.0 series release.

Best Answer

I hope you are willing to do some programming outside of SQL.

To find the recipients that need the most pruning:

SELECT recipient
    FROM inbox
    GROUP BY recipient
    HAVING COUNT(*) > 30
    ORDER BY COUNT(*) DESC
    LIMIT 100

Then, for each recipient, first find the 31st id:

SELECT id
    FROM inbox
    WHERE recipient = $recipient
    ORDER BY id DESC
    LIMIT 30,1

And delete the excess baggage:

DELETE FROM inbox
    WHERE recipient = $recipient
      AND id < $id

You would need INDEX(recipient, id).

This process could be running continually. The first SQL would be the slowest, but it would run "Using index" and probably not be too bad, maybe 0.1 sec for 200K rows in inbox.

If you want to stay in SQL, this would let you do one recipient at a time:

BEGIN;
SELECT @recipient := recipient
    FROM inbox
    GROUP BY recipient
    HAVING COUNT(*) > 30
    ORDER BY COUNT(*) DESC
    LIMIT 1
    FOR UPDATE;
SELECT @id := id
    FROM inbox
    WHERE recipient = @recipient
    ORDER BY id DESC
    LIMIT 30,1
    FOR UPDATE;
DELETE FROM inbox
    WHERE recipient = @recipient
      AND id < @id;
COMMIT;

Then, put that in a Stored Procedure with a loop around it. And, if you want to be further 'nice', add a SELECT SLEEP(1); in the loop but outside the transaction.

(Trying to do the entire thing in a single statement makes my brain hurt.)