PostgreSQL – Enforcing Row Limits with DELETE on INSERT

Tags: delete, postgresql, postgresql-11

Using PostgreSQL 11.
Consider a table like

CREATE TABLE "logs" 
    (
      "id" INTEGER NOT NULL,
      "userId" INTEGER NOT NULL, 
      "timestamp" TIMESTAMP NOT NULL,
      CONSTRAINT "PK_8d33b9f1a33b412e4865d1e5465" PRIMARY KEY ("id")
     )

Now, the requirement is that at most 100 rows are stored per userId. If more data comes in, the oldest logs have to be deleted. It's not the end of the world if 101 rows exist for a short time; it's fine if the superfluous row gets deleted with a few seconds' delay.

I cannot create a database TRIGGER, so I need to write a query that is run on a log-creation event at the application layer.

Pure SQL is preferred over plpgsql.

This is the solution I came up with:

WITH "userLogs" AS (SELECT id, timestamp FROM "logs"
                    WHERE "userId" = $1
                ),
"countLogs" AS (SELECT count(id) FROM "userLogs")
        
DELETE FROM "logs" WHERE id = ANY
                (
                    SELECT id FROM "userLogs" 
                    ORDER BY "timestamp" ASC 
                    LIMIT GREATEST( (SELECT count FROM "countLogs") - 100, 0)
                );

The idea is: always run the DELETE and let the sub-query's LIMIT decide whether anything actually has to be deleted. If there are more than 100 logs, the sub-query returns the ids of the oldest rows to drop. Otherwise, the LIMIT is 0, the sub-query returns nothing and nothing gets deleted.
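For completeness, the application-layer flow this implies is just the INSERT followed by that trim query as a second statement; a minimal sketch (the parameter numbering and the use of now() are illustrative, not taken from my actual application code):

-- 1) insert the new log row (values supplied by the application)
INSERT INTO "logs" ("id", "userId", "timestamp") VALUES ($1, $2, now());

-- 2) immediately afterwards, run the trim query above with the same "userId" bound as its $1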

My questions are now:

  1. Is it sensible to run a DELETE query on each INSERT, even if it
    doesn't delete anything?
  2. Are there any performance implications here? (Or other pitfalls I might not be aware of?)
  3. I am not quite sure if I need a LOCK. In my tests I could not
    produce any unexpected behavior when running INSERTs in parallel,
    however could it be that there are edge cases where I'd need a LOCK?

Edit: It's hard to predict how many times an INSERT will be run against that table. If all goes well (business-wise), it could be a few thousand times a day in total, and a few dozen times per user each day.

Edit 2: timestamp values are not necessarily unique per user: there can be multiple log entries with the same timestamp and the same userId. It is expected that the table will get more columns containing what actually happened.

Best Answer

If you have an index on "userId", you can drop it and replace it with an index on ("userId", "timestamp"). This will also save a sort when displaying the latest log entries (WHERE "userId" = ... ORDER BY "timestamp" DESC LIMIT n).
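For example (the index names here are placeholders; the existing single-column index may be called something else):

DROP INDEX IF EXISTS logs_userid_idx;
CREATE INDEX logs_userid_timestamp_idx ON "logs" ("userId", "timestamp");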

Then:

SELECT timestamp FROM logs WHERE "userId" = 1 ORDER BY timestamp DESC LIMIT 1 OFFSET 100;

If there are more than 100 rows, this will return the timestamp of the 101st-newest row. Otherwise it will return nothing. To delete the old logs for one user:

DELETE FROM logs WHERE "userId" = 1 AND timestamp <=
    (SELECT timestamp FROM logs WHERE "userId" = 1 ORDER BY timestamp DESC LIMIT 1 OFFSET 100);

This is a very fast query. If the select doesn't find any rows to delete, it will be well under 1ms.
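If you want to verify the plan, one way is to run EXPLAIN ANALYZE inside a transaction and roll it back, since EXPLAIN ANALYZE actually executes the DELETE:

BEGIN;
EXPLAIN (ANALYZE, BUFFERS)
DELETE FROM logs WHERE "userId" = 1 AND timestamp <=
    (SELECT timestamp FROM logs WHERE "userId" = 1 ORDER BY timestamp DESC LIMIT 1 OFFSET 100);
ROLLBACK;  -- undo the delete performed by EXPLAIN ANALYZE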

To delete all the old logs:

DELETE FROM logs
USING (SELECT users."userId", f.timestamp
       FROM users
       CROSS JOIN LATERAL (SELECT timestamp FROM logs
                           WHERE logs."userId" = users."userId"
                           ORDER BY timestamp DESC LIMIT 1 OFFSET 100) f) oldlogs
WHERE logs."userId" = oldlogs."userId" AND logs.timestamp <= oldlogs.timestamp;

This will probably seq-scan logs, so it could be slow. Here's a better one, which will exploit the index on ("userId", "timestamp") and be fast if there is nothing to do:

DELETE FROM logs USING
( SELECT users."userId", oldlogs.timestamp
  FROM users
  CROSS JOIN LATERAL (SELECT timestamp FROM logs
                      WHERE logs."userId" = users."userId"
                      ORDER BY timestamp DESC OFFSET 100) oldlogs ) o
WHERE logs."userId" = o."userId" AND logs.timestamp = o.timestamp;

To answer your comment "what if many logs all have the same timestamp?": ideally this should never happen, because if you want your logs to be useful they should be ordered by something unique, otherwise you don't know in what order they were logged. But you can simply use the primary key as a tie-breaker:

-- one user
DELETE FROM logs USING
( SELECT id FROM logs WHERE logs."userId" = 123 ORDER BY timestamp DESC, id DESC OFFSET 100 ) o
WHERE logs.id=o.id;

-- all users
DELETE FROM logs USING
( SELECT oldlogs.id FROM users 
  CROSS JOIN LATERAL (SELECT id FROM logs WHERE logs."userId" = users."userId" ORDER BY timestamp DESC, id DESC OFFSET 100) oldlogs ) o
WHERE logs.id=o.id;

So if rows have the same timestamp, the ORDER BY keeps the ones with the highest ids, which should have been inserted last.
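If you go with the primary-key tie-breaker, a composite index that ends with id lets the per-user ORDER BY timestamp DESC, id DESC be read straight from the index; a possible definition (the name is just a placeholder):

CREATE INDEX logs_userid_timestamp_id_idx ON "logs" ("userId", "timestamp", "id");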