Using PostgreSQL 11.
Consider a table like
CREATE TABLE "logs"
(
    "id" INTEGER NOT NULL,
    "userId" INTEGER NOT NULL,
    "timestamp" TIMESTAMP NOT NULL,
    CONSTRAINT "PK_8d33b9f1a33b412e4865d1e5465" PRIMARY KEY ("id")
);
Now, the requirement is that only 100 rows are stored per userId. If more data comes in, the oldest logs have to be deleted. If, for a short time, 101 rows are stored, it's not the end of the world, however. It's fine if the superfluous row gets deleted with a few seconds' delay.
I cannot create a database TRIGGER. So, I need to write a query which is triggered on a log creation event in the application layer.
Pure SQL is preferred over plpgsql.
This is the solution I came up with:
WITH "userLogs" AS (SELECT id, timestamp FROM "logs"
WHERE "userId" = $1
),
"countLogs" AS (SELECT count(id) FROM "userLogs")
DELETE FROM "logs" WHERE id = ANY
(
SELECT id FROM "userLogs"
ORDER BY "timestamp" ASC
LIMIT GREATEST( (SELECT count FROM "countLogs") - 100, 0)
);
The idea is: always run a DELETE and base the decision whether anything actually has to be deleted on the LIMIT of a sub-query. If there are more than 100 logs, the sub-query will return the ids of the oldest ones to drop. Otherwise, LIMIT will be 0, the sub-query won't return anything, and nothing gets deleted.
My questions are now:
- Is it sensible to run a DELETE query on each INSERT – even if it doesn't delete anything?
- Are there any performance implications here? (Or other pitfalls I might not be aware of?)
- I am not quite sure if I need a LOCK. In my tests I could not produce any unexpected behavior when running INSERTs in parallel, but could it be that there are edge cases where I'd need a LOCK?
Edit: It's hard to predict how many times an INSERT will be run against that table. If all goes well (business-wise), it could be a few thousand times a day in sum – and a few dozen times per user each day.
Edit 2: timestamp values are not necessarily unique per user: there can be multiple log entries with the same timestamp and the same userId. It is expected that the table will get more columns containing what actually happened.
Best Answer
If you have an index on user_id, you can drop it and replace it with an index on (user_id,timestamp). This will also save a sort when displaying the latest log entries (WHERE user_id=... ORDER BY timestamp DESC LIMIT n).
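For example (a sketch; the index names here are illustrative, the quoted camel-case identifiers come from your schema):

-- drop the single-column index and replace it with a composite one
DROP INDEX IF EXISTS "logs_userId_idx";
CREATE INDEX "logs_userId_timestamp_idx" ON "logs" ("userId", "timestamp");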
Then:
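Something along these lines should do it (a sketch; $1 is the user id, and OFFSET is one way to reach the 100th-newest row):

-- find the timestamp of the 100th-newest log row for one user
SELECT "timestamp"
FROM "logs"
WHERE "userId" = $1
ORDER BY "timestamp" DESC
OFFSET 99 LIMIT 1;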
If there are more than 100 rows, this will return the timestamp of the 100th row. Otherwise it will return nothing.

To delete the old logs for one user:
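For instance (again a sketch; the strict < means nothing is deleted when the sub-select finds no cutoff row):

-- keep the 100 newest rows for one user, delete everything older
DELETE FROM "logs"
WHERE "userId" = $1
  AND "timestamp" < (
      SELECT "timestamp"
      FROM "logs"
      WHERE "userId" = $1
      ORDER BY "timestamp" DESC
      OFFSET 99 LIMIT 1
  );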
This is a very fast query. If the select doesn't find any rows to delete, it will be well under 1ms.
To delete all the old logs:
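A naive version could correlate the cutoff per row (a sketch):

-- delete old rows for every user; the correlated sub-select runs per row
DELETE FROM "logs" l
WHERE "timestamp" < (
      SELECT "timestamp"
      FROM "logs"
      WHERE "userId" = l."userId"
      ORDER BY "timestamp" DESC
      OFFSET 99 LIMIT 1
  );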
This will probably seq-scan logs, so it could be slow. Here's a better one which will exploit the index on (user_id, timestamp) and be fast if there is nothing to do:
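Perhaps something like this (a sketch; it lists the distinct userIds and computes each user's cutoff with a LATERAL sub-select, both of which can walk the composite index):

-- compute a per-user cutoff via the (userId, timestamp) index, then delete below it
DELETE FROM "logs"
USING (
    SELECT u."userId", l."timestamp" AS "cutoff"
    FROM (SELECT DISTINCT "userId" FROM "logs") u
    CROSS JOIN LATERAL (
        SELECT "timestamp"
        FROM "logs"
        WHERE "userId" = u."userId"
        ORDER BY "timestamp" DESC
        OFFSET 99 LIMIT 1
    ) l
) cutoffs
WHERE "logs"."userId" = cutoffs."userId"
  AND "logs"."timestamp" < cutoffs."cutoff";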
To answer your comment "what if many logs all have the same timestamp?"... Well, this should never happen: if you want your logs to be useful, they should be ordered by something unique, otherwise you don't know in what order they were logged. But... you can simply use the primary key:
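For the single-user delete, that could look like this (a sketch; the row-wise comparison uses id as a tie-breaker):

-- keep the 100 rows that sort last by (timestamp, id), i.e. the newest ones
DELETE FROM "logs"
WHERE "userId" = $1
  AND ("timestamp", "id") < (
      SELECT "timestamp", "id"
      FROM "logs"
      WHERE "userId" = $1
      ORDER BY "timestamp" DESC, "id" DESC
      OFFSET 99 LIMIT 1
  );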
So if they have the same timestamp, the ORDER BY will keep the highest ids which should have been inserted last.