Using PostgreSQL 11.
Consider a table like
CREATE TABLE "logs"
(
    "id" INTEGER NOT NULL,
    "userId" INTEGER NOT NULL,
    "timestamp" TIMESTAMP NOT NULL,
    CONSTRAINT "PK_8d33b9f1a33b412e4865d1e5465" PRIMARY KEY ("id")
);
Now, the requirement is that only 100 rows are stored per userId. If more data comes in, the oldest logs have to be deleted. If, for a short time, 101 rows are stored, it's not the end of the world, however. It's fine if the superfluous row gets deleted with a few seconds' delay.
I cannot create a database TRIGGER. So, I need to write a query which is triggered on a log creation event in the application layer.
Pure SQL is preferred over plpgsql.
This is the solution I came up with:
WITH "userLogs" AS (SELECT id, timestamp FROM "logs"
WHERE "userId" = $1
),
"countLogs" AS (SELECT count(id) FROM "userLogs")
DELETE FROM "logs" WHERE id = ANY
(
SELECT id FROM "userLogs"
ORDER BY "timestamp" ASC
LIMIT GREATEST( (SELECT count FROM "countLogs") - 100, 0)
);
The idea is: always run a DELETE and base the decision whether anything actually has to be deleted on the LIMIT of a sub-query. If there are more than 100 logs, the sub-query will return the ids of the oldest ones to drop. Otherwise, LIMIT will be 0, the sub-query won't return anything, and nothing gets deleted.
My questions are now:
- Is it sensible to run a DELETE query on each INSERT – even if it doesn't delete anything?
- Are there any performance implications here? (Or other pitfalls I might not be aware of?)
- I am not quite sure if I need a LOCK. In my tests I could not produce any unexpected behavior when running INSERTs in parallel, but could it be that there are edge cases where I'd need a LOCK?
Edit: It's hard to predict how many times an INSERT will be run against that table. If all goes well (business-wise), it could be a few thousand times a day in sum – and a few dozen times per user each day.
Edit 2: timestamp values are not necessarily unique per user: there can be multiple log entries with the same timestamp and the same userId. It is expected that the table will get more columns containing what actually happened.
Best Answer
If you have an index on user_id, you can drop it and replace it with an index on (user_id,timestamp). This will also save a sort when displaying the latest log entries (WHERE user_id=... ORDER BY timestamp DESC LIMIT n).
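For example (a sketch; the index names here are illustrative, the quoted camel-case identifiers come from your schema):

-- drop the single-column index and replace it with a composite one
DROP INDEX IF EXISTS "logs_userId_idx";
CREATE INDEX "logs_userId_timestamp_idx" ON "logs" ("userId", "timestamp");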
Then:
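Something along these lines should do it (a sketch; $1 is the user id, and OFFSET is one way to reach the 100th-newest row):

-- find the timestamp of the 100th-newest log row for one user
SELECT "timestamp"
FROM "logs"
WHERE "userId" = $1
ORDER BY "timestamp" DESC
OFFSET 99 LIMIT 1;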
If there are more than 100 rows, this will return the timestamp of the 100th row. Otherwise it will return nothing.

To delete the old logs for one user:
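For instance (again a sketch; the strict < means nothing is deleted when the sub-select finds no cutoff row):

-- keep the 100 newest rows for one user, delete everything older
DELETE FROM "logs"
WHERE "userId" = $1
  AND "timestamp" < (
      SELECT "timestamp"
      FROM "logs"
      WHERE "userId" = $1
      ORDER BY "timestamp" DESC
      OFFSET 99 LIMIT 1
  );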
This is a very fast query. If the select doesn't find any rows to delete, it will be well under 1ms.
To delete all the old logs:
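A naive version could correlate the cutoff per row (a sketch):

-- delete old rows for every user; the correlated sub-select runs per row
DELETE FROM "logs" l
WHERE "timestamp" < (
      SELECT "timestamp"
      FROM "logs"
      WHERE "userId" = l."userId"
      ORDER BY "timestamp" DESC
      OFFSET 99 LIMIT 1
  );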
This will probably seq-scan logs, so it could be slow. Here's a better one which will exploit the index on (user_id, timestamp) and be fast if there is nothing to do:
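Perhaps something like this (a sketch; it lists the distinct userIds and computes each user's cutoff with a LATERAL sub-select, both of which can walk the composite index):

-- compute a per-user cutoff via the (userId, timestamp) index, then delete below it
DELETE FROM "logs"
USING (
    SELECT u."userId", l."timestamp" AS "cutoff"
    FROM (SELECT DISTINCT "userId" FROM "logs") u
    CROSS JOIN LATERAL (
        SELECT "timestamp"
        FROM "logs"
        WHERE "userId" = u."userId"
        ORDER BY "timestamp" DESC
        OFFSET 99 LIMIT 1
    ) l
) cutoffs
WHERE "logs"."userId" = cutoffs."userId"
  AND "logs"."timestamp" < cutoffs."cutoff";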
To answer your comment "what if many logs all have the same timestamp?"... Well, this should never happen: if you want your logs to be useful, they should be ordered by something unique, otherwise you don't know in what order they were logged. But... you can simply use the primary key:
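For the single-user delete, that could look like this (a sketch; the row-wise comparison uses id as a tie-breaker):

-- keep the 100 rows that sort last by (timestamp, id), i.e. the newest ones
DELETE FROM "logs"
WHERE "userId" = $1
  AND ("timestamp", "id") < (
      SELECT "timestamp", "id"
      FROM "logs"
      WHERE "userId" = $1
      ORDER BY "timestamp" DESC, "id" DESC
      OFFSET 99 LIMIT 1
  );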
So if they have the same timestamp, the ORDER BY will keep the highest ids which should have been inserted last.