If you aim is to have queries with maximum efficiency, none of the above queries is really the best. Not always at least.
Efficiency depends on many different things, like the specific DBMS, the specific version (different versions have different improvements on the optimizer and the available syntax), the type of columns, the indexes available, the size of the tables and distribution of values, the hardware the server is running, the configuration settings etc.
You should always test various different ways of writing the queries, on your tables, with the sizes and distribution you expect to have on production, with your hardware and configuration settings, to decide which rewritings of the queries should be kept.
This specific kind of query is often called greatest-n-per-group
(there is even a tag for it!) and under certain assumptions, one of the many ways to write them, is often quite efficient in both MySQL and PostgreSQL. It uses a LATERAL
join in Postgres, which is available in 9.3+ versions (in SQL Server lingo CROSS/OUTER APPLY
) and a simulation of this join in MySQL.
The assumptions are that the number of authors (the attribute we group by on) is small, compared to the number of posts (the table where we apply the group by). It's also best if there is an index or a table to find all the distinct author_id
values and an additional index on the posts
table for the group by.
This solution to the greatest-n-per-group problem matches also your request about ties, as it returns always one result per group. If you want to be precise about which one (of the tied) will be returned, the ORDER BY
in the subquery can be modified (to ORDER BY pi.date DESC, pi.id DESC
or ORDER BY pi.date DESC, a.name
for example).
Query in PostgreSQL:
SELECT p.*
FROM authors AS a
, LATERAL
( SELECT pi.*
FROM posts AS pi
WHERE pi.author_id = a.author_id
ORDER BY pi.date DESC
LIMIT 1
) AS p ;
Query in MySQL:
SELECT p.*
FROM authors AS a
JOIN posts AS p
ON p.id =
( SELECT pi.id
FROM posts AS pi
WHERE pi.author_id = a.author_id
ORDER BY pi.date DESC
LIMIT 1
) ;
The useful index is on posts (author_id, date, id)
for MySQL and or on posts (author_id, date DESC)
for Postgres.
Needless to say again but before using any of the above, they should be tested in your environment and cross tested against all the many other versions/rewritings of the query. In Postgres for example, the DISTINCT ON
syntax can be used in version older than 9.3. The resulting query is more compact than the LATERAL
and might be more efficient, under different data distributions. Query:
SELECT DISTINCT ON (author_id) p.*
FROM posts AS p
ORDER BY p.author_id,
p.date DESC ;
I hope you are willing to do some programming outside of SQL.
To find the recipients
that need the most pruning:
SELECT recipient
FROM inbox
GROUP BY recipient
HAVING COUNT(*) > 30
ORDER BY COUNT(*) DESC
LIMIT 100
Then, for each recipient, first find the 31st id:
SELECT id
FROM inbox
WHERE recipient = $recipient
ORDER BY id DESC
LIMIT 30,1
And delete the excess baggage:
DELETE FROM inbox
WHERE recipient = $recipient
AND id < $id
You would need INDEX(recipient, id)
.
This process could be running continually. The first SQL would be the slowest, but it would run "Using index" and probably not be too bad, maybe 0.1 sec for 200K rows in inbox.
If you want to stay in SQL, this would let you do one recipient at a time:
BEGIN;
SELECT @recipient := recipient
FROM inbox
GROUP BY recipient
HAVING COUNT(*) > 30
ORDER BY COUNT(*) DESC
LIMIT 1
FOR UPDATE;
SELECT @id := id
FROM inbox
WHERE recipient = @recipient
ORDER BY id DESC
LIMIT 30,1
FOR UPDATE;
DELETE FROM inbox
WHERE recipient = @recipient
AND id < @id;
COMMIT;
Then, put that in a Stored Procedure with a loop around it. And, if you want to be further 'nice', add a SELECT SLEEP(1);
in the loop but outside the transaction.
(Trying to do the entire thing in a single statement makes my brain hurt.)
Best Answer
Why can't you use the MAX() function?
I did the following
MAX(myd) works perfectly. Could you show where
"I can't get MAX() to function with this."
is failing, along with the error message?