Mysql – Select ONE most recent post for each author

greatest-n-per-group, MySQL, PostgreSQL

I'm sure it's a simple question, and I suppose it has been asked many times, but I just can't figure it out from the other answers, sorry.

I use recent versions of PostgreSQL and MySQL.
I have 2 tables:

CREATE TABLE authors (
    id INT,
    name VARCHAR
);

CREATE TABLE posts (
    id INT,
    author_id INT,
    text VARCHAR,
    date DATE
);

I need to select one most recent post for each author. Thanks!

UPDATE

Thanks, both links provide answers to my question, with one exception.
All the following queries give the same result (which one is the most efficient, by the way?). The issue arises when an author has more than one post with the same date: the result set then contains all such posts. How should I modify these queries to return exactly one post per author?

SELECT p1.*
FROM posts p1
LEFT JOIN posts p2 ON p1.author_id = p2.author_id AND p1.date < p2.date
WHERE p2.author_id IS NULL
ORDER BY p1.author_id;

SELECT p1.* 
FROM posts p1
INNER JOIN (
  SELECT author_id, MAX(date) AS max_date
  FROM posts
  GROUP BY author_id) p2
  ON p1.author_id = p2.author_id AND p1.date = p2.max_date
ORDER BY p1.author_id;

SELECT *
FROM posts p1
WHERE date = (SELECT MAX(p2.date)
              FROM posts p2
              WHERE p1.author_id = p2.author_id)
ORDER BY author_id;

SELECT * FROM (
    SELECT author_id, MAX(date) date
    FROM posts GROUP BY author_id
) p1 INNER JOIN posts p2 USING (author_id, date)
ORDER BY author_id;

Best Answer

If your aim is to have queries with maximum efficiency, none of the above queries is really the best. Not always, at least.

Efficiency depends on many different things: the specific DBMS, the specific version (different versions bring different improvements to the optimizer and to the available syntax), the types of the columns, the indexes available, the size of the tables and the distribution of values, the hardware the server is running on, the configuration settings, etc.

You should always test various different ways of writing the queries, on your tables, with the sizes and distribution you expect to have on production, with your hardware and configuration settings, to decide which rewritings of the queries should be kept.

This specific kind of query is often called greatest-n-per-group (there is even a tag for it!). Under certain assumptions, one of the many ways to write it is often quite efficient in both MySQL and PostgreSQL. It uses a LATERAL join in Postgres, available in version 9.3 and later (CROSS APPLY / OUTER APPLY in SQL Server lingo), and a simulation of that join in MySQL.

The assumptions are that the number of authors (the attribute we group by) is small compared to the number of posts (the table the grouping is applied to). It also helps to have an index or a table from which all the distinct author_id values can be read, and an additional index on the posts table to support the grouping.

This solution to the greatest-n-per-group problem also meets your requirement about ties, as it always returns one result per group. If you want to control which of the tied posts is returned, modify the ORDER BY in the subquery (to ORDER BY pi.date DESC, pi.id DESC or ORDER BY pi.date DESC, pi.text, for example).

Query in PostgreSQL:

SELECT p.* 
FROM authors AS a
   , LATERAL 
       ( SELECT pi.*
         FROM posts AS pi
         WHERE pi.author_id = a.id
         ORDER BY pi.date DESC
         LIMIT 1
       ) AS p ;

Query in MySQL:

SELECT p.* 
FROM authors AS a
  JOIN posts AS p
    ON p.id =
       ( SELECT pi.id
         FROM posts AS pi
         WHERE pi.author_id = a.id
         ORDER BY pi.date DESC
         LIMIT 1
       ) ;

The useful index is on posts (author_id, date, id) for MySQL and on posts (author_id, date DESC) for Postgres.
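To see the tie-breaking behaviour on a toy dataset, here is a minimal sketch that runs the MySQL-style query against an in-memory SQLite database from Python. SQLite is only a stand-in that happens to accept this syntax, the sample rows and the index name are made up for illustration, and the join uses authors.id as defined in the question's schema. It says nothing about performance in MySQL or Postgres.

```python
import sqlite3

# In-memory SQLite is only a stand-in for MySQL/PostgreSQL here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INT, name VARCHAR(50));
    CREATE TABLE posts (id INT, author_id INT, text VARCHAR(200), date DATE);
    -- hypothetical index name; columns as suggested for MySQL
    CREATE INDEX posts_author_date_id ON posts (author_id, date, id);

    INSERT INTO authors VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO posts VALUES
        (10, 1, 'older post',   '2020-01-01'),
        (11, 1, 'tie, low id',  '2020-02-01'),
        (12, 1, 'tie, high id', '2020-02-01'),
        (20, 2, 'only post',    '2020-03-01');
""")

latest_per_author = conn.execute("""
    SELECT p.id, p.author_id, p.date
    FROM authors AS a
      JOIN posts AS p
        ON p.id =
           ( SELECT pi.id
             FROM posts AS pi
             WHERE pi.author_id = a.id
             ORDER BY pi.date DESC, pi.id DESC   -- id breaks ties on date
             LIMIT 1
           )
    ORDER BY p.author_id
""").fetchall()

print(latest_per_author)
```

Author 1 has two posts on the same date, yet only the one with the higher id comes back, because pi.id DESC makes the ordering inside the subquery deterministic.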

It bears repeating: before using any of the above, test them in your environment and compare them against the many other rewritings of the query. In Postgres, for example, the DISTINCT ON syntax can be used, and it also works in versions older than 9.3. The resulting query is more compact than the LATERAL one and might be more efficient under different data distributions. Query:

SELECT DISTINCT ON (author_id) p.*
FROM posts AS p
ORDER BY p.author_id,
         p.date DESC ;
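Another rewriting worth including in such a comparison is a ROW_NUMBER() window function, which both PostgreSQL and MySQL 8.0+ support. The sketch below is not part of the queries above; it is an illustration that runs against in-memory SQLite (3.25 or later, used only as a stand-in) with made-up sample rows, and it would need the same benchmarking as every other variant.

```python
import sqlite3

# In-memory SQLite (>= 3.25 for window functions) as a stand-in database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (id INT, author_id INT, text VARCHAR(200), date DATE);
    INSERT INTO posts VALUES
        (10, 1, 'older post',   '2020-01-01'),
        (11, 1, 'tie, low id',  '2020-02-01'),
        (12, 1, 'tie, high id', '2020-02-01'),
        (20, 2, 'only post',    '2020-03-01');
""")

newest_posts = conn.execute("""
    SELECT id, author_id, date
    FROM ( SELECT p.*,
                  -- number each author's posts, newest first; id breaks ties
                  ROW_NUMBER() OVER (PARTITION BY p.author_id
                                     ORDER BY p.date DESC, p.id DESC) AS rn
           FROM posts AS p
         ) AS ranked
    WHERE rn = 1
    ORDER BY author_id
""").fetchall()

print(newest_posts)
```

Changing rn = 1 to rn <= 3 would give the three most recent posts per author, which is the general greatest-n-per-group case.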

Related Solutions

Mysql – Retrieving most recent X rows for each given user from a table

I hope you are willing to do some programming outside of SQL.

To find the recipients that need the most pruning:

SELECT recipient
    FROM inbox
    GROUP BY recipient
    HAVING COUNT(*) > 30
    ORDER BY COUNT(*) DESC
    LIMIT 100

Then, for each recipient, first find the 31st id:

SELECT id
    FROM inbox
    WHERE recipient = $recipient
    ORDER BY id DESC
    LIMIT 30,1

And delete the excess baggage:

DELETE FROM inbox
    WHERE recipient = $recipient
      AND id <= $id   -- "<=" removes the 31st row too, leaving exactly 30

You would need INDEX(recipient, id).

This process could be running continually. The first SQL would be the slowest, but it would run "Using index" and probably not be too bad, maybe 0.1 sec for 200K rows in inbox.
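As a rough idea of what the "programming outside of SQL" could look like, here is a sketch of the three steps in Python. It uses an in-memory SQLite table as a stand-in for the real inbox, and the function name prune_inbox, its parameters, and the sample data are all hypothetical. Note that it deletes the 31st-newest row as well (id <= cutoff), so that exactly `keep` rows remain and the recipient no longer matches HAVING COUNT(*) > 30 on the next pass.

```python
import sqlite3

def prune_inbox(conn, keep=30, batch=100):
    """Trim over-full recipients down to their `keep` newest rows.

    Returns the number of rows deleted. Hypothetical helper for illustration.
    """
    deleted = 0
    # Step 1: recipients with more than `keep` rows, worst offenders first.
    recipients = [r for (r,) in conn.execute(
        "SELECT recipient FROM inbox GROUP BY recipient "
        "HAVING COUNT(*) > ? ORDER BY COUNT(*) DESC LIMIT ?", (keep, batch))]
    for recipient in recipients:
        # Step 2: id of the (keep + 1)-th newest row, i.e. the newest row to drop.
        row = conn.execute(
            "SELECT id FROM inbox WHERE recipient = ? "
            "ORDER BY id DESC LIMIT 1 OFFSET ?", (recipient, keep)).fetchone()
        if row is None:
            continue  # nothing to prune (rows vanished since step 1)
        # Step 3: delete that row and everything older.
        cur = conn.execute(
            "DELETE FROM inbox WHERE recipient = ? AND id <= ?",
            (recipient, row[0]))
        deleted += cur.rowcount
    conn.commit()
    return deleted

# Toy data: 'ann' has 35 messages, 'bob' has 5.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inbox (id INTEGER PRIMARY KEY, recipient TEXT)")
conn.execute("CREATE INDEX inbox_recipient_id ON inbox (recipient, id)")
conn.executemany("INSERT INTO inbox (recipient) VALUES (?)",
                 [("ann",)] * 35 + [("bob",)] * 5)

removed = prune_inbox(conn)
counts = dict(conn.execute(
    "SELECT recipient, COUNT(*) FROM inbox GROUP BY recipient"))
print(removed, counts)
```

Against MySQL the same loop would go through a MySQL driver and its placeholder style instead of sqlite3, and it could sleep between batches if the pruning should stay in the background.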

If you want to stay in SQL, this would let you do one recipient at a time:

BEGIN;
SELECT @recipient := recipient
    FROM inbox
    GROUP BY recipient
    HAVING COUNT(*) > 30
    ORDER BY COUNT(*) DESC
    LIMIT 1
    FOR UPDATE;
SELECT @id := id
    FROM inbox
    WHERE recipient = @recipient
    ORDER BY id DESC
    LIMIT 30,1
    FOR UPDATE;
DELETE FROM inbox
    WHERE recipient = @recipient
      AND id <= @id;   -- "<=" removes the 31st row too, leaving exactly 30
COMMIT;

Then put that in a stored procedure with a loop around it. If you want to be even 'nicer' to other queries, add a SELECT SLEEP(1); inside the loop but outside the transaction.

(Trying to do the entire thing in a single statement makes my brain hurt.)

Sql-server – Select only the most recent record

This might work. Basically, it uses the ROW_NUMBER function, for which you will have to identify a key (you mentioned InvoiceNumber). Once you do, it numbers the rows within each key, so all your "duplicates" get a value of 2 or higher. Adding a WHERE clause with ROWNUM = 1 then returns only the first record for each key (the most recent one, since the rows are ordered by the created date, descending).

SELECT main.* FROM
(
SELECT
Upper(WFI.COMPANYID) as Company,
WFI.USERID as UID,
INF.NAME as Approver,
CONVERT(varchar(10),duedatetime,4) as Due,
right(left(Document,20),10) as InvoiceDate,
SUBSTRING(DOCUMENT,22,charindex(' ',RIGHT(DOCUMENT,len(DOCUMENT)-21))) as InvoiceNumber,
Right(document,len(Document)-(20+charindex(' ',RIGHT(DOCUMENT,len(DOCUMENT)-21)))) as Vendor,
wfi.CREATEDDATETIME as Created,
--Added RowNumber Function Below
ROW_NUMBER() OVER (PARTITION BY InsertYourKeyToAUniqueRecordHere ORDER BY wfi.CREATEDDATETIME DESC) AS ROWNUM

FROM
[TEST].[tst].[WORKFLOWWORKITEMTABLE] WFI
INNER JOIN [TEST].[tst].[Workflowtrackingstatustable] WFS ON WFI.CORRELATIONID=WFS.CORRELATIONID
INNER JOIN [TEST].[tst].[HCMWORKER] HCM on WFI.USERID=HCM.PERSONNELNUMBER
INNER JOIN [TEST].[tst].[DIRPERSONNAME] DPN ON DPN.PERSON=HCM.PERSON
INNER JOIN [TEST].[tst].[LEDGERJOURNALTABLE] LJT ON WFI.REFRECID = LJT.RECID
INNER JOIN [TEST].[tst].[USERINFO] INF ON WFI.USERID = INF.ID

WHERE
DATASOURCENAME Like 'Ledgerjourna%'
AND Datediff(day,Duedatetime,getdate())>3
AND WFS.DOCUMENTTYPE='Special'

)main

WHERE main.ROWNUM =1 --Add this clause to only return the first record
ORDER BY main.Company asc

Feel free to comment out the WHERE main.ROWNUM=1 clause so that you can see your Rownum in action.

If you don't want the ROWNUM column to show in your final result set, just replace the outer SELECT main.* with the actual columns you want to select (using their aliases).
