PostgreSQL – Get Last 5 Distinct Values for Each ID

distinctgreatest-n-per-grouppostgresql

I'm working with PostgreSQL 9.4.

I have a table that contains the following entries:

 id | postcode | date_created
 ---+----------+-----------------
 14 | al2 2qp  | 2015-09-23 14:46:57
 14 | al2 2qp  | 2015-09-23 14:51:07
 14 | sp2 8ag  | 2015-09-23 14:56:11
 14 | se4      | 2015-09-23 16:12:05
 17 | e2       | 2015-09-23 16:15:35
 17 | fk20 8ru | 2015-09-23 16:28:35
 17 | fk20 8ru | 2015-09-23 16:35:51
 17 | se2      | 2015-09-23 16:36:17
 17 | fk20 8ru | 2015-09-23 16:36:22
 17 | fk20 8ru | 2015-09-23 16:37:04
 17 | se1      | 2015-09-23 16:37:11
 17 | fk20 8ru | 2015-09-23 16:37:15
 17 | se1 8ga  | 2015-09-24 09:52:46
 17 | se1      | 2015-09-24 10:01:19
 17 | hp27 9rz | 2015-09-24 10:05:27
 17 | hp27 9rz | 2015-09-24 10:05:29
 17 | se1      | 2015-09-24 10:19:46
 14 | tn21 8qb | 2015-09-24 14:49:05
 14 | tn21 8qb | 2015-09-24 15:42:45
 14 | tn21 8qb | 2015-09-24 17:38:06
 14 | n4 1ny   | 2015-09-25 14:49:10

What I want to achieve is a query that returns the 5 most recent unique postcode records for each id:

 id | postcode
 ---+---------
 14 | n4 1ny
 14 | tn21 8qb
 14 | se4
 14 | sp2 8ag
 14 | al2 2qp
 17 | se1
 17 | hp27 9rz
 17 | se1 8ga
 17 | fk20 8ru
 17 | se2

What would be the best way of achieving this? I've been playing around with subqueries but keep hitting walls when it comes to ordering them whilst doing a DISTINCT and GROUP BY.

Best Answer

There are probably many ways to do this. The first that comes to mind is to use window functions:

SELECT 
    id, postcode
FROM
  ( SELECT id, postcode, 
           ROW_NUMBER() OVER (PARTITION BY id
                              ORDER BY MAX(date_created) DESC
                             ) AS rn
    FROM tablename
    GROUP BY id, postcode
  ) AS t
WHERE
    rn <= 5
ORDER BY 
    id, rn ;

Test at SQLfiddle.

If there are ties, say the 5th, 6th and 7th postcode for an id have the same date_created, only one of them (choice will be arbitrary) will be in the results. If you want all the tied postcodes in those cases, use RANK() instead of ROW_NUMBER().


Another option is to use the LATERAL syntax. I'm not sure which will be more efficient, it wil probably depend on the values distribution of the two columns (id and postcode), i.e. how many distinct ids in the whole table, how many distinct postcodes per id and how many rows per (id, postcode) combinations.

SELECT 
    t.id, ti.postcode
FROM
    ( SELECT DISTINCT id
      FROM tablename
    ) AS t
  CROSS JOIN LATERAL
    ( SELECT tt.postcode,
             MAX(tt.date_created) AS date_created
      FROM tablename AS tt
      WHERE tt.id = t.id
      GROUP BY tt.postcode
      ORDER BY date_created DESC
      LIMIT 5
    ) AS ti 
ORDER BY 
    t.id, ti.date_created DESC;

Adding an index on (id, postcode, date_created) would be a good idea, too - or on (id, postcode, date_created DESC).