PostgreSQL – Get Last 5 Distinct Values for Each ID

Tags: distinct, greatest-n-per-group, postgresql

I'm working with PostgreSQL 9.4.

I have a table that contains the following entries:

 id | postcode | date_created
 ---+----------+-----------------
 14 | al2 2qp  | 2015-09-23 14:46:57
 14 | al2 2qp  | 2015-09-23 14:51:07
 14 | sp2 8ag  | 2015-09-23 14:56:11
 14 | se4      | 2015-09-23 16:12:05
 17 | e2       | 2015-09-23 16:15:35
 17 | fk20 8ru | 2015-09-23 16:28:35
 17 | fk20 8ru | 2015-09-23 16:35:51
 17 | se2      | 2015-09-23 16:36:17
 17 | fk20 8ru | 2015-09-23 16:36:22
 17 | fk20 8ru | 2015-09-23 16:37:04
 17 | se1      | 2015-09-23 16:37:11
 17 | fk20 8ru | 2015-09-23 16:37:15
 17 | se1 8ga  | 2015-09-24 09:52:46
 17 | se1      | 2015-09-24 10:01:19
 17 | hp27 9rz | 2015-09-24 10:05:27
 17 | hp27 9rz | 2015-09-24 10:05:29
 17 | se1      | 2015-09-24 10:19:46
 14 | tn21 8qb | 2015-09-24 14:49:05
 14 | tn21 8qb | 2015-09-24 15:42:45
 14 | tn21 8qb | 2015-09-24 17:38:06
 14 | n4 1ny   | 2015-09-25 14:49:10

What I want to achieve is a query that returns the 5 most recent unique postcode records for each id:

 id | postcode
 ---+---------
 14 | n4 1ny
 14 | tn21 8qb
 14 | se4
 14 | sp2 8ag
 14 | al2 2qp
 17 | se1
 17 | hp27 9rz
 17 | se1 8ga
 17 | fk20 8ru
 17 | se2

What would be the best way of achieving this? I've been playing around with subqueries but keep hitting walls when it comes to ordering them whilst doing a DISTINCT and GROUP BY.

Best Answer

There are probably many ways to do this. The first that comes to mind is to use window functions:

SELECT 
    id, postcode
FROM
  ( SELECT id, postcode, 
           ROW_NUMBER() OVER (PARTITION BY id
                              ORDER BY MAX(date_created) DESC
                             ) AS rn
    FROM tablename
    GROUP BY id, postcode
  ) AS t
WHERE
    rn <= 5
ORDER BY 
    id, rn ;

If there are ties, say the 5th, 6th and 7th postcodes for an id share the same latest date_created, only one of them (chosen arbitrarily) will appear in the results. If you want all the tied postcodes in those cases, use RANK() instead of ROW_NUMBER().
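
As a rough sketch of the RANK() variant, using the same placeholder tablename and columns as the query above, only the window function changes. The alias rnk is just a name picked for this example. With ties at the cut-off, an id may return more than 5 rows:

```sql
-- Sketch: keep every postcode that ties on the cut-off position.
SELECT
    id, postcode
FROM
  ( SELECT id, postcode,
           RANK() OVER (PARTITION BY id
                        ORDER BY MAX(date_created) DESC
                       ) AS rnk
    FROM tablename
    GROUP BY id, postcode
  ) AS t
WHERE
    rnk <= 5
ORDER BY
    id, rnk ;
```

If you would rather keep exactly 5 rows per id but make the choice repeatable, one option is to stay with ROW_NUMBER() and add postcode as a second sort key in the window's ORDER BY.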

Another option is to use the LATERAL syntax. I'm not sure which will be more efficient; it will probably depend on the value distribution of the two columns (id and postcode), i.e. how many distinct ids there are in the whole table, how many distinct postcodes per id, and how many rows per (id, postcode) combination.

SELECT 
    t.id, ti.postcode
FROM
    ( SELECT DISTINCT id
      FROM tablename
    ) AS t
  CROSS JOIN LATERAL
    ( SELECT tt.postcode,
             MAX(tt.date_created) AS date_created
      FROM tablename AS tt
      WHERE tt.id = t.id
      GROUP BY tt.postcode
      ORDER BY date_created DESC
      LIMIT 5
    ) AS ti 
ORDER BY 
    t.id, ti.date_created DESC;

Adding an index on (id, postcode, date_created) would be a good idea, too - or on (id, postcode, date_created DESC).
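
A minimal sketch of the second index variant follows. The index name is made up for illustration, and tablename is the same placeholder used in the queries above:

```sql
-- Hypothetical index name; columns match the GROUP BY / MAX() pattern above.
CREATE INDEX tablename_id_postcode_date_idx
    ON tablename (id, postcode, date_created DESC);
```

Whether the planner actually uses it depends on the data, so it is worth comparing both queries with EXPLAIN ANALYZE before and after creating the index.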

Related Solutions

PostgreSQL SELECT – How to SELECT n ON (Like DISTINCT ON, But More Than One of Each)

You can number the rows within each us_state, ordered by id, using the ROW_NUMBER() window function and keep only the first n rows:

SELECT * 
FROM (
  SELECT *
    , ROW_NUMBER() OVER(PARTITION BY us_state ORDER BY id) as n
  FROM data
) as ord
WHERE n <= 2
ORDER BY us_state
;

Or you can use CROSS JOIN LATERAL with a correlated subquery:

SELECT l.*
FROM (
  SELECT DISTINCT us_state FROM data
) as s
CROSS JOIN LATERAL (
  SELECT * 
  FROM data d
  WHERE d.us_state = s.us_state
  ORDER BY id
  LIMIT 2
) as l
ORDER BY l.us_state
;

I used a small sample of 1 to 4 rows per state, so I limited the result to 2 rows per state. I ordered by id, but you can change that and order by whatever works best for you.

Output with my small sample:

       id | us_state | n
      123 |       AL | 1 
      456 |       AL | 2 
 56781234 |       AZ | 1 
 78123456 |       CO | 1 
 45678123 |       FL | 1 
 81234567 |       FL | 2 
 34567812 |       GA | 1 
      123 |       MA | 1 
      456 |       MA | 2

Note that n is the result of ROW_NUMBER() and doesn't exist in the second query. On a big table, an index on the partition (us_state) and order (id here) columns will help.
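
To illustrate both points, here is a hedged sketch against the sample data table: an index on (us_state, id), whose name is invented for this example, and a variation of the LATERAL query that also produces n by numbering the rows inside the subquery.

```sql
-- Hypothetical index name; covers the partition and ordering columns.
CREATE INDEX data_us_state_id_idx ON data (us_state, id);

-- LATERAL variant that also returns n: within each state's subquery
-- the rows are numbered by id before LIMIT is applied.
SELECT l.*
FROM (
  SELECT DISTINCT us_state FROM data
) AS s
CROSS JOIN LATERAL (
  SELECT d.*
    , ROW_NUMBER() OVER (ORDER BY d.id) AS n
  FROM data d
  WHERE d.us_state = s.us_state
  ORDER BY d.id
  LIMIT 2
) AS l
ORDER BY l.us_state, l.n
;
```

Sorting the outer query by l.n as well keeps the rows of each state in a stable order.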

Sample used:

CREATE TABLE data
    ("id" int, "us_state" varchar(2))
;

INSERT INTO data
    ("id", "us_state")
VALUES
    (12345678, 'MA'),
    (123, 'MA'),
    (456, 'MA'),
    (23456781, 'AL'),
    (123, 'AL'),
    (456, 'AL'),
    (34567812, 'GA'),
    (45678123, 'FL'),
    (56781234, 'AZ'),
    (67812345, 'MA'),
    (78123456, 'CO'),
    (81234567, 'FL')
;

SQL Server – Get Last Two IDs from Rows for the Last Two Days

You can use the DENSE_RANK() function to get the rows for the last two terminalid values:

select terminalid, date, toolid, toolname, workcenterid
from (
    select terminalid, date, toolid, toolname, workcenterid
         , dense_rank() over ( order by terminalid desc ) as rnk
    from T
    where date <= now() -- DBMS-dependent: e.g. GETDATE() in SQL Server
) as X
where rnk <= 2;

Note that date is a reserved word, so it is wise to give the column a different name. Besides, it is a date of some kind (hard to tell from the question), so a more descriptive name helps: say transaction_date, for lack of a better understanding of your domain.
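
As a sketch only, this is how the query might look if the column were renamed. The name transaction_date is just the example suggested above, not something taken from the actual schema, and CURRENT_TIMESTAMP is used because both SQL Server and PostgreSQL accept it:

```sql
-- Assumes the date column has been renamed to transaction_date (hypothetical name).
SELECT terminalid, transaction_date, toolid, toolname, workcenterid
FROM (
    SELECT terminalid, transaction_date, toolid, toolname, workcenterid
         , DENSE_RANK() OVER (ORDER BY terminalid DESC) AS rnk
    FROM T
    WHERE transaction_date <= CURRENT_TIMESTAMP  -- standard SQL spelling of "now"
) AS X
WHERE rnk <= 2;
```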
