Postgresql – Capture row with highest value from each percentile

greatest-n-per-groupjoin;postgresqlpostgresql-performancetable

So far I've tried doing this:

SELECT MAX(buckets.position_id), NTILE
FROM  (
   SELECT position.position_id,NTILE(100) OVER (ORDER BY position.position_id)
   FROM position
   WHERE activity_id = 8
   ) AS buckets
GROUP BY  NTILE ORDER BY 2;

Which outputs something like this:

 max | ntile 
-----+-------
   3 |     1
   6 |     2
   9 |     3
  12 |     4
  15 |     5
  18 |     6
  21 |     7
  24 |     8
  27 |     9
  30 |    10
  33 |    11
  36 |    12

But this just shows the id along with the percentile of that id in the row, I need to show the complete row every time, so I tried doing this:

SELECT MAX(buckets.position_id), NTILE
FROM  (
   SELECT position.position_id,NTILE(100) OVER (ORDER BY position.position_id)
   FROM position
   WHERE activity_id = 8
   ) AS buckets
FULL JOIN position ON position.position_id = buckets.position_id
GROUP BY  NTILE ORDER BY 2;

Which shows the same output.

SELECT MAX(buckets.position_id), buckets, buckets.ntile
FROM  (
   SELECT position.position_id, position.lat, position.long, position.activity_id, position.position_created_at , NTILE(100) OVER (ORDER BY position.position_id)
   FROM position
   WHERE activity_id = 8
   ) AS buckets 
GROUP BY buckets, buckets.ntile
ORDER BY 3;

This outputs something like this, which is what I want, but it's repeating the NTILE instead of just getting the last one.

 max |                                       buckets                                       | ntile 
-----+-------------------------------------------------------------------------------------+-------
   1 | (1,20,90,12,12,8,"2021-03-12 21:07:47.8646",1)                                      |     1
   2 | (2,20,90,2,12,8,"2021-03-12 21:08:57.59163",1)                                      |     1
   3 | (3,20,90,2,12,8,"2021-03-12 21:09:21.355993",1)                                     |     1
   4 | (4,20,90,2,12,8,"2021-03-12 21:09:36.752904",2)                                     |     2
   5 | (5,20,90,2,12,8,"2021-03-12 21:09:45.815941",2)                                     |     2
   6 | (6,20,90,2.001,12.001,8,"2021-03-15 12:11:36.320269",2)                             |     2
   7 | (7,20,90,2.001,12.001,8,"2021-03-15 12:11:38.876406",3)                             |     3
   8 | (8,20,90,2.002,12.001999999999999,8,"2021-03-15 12:11:41.911301",3)                 |     3
   9 | (9,20,90,2.002,12.001999999999999,8,"2021-03-15 12:11:44.938797",3)                 |     3
  10 | (10,20,90,2.0029999999999997,12.002999999999998,8,"2021-03-15 12:11:47.975407",4)   |     4
  11 | (11,20,90,2.0029999999999997,12.002999999999998,8,"2021-03-15 12:11:51.003433",4)   |     4
  12 | (12,20,90,2.0039999999999996,12.003999999999998,8,"2021-03-15 12:11:54.035314",4)   |     4
  13 | (13,20,90,2.0039999999999996,12.003999999999998,8,"2021-03-15 12:11:57.067232",5)   |     5
  14 | (14,20,90,2.0049999999999994,12.004999999999997,8,"2021-03-15 12:12:00.099153",5)   |     5
  15 | (15,20,90,2.0049999999999994,12.004999999999997,8,"2021-03-15 12:12:03.131785",5)   |     5
  16 | (16,20,90,2.0059999999999993,12.005999999999997,8,"2021-03-15 12:12:06.171215",6)   |     6
  17 | (17,20,90,2.0059999999999993,12.005999999999997,8,"2021-03-15 12:12:09.200098",6)   |     6

But since those values change every row, the Max function messes up and it just shows every row instead of the last percentile…

I also tried wrapping everything inside MAX() functions, but I'm afraid that when I get a lower value from either lat or long it's going to mess up. (Also I don't think that's very efficient, correct me if I'm wrong.)

DB-fiddle for testing
https://www.db-fiddle.com/f/gKoqqmRa47g5546EMz5Hs/0

Best Answer

If I understand correctly your objective is this:

After ordering by position_id, get the row with the highest position_id for each of 100 percentiles.

DISTINCT ON does it:

SELECT DISTINCT ON (ntile)
       NTILE(100) OVER (ORDER BY position_id) AS ntile, *
FROM   position
WHERE  activity_id = 1
ORDER  BY ntile, position_id DESC;

db<>fiddle here

Since the PK is on (position_id, activity_id), position_id is unique for each activity_id and we need no tiebreaker.

Can be a single SELECT without subquery because DISTINCT ON is applied after window functions.

See: