PostgreSQL – Select Row at the Nth Percentile

postgresqlwindow functions

I have two tables, table1 and table2. Let the two tables contain date, id and latency column.

I have a simple query that performs a join on the two tables and returns a set of rows:

Select table1.date,(table2.latency - table1.latency) as ans from table1, table2
where table1.id = table2.id order by ans;

I need to find the nth percentile row from the returned set of rows, lets say I need to find 90%, 99% and 99.9% percentile row from the data.

I need to display the data in a form like this:

    date       |   percentile  | ans
    01-12-1995 |    90         | 0.001563
    02-12-1999 |    99         | 0.0015
    05-12-2000 |    99.9       | 0.012

This is my first encounter to PostgreSQL. I am confused as to how should I proceed.

I was having a look at PERCENT_RANK() function. Please guide me in the correct direction.

Best Answer

Use the window function ntile() in a subquery (requires Postgres 8.4 or later).
Then select the segments you are interested in (corresponding to percentiles) and pick the row with the lowest value from it:

SELECT DISTINCT ON (segment)
       the_date, to_char((segment - 1)/ 10.0, '99.9') AS percentile, ans
FROM  (
    SELECT t1.the_date 
          ,ntile(1000) OVER (ORDER BY (t2.latency - t1.latency)) AS segment
          ,(t2.latency - t1.latency) AS ans
    FROM   table1 t1
    JOIN   table2 t2 ON t1.id = t2.id
   ) sub
WHERE  segment IN (601, 901, 991, 1000)
ORDER  BY segment, ans;

The Postgres-specific DISTINCT ON comes in handy for the last step. Detailed explanation in this related answer on SO:
Select first row in each GROUP BY group?

To get the 90, 99 and 99.9 percentile I picked the matching granularity with ntile(1000). And added a 60 percentile as per comment.

This algorithm picks the row at or above the exact value. You can add a line to the subquery with percent_rank() to get the exact relative rank of the select row in addition:

 percent_rank() OVER (ORDER BY (t2.latency - t1.latency)) AS pct_rank

Aside: I replaced the column name date with the_date since I am in the habbit of avoiding reserved SQL key words as identifiers, even if Postgres would permit them.