PostgreSQL – Select Row at the Nth Percentile

postgresqlwindow functions

I have two tables, table1 and table2. Let the two tables contain date, id and latency column.

I have a simple query that performs a join on the two tables and returns a set of rows:

Select table1.date,(table2.latency - table1.latency) as ans from table1, table2
where table1.id = table2.id order by ans;

I need to find the nth percentile row from the returned set of rows, lets say I need to find 90%, 99% and 99.9% percentile row from the data.

I need to display the data in a form like this:

    date       |   percentile  | ans
    01-12-1995 |    90         | 0.001563
    02-12-1999 |    99         | 0.0015
    05-12-2000 |    99.9       | 0.012

This is my first encounter to PostgreSQL. I am confused as to how should I proceed.

I was having a look at PERCENT_RANK() function. Please guide me in the correct direction.

Best Answer

Use the window function ntile() in a subquery (requires Postgres 8.4 or later).
Then select the segments you are interested in (corresponding to percentiles) and pick the row with the lowest value from it:

SELECT DISTINCT ON (segment)
       the_date, to_char((segment - 1)/ 10.0, '99.9') AS percentile, ans
FROM  (
    SELECT t1.the_date 
          ,ntile(1000) OVER (ORDER BY (t2.latency - t1.latency)) AS segment
          ,(t2.latency - t1.latency) AS ans
    FROM   table1 t1
    JOIN   table2 t2 ON t1.id = t2.id
   ) sub
WHERE  segment IN (601, 901, 991, 1000)
ORDER  BY segment, ans;

The Postgres-specific DISTINCT ON comes in handy for the last step. Detailed explanation in this related answer on SO:
Select first row in each GROUP BY group?

To get the 90, 99 and 99.9 percentile I picked the matching granularity with ntile(1000). And added a 60 percentile as per comment.

This algorithm picks the row at or above the exact value. You can add a line to the subquery with percent_rank() to get the exact relative rank of the select row in addition:

 percent_rank() OVER (ORDER BY (t2.latency - t1.latency)) AS pct_rank

Aside: I replaced the column name date with the_date since I am in the habbit of avoiding reserved SQL key words as identifiers, even if Postgres would permit them.

Related Solutions

Postgresql – Retrieving data from inner join using LIMIT and OFFSET

Use a subquery (as displayed) or CTE for that purpose:

SELECT *
FROM  (
   SELECT qid, gid
   FROM   table1
   ORDER  BY date DESC
   LIMIT  10
   OFFSET ?
   ) q
JOIN   table2 a USING (qid, gid)

USING (qid, gid) is just a shortcut for ON q.qid = a.qid AND q.gid = a.gid with the side effect that the two columns are only included once in the result.

Postgresql – Postgres not using the index even when rows returned is 5% of the table

From your query plans, it looks like you're comparing ints to ints in the first query plan, and int to numeric in the second plan.

Your first compare:

Index Cond: (("timestamp" >= 1431100800) AND ("timestamp" <= 1431108000))

and

timestamp >= 1431100800 and timestamp <= 1431108000

In the second query, it's numeric values:

Filter: ((numvalues[1] IS NOT NULL) AND (("timestamp")::numeric >= 1431100800.00) AND (("timestamp")::numeric <= 1431108000.00))

and

timestamp >= 1431093600.00 and timestamp <= 1431100800.00

Casting to numeric causes the index to be ignored in favor of a sequential scan.

You can see this with a very simple example, set up below:

CREATE TABLE t2 (a int);
CREATE INDEX t2_a_idx ON t2(a);
INSERT INTO t2 (a) SELECT i FROM generate_series(1,1000000) AS i;
VACUUM ANALYZE VERBOSE t2;

My first query plan looks like this:

EXPLAIN ANALYZE SELECT * FROM t2 WHERE a > 750000;

Index Only Scan using t2_a_idx on t2 (cost=0.42..7134.65 rows=250413 width=4) 
(actual time=0.019..29.926 rows=250000 loops=1)
Index Cond: (a > 750000)
Heap Fetches: 0
Planning time: 0.137 ms
Execution time: 39.114 ms
(5 rows)
Time: 39.540 ms

While a second query using numerics looks like this:

EXPLAIN ANALYZE SELECT * FROM t2 WHERE a > 750000.00;

Seq Scan on t2  (cost=0.00..19425.00 rows=333333 width=4) (actual time=122.803..175.326 rows=250000 loops=1)
Filter: ((a)::numeric > 750000.00)
Rows Removed by Filter: 750000
Planning time: 0.058 ms
Execution time: 184.194 ms
(5 rows)
Time: 184.487 ms

In the second instance here, the index is ignored in favor of a sequential scan because of the cast to a numeric value, which looks like exactly what's happening in your two examples.

One last aside, you might be able to speed your query up via a SET query before executing it as well:

SET work_mem = 2GB;

If your server can handle it, because your sorts are spilling to disk, as noted in this line from your query plan:

Sort Method: external merge  Disk: 1387704kB

Hope this helps. =)

Best Answer

Related Solutions

Postgresql – Retrieving data from inner join using LIMIT and OFFSET

Postgresql – Postgres not using the index even when rows returned is 5% of the table

Related Question