Scaling ntile(N) Over a Set Where M < N

Env: Postgres 9.4.4 on Mac OS X

Env: Postgres 9.3.6 on Heroku

I have a "large" dataset (75k) containing multiple groups (30), and for each group I am generating box plot percentiles. In most cases, I have more data points (M) per group than 100 (N), which allows me to get relatively accurate percentiles using NTILE(100).

However, in some cases I have fewer than 100. My understanding of NTILE(N) is that it assigns bucket numbers over the range [1, N], dividing the rows as evenly as possible; when the row count doesn't divide evenly, the lower-numbered buckets each receive one extra row. This is easier to see with an example, such as this one:

with 
  v as (
    select * from (values 
      (1),(2),(3),(4),(5),(6),(7),(8),(9),
      (10),(11),(12),(13),(14),(15),(16),(17),(18),(19),
      (20),(21),(22),(23),(24),(25),(26),(27),(28),(29),
      (30),(31),(32),(33),(34),(35),(36),(37),(38),(39),
      (40),(41),(42),(43),(44),(45),(46),(47),(48),(49),
      (50),(51),(52),(53),(54),(55),(56),(57),(58),(59),
      (60),(61),(62),(63),(64),(65),(66),(67),(68),(69),
      (70),(71),(72),(73),(74),(75),(76),(77),(78),(79),
      (80),(81),(82),(83),(84),(85),(86),(87),(88),(89)
    ) as v 
    order by column1 asc
  ),
  p as ( 
    select 
      column1, 
      ntile(4) OVER (ORDER BY column1 ASC) as n4,
      ntile(40) OVER (ORDER BY column1 ASC) as n40,
      ntile(100) OVER (ORDER BY column1 ASC) as n100,
      (select count(*) from v) as pcount
    from v
  )

select * from p;

 column1 | n4 | n40 | n100 | pcount 
---------+----+-----+------+--------
       1 |  1 |   1 |    1 |     89
       2 |  1 |   1 |    2 |     89
       3 |  1 |   1 |    3 |     89
       4 |  1 |   2 |    4 |     89
       5 |  1 |   2 |    5 |     89
       6 |  1 |   2 |    6 |     89
       7 |  1 |   3 |    7 |     89
       8 |  1 |   3 |    8 |     89
       9 |  1 |   3 |    9 |     89
      10 |  1 |   4 |   10 |     89
      11 |  1 |   4 |   11 |     89
      12 |  1 |   4 |   12 |     89
      13 |  1 |   5 |   13 |     89
      14 |  1 |   5 |   14 |     89
      15 |  1 |   5 |   15 |     89
      16 |  1 |   6 |   16 |     89
      17 |  1 |   6 |   17 |     89
      18 |  1 |   6 |   18 |     89
      19 |  1 |   7 |   19 |     89
      20 |  1 |   7 |   20 |     89
      21 |  1 |   7 |   21 |     89
      22 |  1 |   8 |   22 |     89
      23 |  1 |   8 |   23 |     89
      24 |  2 |   8 |   24 |     89
      25 |  2 |   9 |   25 |     89
      26 |  2 |   9 |   26 |     89
      27 |  2 |   9 |   27 |     89
      28 |  2 |  10 |   28 |     89
      29 |  2 |  10 |   29 |     89
      30 |  2 |  11 |   30 |     89
      [...]
      85 |  4 |  38 |   85 |     89
      86 |  4 |  39 |   86 |     89
      87 |  4 |  39 |   87 |     89
      88 |  4 |  40 |   88 |     89
      89 |  4 |  40 |   89 |     89

Note how for n40, the groups with n < 10 each have 3 rows while the groups for n >= 10 have 2 rows.
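That distribution follows directly from integer division: with M rows and N buckets, the first M mod N buckets hold one extra row. A quick plain-Python sketch of that rule (the function name is mine, not anything from the query):

```python
def ntile_sizes(m, n):
    """Bucket sizes produced by ntile(n) over m ordered rows:
    the first m % n buckets each hold one extra row. With m < n,
    buckets beyond m come out empty, i.e. those numbers never appear."""
    base, extra = divmod(m, n)
    return [base + 1] * extra + [base] * (n - extra)

sizes = ntile_sizes(89, 40)   # buckets 1-9 hold 3 rows, buckets 10-40 hold 2
```

With m = 5 and n = 100, the result is five buckets of one row followed by 95 empty buckets, which is exactly the "gaps" problem described below.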

When I have M > 100 data points, I search for a given percentile (say the 50th) with a nested query like this one, and it's a reasonable approximation:

select column1 
from p 
where n100 >= 50 
order by n100 
limit 1

When M >> N, it's more accurate. When M < N, I get gaps, so I looked at stretching the bucket numbers to cover the full range:

select column1 
from p 
where n100 >= round(0.5 * (select max(pcount) from p))
order by n100 
limit 1

This works, and I could also use it when M > N, though when M >> N the gain in accuracy isn't worth the extra complexity. Still, it feels like I'm re-implementing something that likely would/should already exist.
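For reference, the stretch can be mimicked outside SQL; a minimal Python sketch, assuming M < 100 so that n100 degenerates to the row number (function and variable names are mine):

```python
import math

def stretched_percentile(values, frac):
    """Pick the first row whose position reaches round(frac * M),
    mirroring the scaled-threshold query above for M < 100 rows,
    where ntile(100) assigns one row per bucket (n100 == row number)."""
    ordered = sorted(values)
    m = len(ordered)
    threshold = math.floor(frac * m + 0.5)  # SQL round(): halves away from zero
    return ordered[threshold - 1]

stretched_percentile(range(1, 90), 0.5)   # row 45 of 89 -> 45
```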

Is there a better approach to getting this information? My real-world query (not this example) is currently running in around 6 seconds, and I'd like to get it running faster. Using EXPLAIN ANALYZE, a big part of the time is spent looping for each group over the nested select statements.

And if there's nothing faster, is there a more succinct way of expressing these lookups? I want to produce one row (per group) with the percentile values to create a boxplot (p25, p50, p75).

Ideally, I wouldn't have to install a stats package as I'd love to keep this running on Heroku, which has the following extensions: btree_gin, btree_gist, chkpass, citext, cube, dblink, dict_int, earthdistance, fuzzystrmatch, hstore, intarray, isn, ltree, pg_stat_statements, pg_trgm, pgcrypto, pgrowlocks, pgstattuple, plpgsql, plv8, postgis, postgis_topology, postgres_fdw, tablefunc, unaccent, uuid-ossp

Full query plan

Here's the full query plan for my real-world query (75k rows, 30 groups).

http://explain.depesz.com/s/RDP

Best Answer

Starting with PostgreSQL 9.4 there are the standard-SQL ordered-set aggregate functions percentile_disc and percentile_cont:

SELECT
  percentile_cont(0.25) WITHIN GROUP (ORDER BY column1) AS p25,
  percentile_cont(0.5 ) WITHIN GROUP (ORDER BY column1) AS p50,
  percentile_cont(0.75) WITHIN GROUP (ORDER BY column1) AS p75
FROM v;  -- add a GROUP BY to get one row per group

If you can't upgrade your Heroku instance, you can rewrite it yourself. I did that a few years ago on Teradata; apart from the proprietary QUALIFY clause, the syntax is similar.

According to SQL:2008, PERCENTILE_DISC(x) is the first value whose CUME_DIST is greater than or equal to x. This translates directly to the row where

ROW_NUMBER() OVER (PARTITION BY groupcol ORDER BY ordercol)
= CEILING(COUNT(*) OVER (PARTITION BY groupcol) * x)


with 
  v as (
    select * from (values 
      (1),(2),(3),(4),(5),(6),(7),(8),(9),
      (10),(11),(12),(13),(14),(15),(16),(17),(18),(19),
      (20),(21),(22),(23),(24),(25),(26),(27),(28),(29),
      (30),(31),(32),(33),(34),(35),(36),(37),(38),(39),
      (40),(41),(42),(43),(44),(45),(46),(47),(48),(49),
      (50),(51),(52),(53),(54),(55),(56),(57),(58),(59),
      (60),(61),(62),(63),(64),(65),(66),(67),(68),(69),
      (70),(71),(72),(73),(74),(75),(76),(77),(78),(79),
      (80),(81),(82),(83),(84),(85),(86),(87),(88),(89)
    ) as v 
  ),
  p as ( 
    select 
      column1, 
      row_number() OVER (ORDER BY column1 ASC) as rn,
      count(*) OVER () as cnt
    from v
  )
 select * 
 from p
 where rn in (CEILING(cnt * 0.25)
             ,CEILING(cnt * 0.5)
             ,CEILING(cnt * 0.75))
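As a sanity check of the CEILING rule, here is a minimal Python sketch of PERCENTILE_DISC over a single group (the function name is mine):

```python
import math

def percentile_disc(values, fractions):
    """PERCENTILE_DISC(x): the first ordered value whose cumulative
    distribution row_number/count reaches x, i.e. the row where
    row_number == ceil(count * x)."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[math.ceil(n * x) - 1] for x in fractions]

percentile_disc(range(1, 90), [0.25, 0.5, 0.75])   # rows 23, 45, 67
```

For the 89-row example above this picks rows 23, 45, and 67, matching the rn values the query returns.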

See Fiddle