Env: Postgres 9.4.4 on Mac OS X
Env: Postgres 9.3.6 on Heroku
I have a "large" dataset (75k rows) containing multiple groups (30), and for each group I am generating box plot percentiles. In most cases, I have more data points (M) per group than 100 (N), which allows me to get relatively accurate percentiles using NTILE(100).
However, in some cases I have fewer than 100. My understanding of NTILE(N) is that it assigns bucket numbers n as evenly as possible over the entire range [1, N]; when the rows don't divide evenly, the leftover rows go one each to the buckets with the smallest n. This is easier to understand with an example, such as this one:
with v as (
  select * from (values
    (1),(2),(3),(4),(5),(6),(7),(8),(9),
    (10),(11),(12),(13),(14),(15),(16),(17),(18),(19),
    (20),(21),(22),(23),(24),(25),(26),(27),(28),(29),
    (30),(31),(32),(33),(34),(35),(36),(37),(38),(39),
    (40),(41),(42),(43),(44),(45),(46),(47),(48),(49),
    (50),(51),(52),(53),(54),(55),(56),(57),(58),(59),
    (60),(61),(62),(63),(64),(65),(66),(67),(68),(69),
    (70),(71),(72),(73),(74),(75),(76),(77),(78),(79),
    (80),(81),(82),(83),(84),(85),(86),(87),(88),(89)
  ) as v
  order by column1 asc
),
p as (
  select
    column1,
    ntile(4)   over (order by column1 asc) as n4,
    ntile(40)  over (order by column1 asc) as n40,
    ntile(100) over (order by column1 asc) as n100,
    (select count(*) from v) as pcount
  from v
)
select * from p;
column1 | n4 | n40 | n100 | pcount
---------+----+-----+------+--------
1 | 1 | 1 | 1 | 89
2 | 1 | 1 | 2 | 89
3 | 1 | 1 | 3 | 89
4 | 1 | 2 | 4 | 89
5 | 1 | 2 | 5 | 89
6 | 1 | 2 | 6 | 89
7 | 1 | 3 | 7 | 89
8 | 1 | 3 | 8 | 89
9 | 1 | 3 | 9 | 89
10 | 1 | 4 | 10 | 89
11 | 1 | 4 | 11 | 89
12 | 1 | 4 | 12 | 89
13 | 1 | 5 | 13 | 89
14 | 1 | 5 | 14 | 89
15 | 1 | 5 | 15 | 89
16 | 1 | 6 | 16 | 89
17 | 1 | 6 | 17 | 89
18 | 1 | 6 | 18 | 89
19 | 1 | 7 | 19 | 89
20 | 1 | 7 | 20 | 89
21 | 1 | 7 | 21 | 89
22 | 1 | 8 | 22 | 89
23 | 1 | 8 | 23 | 89
24 | 2 | 8 | 24 | 89
25 | 2 | 9 | 25 | 89
26 | 2 | 9 | 26 | 89
27 | 2 | 9 | 27 | 89
28 | 2 | 10 | 28 | 89
29 | 2 | 10 | 29 | 89
30 | 2 | 11 | 30 | 89
[...]
85 | 4 | 38 | 85 | 89
86 | 4 | 39 | 86 | 89
87 | 4 | 39 | 87 | 89
88 | 4 | 40 | 88 | 89
89 | 4 | 40 | 89 | 89
Note how for n40, the buckets with n < 10 each have 3 rows while the buckets with n >= 10 have 2 rows (89 = 2 × 40 + 9, so the first 9 buckets each get one extra row).
When I have M > 100 data points, I'll search for a given percentile (say the 50th) using a nested query like this one, and it's a reasonable approximation:
select column1
from p
where n100 >= 50
order by n100
limit 1
When M >> N, it's more accurate. When M < N, however, I get gaps: ntile(100) only assigns buckets 1 through M, so some bucket numbers never appear (in the sample above, n100 is simply the row number, and n100 >= 50 returns 50 rather than 45, the true median of 89 values). So I looked at scaling the threshold down to the actual row count:
select column1
from p
where n100 >= round(0.5 * (select max(pcount) from p))
order by n100
limit 1
This works, and I could also use it when M > N, though if M >> N, there's not much of a gain in accuracy for the complexity. However, it feels like I'm re-implementing something that likely would/should exist.
Is there a better approach to getting this information? My real-world query (not this example) currently runs in around 6 seconds, and I'd like it to run faster. Using EXPLAIN ANALYZE, I can see that a big part of the time is spent looping over the nested select statements for each group.
And if there's nothing faster, is there a more succinct way of expressing these lookups? I want to produce one row (per group) with the percentile values to create a boxplot (p25, p50, p75).
Ideally, I wouldn't have to install a stats package as I'd love to keep this running on Heroku, which has the following extensions: btree_gin, btree_gist, chkpass, citext, cube, dblink, dict_int, earthdistance, fuzzystrmatch, hstore, intarray, isn, ltree, pg_stat_statements, pg_trgm, pgcrypto, pgrowlocks, pgstattuple, plpgsql, plv8, postgis, postgis_topology, postgres_fdw, tablefunc, unaccent, uuid-ossp
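For reference, this is the kind of single-pass shape I'm imagining (a sketch only; t(grp, val) stands in for my real table, and least(cnt, 100) caps the threshold so the same expression works whether a group has more or fewer than 100 rows):

```sql
-- one window pass: ntile buckets plus per-group counts,
-- then a conditional min() per percentile, one row per group
with p as (
  select grp, val,
         ntile(100) over (partition by grp order by val) as n100,
         count(*)   over (partition by grp)              as cnt
  from t
)
select grp,
       min(case when n100 >= round(0.25 * least(cnt, 100)) then val end) as p25,
       min(case when n100 >= round(0.50 * least(cnt, 100)) then val end) as p50,
       min(case when n100 >= round(0.75 * least(cnt, 100)) then val end) as p75
from p
group by grp;
```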
Full query plan
Here's the full query plan for my real-world query (75k rows, 30 groups).
Best Answer
Starting with PostgreSQL 9.4 there's a Standard SQL percentile_disc aggregate function.

If you can't upgrade on your Heroku instance you can rewrite it yourself. I did that a few years ago on Teradata; apart from the proprietary QUALIFY, it's similar syntax.

According to SQL:2008, PERCENTILE_DISC(x) is the first value with a CUME_DIST greater than or equal to x. This directly translates to the row with a row number of ceil(x * count(*)) within its partition.

See Fiddle
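Both variants can be sketched as follows (a sketch against a hypothetical table t(grp, val), not the original answer's code):

```sql
-- PostgreSQL 9.4+: ordered-set aggregate, one row per group
select grp,
       percentile_disc(0.25) within group (order by val) as p25,
       percentile_disc(0.50) within group (order by val) as p50,
       percentile_disc(0.75) within group (order by val) as p75
from t
group by grp;

-- Pre-9.4 rewrite using cume_dist(): the percentile is the first
-- value whose cumulative distribution reaches x, so a conditional
-- min() picks it out per group in a single window pass.
with ranked as (
  select grp, val,
         cume_dist() over (partition by grp order by val) as cd
  from t
)
select grp,
       min(case when cd >= 0.25 then val end) as p25,
       min(case when cd >= 0.50 then val end) as p50,
       min(case when cd >= 0.75 then val end) as p75
from ranked
group by grp;
```

The cume_dist() version runs on 9.3 and also avoids looping over nested selects per group.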