PostgreSQL – Query to Obtain Weighted Percentile

aggregatepostgresqlpostgresql-10

Trying generate SQL to compute a weighted continuous value at a given set of percentile values (the 25%,50%,and 75% levels used below, but solution should allow for an arbitrary parameter level). In other words, want to find the interpolated "raw" values, weighted by "cnt", at each of the 25%, 50% and 75% cumulative percentiles for the test data in "source" table below.

NB: cnt represents the number of times that the raw value occurred during the sampling period, and the expected output would weight the raw value by cnt to arrive at the percentile (akin to quantile/ median and similar statistics)

Test data: (Table: source)

|  site  |  dateval   |  raw  |   cnt   |
+--------+------------+-------+---------+
|   A    | 2019-01-05 |   45  |      14 |
|   A    | 2019-01-05 |   52  |     178 |
|   A    | 2019-01-05 |   45  |       9 |
|   A    | 2019-01-05 |   37  |      75 |
|   A    | 2019-01-05 |   23  |      98 |
|   A    | 2019-01-05 |   78  |     102 |
|   A    | 2019-01-05 |   56  |       9 |
|   A    | 2019-01-05 |   17  |      54 |
|   A    | 2019-01-05 |   56  |       8 |
|   A    | 2019-01-06 |   33  |      35 |
|   A    | 2019-01-06 |   67  |      45 |
|   A    | 2019-01-06 |   65  |      93 |
|   A    | 2019-01-06 |   89  |     113 |
|   A    | 2019-01-06 |   52  |      64 |
|   A    | 2019-01-06 |  101  |      12 |
|   B    | 2019-01-05 |    5  |      25 |
|   B    | 2019-01-05 |   16  |      48 |
|   B    | 2019-01-05 |   12  |     107 |
|   B    | 2019-01-05 |   25  |      78 |
|   B    | 2019-01-05 |   44  |      53 |
|   B    | 2019-01-05 |    8  |      12 |
|   B    | 2019-01-05 |   31  |      32 |
|   B    | 2019-01-06 |   34  |      87 |
|   B    | 2019-01-06 |   18  |      35 |
|   B    | 2019-01-06 |   51  |      17 |
|   B    | 2019-01-06 |   22  |      23 |
|   B    | 2019-01-06 |   14  |      52 |
|   B    | 2019-01-06 |    6  |      34 |
+--------+------------+-------+---------+

Expected output (rounded to nearest 1/100th):

|  site  |   dateval  |   p00   |   p25   |   p50   |   p75   |   p100  |
+--------+------------+---------+---------+---------+---------+---------+
|   A    | 2019-01-05 |   17.00 |   22.07 |   45.92 |   51.30 |   78.00 |
|   A    | 2019-01-06 |   33.00 |   49.48 |   63.46 |   73.72 |  101.00 |
|   B    | 2019-01-05 |    5.00 |    9.93 |   14.79 |   24.57 |   44.00 |
|   B    | 2019-01-06 |    6.00 |   10.31 |   18.52 |   27.79 |   51.00 |
+--------+------------+---------+---------+---------+---------+---------+

NB: The above results assume linear smoothing between raw values. For instance, the p25value of 22.07 = [ (25.00% - 54/547) / ((98+54)/547 - 54/547) ] * (23-17) + 17, where 547 = sum(cnt) | site='A' & dateval='2019-01-05'.

Current SQL

The below computes percentile values at discreet points, based on the "raw" values present in table "source." However, the desired output is the "raw" value that corresponds to a given percentile on a continuous basis (for simplicity, the interpolation between discreet "raw" levels is linear instead of splines/other). Frankly, not sure the following approach is the most appropriate path:

WITH raw_lvl AS (
  SELECT "site", "dateval", "raw", sum("cnt") AS "sumcnt"
  FROM   source
  GROUP BY "site", "dateval", "raw"
), cum_raw AS (
  SELECT tlr.*, sum(tlr."sumcnt") OVER "win_cr" AS "cumsumcnt"
  FROM raw_lvl AS "tlr"
  WINDOW "win_cr" AS (PARTITION BY tlr."site", tlr."dateval" ORDER BY tlr."raw" ASC)
)
SELECT cr.*, cr."cumsumcnt"/(sum(cr."sumcnt") OVER "win_pr") AS "percentile" 
FROM cum_raw AS cr
WINDOW "win_pr" AS (PARTITION BY cr."site", cr."dateval");

Postgres version 10.3

Best Answer

Postgres has Ordered-Set Aggregate Functions for your purpose.

The special difficulty: you want rows "weighted" by cnt. If that's supposed to mean that each row represents cnt identical rows, you can multiply input rows by joining to generate_series(1, cnt):

SELECT site, dateval
     , percentile_cont('{0,.25,.5,.75,1}'::float8[]) WITHIN GROUP (ORDER BY raw)
FROM   source s, generate_series(1, s.cnt)
GROUP  BY 1, 2;

db<>fiddle here

But results differ from your expected output (except for the 0 and 100 percentile). So you are "weighting" differently ...

Aside, your original query can be simplified to this equivalent:

SELECT site, dateval, raw, sum(cnt) AS sumcnt
     , sum(sum(cnt)) OVER w AS cumsumcnt
     , sum(sum(cnt)) OVER w / sum(sum(cnt)) OVER (PARTITION BY site, dateval) AS percentile 
FROM   source
GROUP  BY site, dateval, raw
WINDOW w AS (PARTITION BY site, dateval ORDER BY raw);

You can run a window function over the result of an aggregate function in the same SELECT (but not vice versa). See:

I added a demo to the fiddle above.

But neither explains the odd numbers in your "expected results". Those strike me as incorrect, no matter how you interpolate. Example: 22.07 in the first line for p25 does not seem to make sense - the value 23 occupies all rows up to the 27.7879 percentile after factoring in cnt according to your own query ...