Trying generate SQL to compute a weighted continuous value at a given set of percentile values (the 25%,50%,and 75% levels used below, but solution should allow for an arbitrary parameter level). In other words, want to find the interpolated "raw" values, weighted by "cnt", at each of the 25%, 50% and 75% cumulative percentiles for the test data in "source" table below.
NB: cnt
represents the number of times that the raw
value occurred during the sampling period, and the expected output would weight the raw
value by cnt
to arrive at the percentile (akin to quantile/ median and similar statistics)
Test data: (Table: source)
| site | dateval | raw | cnt |
+--------+------------+-------+---------+
| A | 2019-01-05 | 45 | 14 |
| A | 2019-01-05 | 52 | 178 |
| A | 2019-01-05 | 45 | 9 |
| A | 2019-01-05 | 37 | 75 |
| A | 2019-01-05 | 23 | 98 |
| A | 2019-01-05 | 78 | 102 |
| A | 2019-01-05 | 56 | 9 |
| A | 2019-01-05 | 17 | 54 |
| A | 2019-01-05 | 56 | 8 |
| A | 2019-01-06 | 33 | 35 |
| A | 2019-01-06 | 67 | 45 |
| A | 2019-01-06 | 65 | 93 |
| A | 2019-01-06 | 89 | 113 |
| A | 2019-01-06 | 52 | 64 |
| A | 2019-01-06 | 101 | 12 |
| B | 2019-01-05 | 5 | 25 |
| B | 2019-01-05 | 16 | 48 |
| B | 2019-01-05 | 12 | 107 |
| B | 2019-01-05 | 25 | 78 |
| B | 2019-01-05 | 44 | 53 |
| B | 2019-01-05 | 8 | 12 |
| B | 2019-01-05 | 31 | 32 |
| B | 2019-01-06 | 34 | 87 |
| B | 2019-01-06 | 18 | 35 |
| B | 2019-01-06 | 51 | 17 |
| B | 2019-01-06 | 22 | 23 |
| B | 2019-01-06 | 14 | 52 |
| B | 2019-01-06 | 6 | 34 |
+--------+------------+-------+---------+
Expected output (rounded to nearest 1/100th):
| site | dateval | p00 | p25 | p50 | p75 | p100 |
+--------+------------+---------+---------+---------+---------+---------+
| A | 2019-01-05 | 17.00 | 22.07 | 45.92 | 51.30 | 78.00 |
| A | 2019-01-06 | 33.00 | 49.48 | 63.46 | 73.72 | 101.00 |
| B | 2019-01-05 | 5.00 | 9.93 | 14.79 | 24.57 | 44.00 |
| B | 2019-01-06 | 6.00 | 10.31 | 18.52 | 27.79 | 51.00 |
+--------+------------+---------+---------+---------+---------+---------+
NB: The above results assume linear smoothing between raw
values. For instance, the p25
value of 22.07
=
[ (25.00% - 54/547) / ((98+54)/547 - 54/547) ] * (23-17) + 17
, where 547 = sum(cnt) | site='A' & dateval='2019-01-05'
.
Current SQL
The below computes percentile values at discreet points, based on the "raw" values present in table "source." However, the desired output is the "raw" value that corresponds to a given percentile on a continuous basis (for simplicity, the interpolation between discreet "raw" levels is linear instead of splines/other). Frankly, not sure the following approach is the most appropriate path:
WITH raw_lvl AS (
SELECT "site", "dateval", "raw", sum("cnt") AS "sumcnt"
FROM source
GROUP BY "site", "dateval", "raw"
), cum_raw AS (
SELECT tlr.*, sum(tlr."sumcnt") OVER "win_cr" AS "cumsumcnt"
FROM raw_lvl AS "tlr"
WINDOW "win_cr" AS (PARTITION BY tlr."site", tlr."dateval" ORDER BY tlr."raw" ASC)
)
SELECT cr.*, cr."cumsumcnt"/(sum(cr."sumcnt") OVER "win_pr") AS "percentile"
FROM cum_raw AS cr
WINDOW "win_pr" AS (PARTITION BY cr."site", cr."dateval");
Postgres version 10.3
Best Answer
Postgres has Ordered-Set Aggregate Functions for your purpose.
The special difficulty: you want rows "weighted" by
cnt
. If that's supposed to mean that each row representscnt
identical rows, you can multiply input rows by joining togenerate_series(1, cnt)
:db<>fiddle here
But results differ from your expected output (except for the 0 and 100 percentile). So you are "weighting" differently ...
Aside, your original query can be simplified to this equivalent:
You can run a window function over the result of an aggregate function in the same
SELECT
(but not vice versa). See:I added a demo to the fiddle above.
But neither explains the odd numbers in your "expected results". Those strike me as incorrect, no matter how you interpolate. Example:
22.07
in the first line forp25
does not seem to make sense - the value23
occupies all rows up to the27.7879
percentile after factoring incnt
according to your own query ...