PostgreSQL – Query to Obtain Weighted Percentile

aggregatepostgresqlpostgresql-10

Trying generate SQL to compute a weighted continuous value at a given set of percentile values (the 25%,50%,and 75% levels used below, but solution should allow for an arbitrary parameter level). In other words, want to find the interpolated "raw" values, weighted by "cnt", at each of the 25%, 50% and 75% cumulative percentiles for the test data in "source" table below.

NB: cnt represents the number of times that the raw value occurred during the sampling period, and the expected output would weight the raw value by cnt to arrive at the percentile (akin to quantile/ median and similar statistics)

Test data: (Table: source)

|  site  |  dateval   |  raw  |   cnt   |
+--------+------------+-------+---------+
|   A    | 2019-01-05 |   45  |      14 |
|   A    | 2019-01-05 |   52  |     178 |
|   A    | 2019-01-05 |   45  |       9 |
|   A    | 2019-01-05 |   37  |      75 |
|   A    | 2019-01-05 |   23  |      98 |
|   A    | 2019-01-05 |   78  |     102 |
|   A    | 2019-01-05 |   56  |       9 |
|   A    | 2019-01-05 |   17  |      54 |
|   A    | 2019-01-05 |   56  |       8 |
|   A    | 2019-01-06 |   33  |      35 |
|   A    | 2019-01-06 |   67  |      45 |
|   A    | 2019-01-06 |   65  |      93 |
|   A    | 2019-01-06 |   89  |     113 |
|   A    | 2019-01-06 |   52  |      64 |
|   A    | 2019-01-06 |  101  |      12 |
|   B    | 2019-01-05 |    5  |      25 |
|   B    | 2019-01-05 |   16  |      48 |
|   B    | 2019-01-05 |   12  |     107 |
|   B    | 2019-01-05 |   25  |      78 |
|   B    | 2019-01-05 |   44  |      53 |
|   B    | 2019-01-05 |    8  |      12 |
|   B    | 2019-01-05 |   31  |      32 |
|   B    | 2019-01-06 |   34  |      87 |
|   B    | 2019-01-06 |   18  |      35 |
|   B    | 2019-01-06 |   51  |      17 |
|   B    | 2019-01-06 |   22  |      23 |
|   B    | 2019-01-06 |   14  |      52 |
|   B    | 2019-01-06 |    6  |      34 |
+--------+------------+-------+---------+

Expected output (rounded to nearest 1/100th):

|  site  |   dateval  |   p00   |   p25   |   p50   |   p75   |   p100  |
+--------+------------+---------+---------+---------+---------+---------+
|   A    | 2019-01-05 |   17.00 |   22.07 |   45.92 |   51.30 |   78.00 |
|   A    | 2019-01-06 |   33.00 |   49.48 |   63.46 |   73.72 |  101.00 |
|   B    | 2019-01-05 |    5.00 |    9.93 |   14.79 |   24.57 |   44.00 |
|   B    | 2019-01-06 |    6.00 |   10.31 |   18.52 |   27.79 |   51.00 |
+--------+------------+---------+---------+---------+---------+---------+

NB: The above results assume linear smoothing between raw values. For instance, the p25value of 22.07 = [ (25.00% - 54/547) / ((98+54)/547 - 54/547) ] * (23-17) + 17, where 547 = sum(cnt) | site='A' & dateval='2019-01-05'.

Current SQL

The below computes percentile values at discreet points, based on the "raw" values present in table "source." However, the desired output is the "raw" value that corresponds to a given percentile on a continuous basis (for simplicity, the interpolation between discreet "raw" levels is linear instead of splines/other). Frankly, not sure the following approach is the most appropriate path:

WITH raw_lvl AS (
  SELECT "site", "dateval", "raw", sum("cnt") AS "sumcnt"
  FROM   source
  GROUP BY "site", "dateval", "raw"
), cum_raw AS (
  SELECT tlr.*, sum(tlr."sumcnt") OVER "win_cr" AS "cumsumcnt"
  FROM raw_lvl AS "tlr"
  WINDOW "win_cr" AS (PARTITION BY tlr."site", tlr."dateval" ORDER BY tlr."raw" ASC)
)
SELECT cr.*, cr."cumsumcnt"/(sum(cr."sumcnt") OVER "win_pr") AS "percentile" 
FROM cum_raw AS cr
WINDOW "win_pr" AS (PARTITION BY cr."site", cr."dateval");

Postgres version 10.3

Best Answer

Postgres has Ordered-Set Aggregate Functions for your purpose.

The special difficulty: you want rows "weighted" by cnt. If that's supposed to mean that each row represents cnt identical rows, you can multiply input rows by joining to generate_series(1, cnt):

SELECT site, dateval
     , percentile_cont('{0,.25,.5,.75,1}'::float8[]) WITHIN GROUP (ORDER BY raw)
FROM   source s, generate_series(1, s.cnt)
GROUP  BY 1, 2;

db<>fiddle here

But results differ from your expected output (except for the 0 and 100 percentile). So you are "weighting" differently ...

Aside, your original query can be simplified to this equivalent:

SELECT site, dateval, raw, sum(cnt) AS sumcnt
     , sum(sum(cnt)) OVER w AS cumsumcnt
     , sum(sum(cnt)) OVER w / sum(sum(cnt)) OVER (PARTITION BY site, dateval) AS percentile 
FROM   source
GROUP  BY site, dateval, raw
WINDOW w AS (PARTITION BY site, dateval ORDER BY raw);

You can run a window function over the result of an aggregate function in the same SELECT (but not vice versa). See:

Need separate columns that calculate percentage

I added a demo to the fiddle above.

But neither explains the odd numbers in your "expected results". Those strike me as incorrect, no matter how you interpolate. Example: 22.07 in the first line for p25 does not seem to make sense - the value 23 occupies all rows up to the 27.7879 percentile after factoring in cnt according to your own query ...

Related Solutions

PostgreSQL – Select Row at the Nth Percentile

Use the window function ntile() in a subquery (requires Postgres 8.4 or later).
Then select the segments you are interested in (corresponding to percentiles) and pick the row with the lowest value from it:

SELECT DISTINCT ON (segment)
       the_date, to_char((segment - 1)/ 10.0, '99.9') AS percentile, ans
FROM  (
    SELECT t1.the_date 
          ,ntile(1000) OVER (ORDER BY (t2.latency - t1.latency)) AS segment
          ,(t2.latency - t1.latency) AS ans
    FROM   table1 t1
    JOIN   table2 t2 ON t1.id = t2.id
   ) sub
WHERE  segment IN (601, 901, 991, 1000)
ORDER  BY segment, ans;

The Postgres-specific DISTINCT ON comes in handy for the last step. Detailed explanation in this related answer on SO:
Select first row in each GROUP BY group?

To get the 90, 99 and 99.9 percentile I picked the matching granularity with ntile(1000). And added a 60 percentile as per comment.

This algorithm picks the row at or above the exact value. You can add a line to the subquery with percent_rank() to get the exact relative rank of the select row in addition:

 percent_rank() OVER (ORDER BY (t2.latency - t1.latency)) AS pct_rank

Aside: I replaced the column name date with the_date since I am in the habbit of avoiding reserved SQL key words as identifiers, even if Postgres would permit them.

PostgreSQL 9.4 – How to Get 10th and 90th Percentile by Customer

It seems you are after the percentile_disc() ordered-set aggregate function.

The documentation says the following about it:

percentile_disc(fraction) WITHIN GROUP (ORDER BY sort_expression)

discrete percentile: returns the first input value whose position in the ordering equals or exceeds the specified fraction

The syntax is a bit strange for an aggregate, but using it is easy:

SELECT percentile_disc(0.9) WITHIN GROUP (ORDER BY score)
  FROM customer_score
 GROUP BY customer_id;

You define the column from which to take the percentile in the ORDER BY clause.

Best Answer

Related Solutions

PostgreSQL – Select Row at the Nth Percentile

PostgreSQL 9.4 – How to Get 10th and 90th Percentile by Customer

Related Question