Example table:
Speed Count
102 3
201 2
205 9
208 4
301 1
303 2
307 6
Count is the number a specific speed is measured.
I need to calculate the percentile for the column Speed but not on the Speed values in the table but for the values times the count of those values.
Written out the data looks like this:
102,102,102,201,201,205,205,205,205,205,205,205,205,205,208,208,208,208,301,303,303,307,307,307,307,307,307
I know how to calculate the percentile on this set of data. But is there a way to use the table data without first transforming it in one huge set of individual numbers? The sum of Count in the real data runs in the billions so transforming the data in individual values would result in a (temporary) dataset of billions of rows.
If transformation is nevertheless necessary, what is the easiest way to do this?
I'm using SQL Server 2016
Best Answer
I'll walk you through one solution to calculate PERCENTILE_DISC for percentiles 1, 2, ... 100. Apparently there is not an agreed upon definition for percentile. It appears that PERCENTILE_DISC mostly implements the nearest rank method, except that low percentiles are not undefined (see the last bullet point in the wiki).
Per wikipedia, the percentile
P (0 < P <= 100)
is the value located at ordinal rankn = CEILING (N * P / 100)
, whereN
is the number of ordered values. For your problem,N = SUM(COUNT)
from the table. For a given percentileP
you need to find the nth row in the table. You can find that without exploding all of the values into separate rows by calculating a running total. If you take the row with the largest speed with a running total less than or equal ton
then you have the percentile value. One implementation of that is below.First we'll need a numbers table. These are invaluable for many SQL problems. In this example, I'll only put 100 integers in my numbers table because that's all I need to solve this one, but if you create one on your database you should add more than that.
Here is the sample data from the question:
To make the algorithm easier to present let's just start with percentiles 10, 20, ... 100. Suppose I take the sum of the
CNT
column from my table and multiply it by 0.1, 0.2, ... 1.0. I need to take theCEILING
of the values (this isRANK_COL
):Now I'll look at my data and calculate a running total based on speed. For your sample data:
Let's combine that data and order it by
RANK_COL
descending andPERC_COL
ascending:Now let's find the minimum
SPEED
value as I loop through the rows:Finally, filter out the rows with a -1 value for
PERC_COL
. You are left with the percentile calculations for percentiles 10, 20, ... 100:The above algorithm can be implemented in SQL using window functions. Here is one such implementation for N = 1, 2, ... 100:
To check my results, I'm going to calculate all of the percentiles using the built-in
PERCENTILE_DISC
. They match for your sample data and a few other random data sets that I created.