SQL Server – Calculating Percentile Using Counts of Occurrences

sql serversql-server-2016

Example table:

Speed       Count
102         3
201         2
205         9
208         4
301         1
303         2
307         6

Count is the number a specific speed is measured.

I need to calculate the percentile for the column Speed but not on the Speed values in the table but for the values times the count of those values.

Written out the data looks like this:

102,102,102,201,201,205,205,205,205,205,205,205,205,205,208,208,208,208,301,303,303,307,307,307,307,307,307

I know how to calculate the percentile on this set of data. But is there a way to use the table data without first transforming it in one huge set of individual numbers? The sum of Count in the real data runs in the billions so transforming the data in individual values would result in a (temporary) dataset of billions of rows.

If transformation is nevertheless necessary, what is the easiest way to do this?

I'm using SQL Server 2016

Best Answer

I'll walk you through one solution to calculate PERCENTILE_DISC for percentiles 1, 2, ... 100. Apparently there is not an agreed upon definition for percentile. It appears that PERCENTILE_DISC mostly implements the nearest rank method, except that low percentiles are not undefined (see the last bullet point in the wiki).

One definition of percentile, often given in texts, is that the P-th percentile of a list of N ordered values (sorted from least to greatest) is the smallest value in the list such that P percent of the data is less than or equal to that value.

Per wikipedia, the percentile P (0 < P <= 100) is the value located at ordinal rank n = CEILING (N * P / 100), where N is the number of ordered values. For your problem, N = SUM(COUNT) from the table. For a given percentile P you need to find the nth row in the table. You can find that without exploding all of the values into separate rows by calculating a running total. If you take the row with the largest speed with a running total less than or equal to n then you have the percentile value. One implementation of that is below.

First we'll need a numbers table. These are invaluable for many SQL problems. In this example, I'll only put 100 integers in my numbers table because that's all I need to solve this one, but if you create one on your database you should add more than that.

CREATE TABLE #X_NUMBERS (NUM INTEGER NOT NULL, PRIMARY KEY (NUM));

-- put 100 integers into #X_NUMBERS table
WITH e1(n) AS
(
    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
), -- 10
e2(n) AS (SELECT 1 FROM e1 CROSS JOIN e1 AS b)
INSERT INTO #X_NUMBERS
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM e2;

Here is the sample data from the question:

CREATE TABLE #X_SPEED_TABLE (SPEED INT, CNT INT, PRIMARY KEY (SPEED));

-- this is your test data
INSERT INTO #X_SPEED_TABLE
VALUES 
(102, 3),
(201, 2),
(205, 9),
(208, 4),
(301, 1),
(303, 2),
(307, 6);

To make the algorithm easier to present let's just start with percentiles 10, 20, ... 100. Suppose I take the sum of the CNT column from my table and multiply it by 0.1, 0.2, ... 1.0. I need to take the CEILING of the values (this is RANK_COL):

╔══════════╦══════════╦═══════╗
║ PERC_COL ║ RANK_COL ║ SPEED ║
╠══════════╬══════════╬═══════╣
║       10 ║        3 ║ NULL  ║
║       20 ║        6 ║ NULL  ║
║       30 ║        9 ║ NULL  ║
║       40 ║       11 ║ NULL  ║
║       50 ║       14 ║ NULL  ║
║       60 ║       17 ║ NULL  ║
║       70 ║       19 ║ NULL  ║
║       80 ║       22 ║ NULL  ║
║       90 ║       25 ║ NULL  ║
║      100 ║       27 ║ NULL  ║
╚══════════╩══════════╩═══════╝

Now I'll look at my data and calculate a running total based on speed. For your sample data:

╔══════════╦══════════╦═══════╗
║ PERC_COL ║ RANK_COL ║ SPEED ║
╠══════════╬══════════╬═══════╣
║       -1 ║        3 ║   102 ║
║       -1 ║        5 ║   201 ║
║       -1 ║       14 ║   205 ║
║       -1 ║       18 ║   208 ║
║       -1 ║       19 ║   301 ║
║       -1 ║       21 ║   303 ║
║       -1 ║       27 ║   307 ║
╚══════════╩══════════╩═══════╝

Let's combine that data and order it by RANK_COL descending and PERC_COL ascending:

╔══════════╦══════════╦═══════╗
║ PERC_COL ║ RANK_COL ║ SPEED ║
╠══════════╬══════════╬═══════╣
║       -1 ║       27 ║ 307   ║
║      100 ║       27 ║ NULL  ║
║       90 ║       25 ║ NULL  ║
║       80 ║       22 ║ NULL  ║
║       -1 ║       21 ║ 303   ║
║       -1 ║       19 ║ 301   ║
║       70 ║       19 ║ NULL  ║
║       -1 ║       18 ║ 208   ║
║       60 ║       17 ║ NULL  ║
║       -1 ║       14 ║ 205   ║
║       50 ║       14 ║ NULL  ║
║       40 ║       11 ║ NULL  ║
║       30 ║        9 ║ NULL  ║
║       20 ║        6 ║ NULL  ║
║       -1 ║        5 ║ 201   ║
║       -1 ║        3 ║ 102   ║
║       10 ║        3 ║ NULL  ║
╚══════════╩══════════╩═══════╝

Now let's find the minimum SPEED value as I loop through the rows:

╔══════════╦═══════╗
║ PERC_COL ║ SPEED ║
╠══════════╬═══════╣
║       -1 ║   307 ║
║      100 ║   307 ║
║       90 ║   307 ║
║       80 ║   307 ║
║       -1 ║   303 ║
║       -1 ║   301 ║
║       70 ║   301 ║
║       -1 ║   208 ║
║       60 ║   208 ║
║       -1 ║   205 ║
║       50 ║   205 ║
║       40 ║   205 ║
║       30 ║   205 ║
║       20 ║   205 ║
║       -1 ║   201 ║
║       -1 ║   102 ║
║       10 ║   102 ║
╚══════════╩═══════╝

Finally, filter out the rows with a -1 value for PERC_COL. You are left with the percentile calculations for percentiles 10, 20, ... 100:

╔══════════╦═══════╗
║ PERC_COL ║ SPEED ║
╠══════════╬═══════╣
║      100 ║   307 ║
║       90 ║   307 ║
║       80 ║   307 ║
║       70 ║   301 ║
║       60 ║   208 ║
║       50 ║   205 ║
║       40 ║   205 ║
║       30 ║   205 ║
║       20 ║   205 ║
║       10 ║   102 ║
╚══════════╩═══════╝

The above algorithm can be implemented in SQL using window functions. Here is one such implementation for N = 1, 2, ... 100:

SELECT 
  PERC_COL
, SPEED
FROM
(
    SELECT
    PERC_COL
    , MIN(SPEED) OVER (ORDER BY RANK_COL DESC, PERC_COL ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) SPEED
    FROM 
    (
        SELECT 
          NUM PERC_COL
        , CEILING(s.SUM_SPEED * 0.01 * NUM) RANK_COL
        , NULL SPEED
        FROM #X_NUMBERS
        CROSS JOIN (SELECT SUM(CNT) SUM_SPEED FROM #X_SPEED_TABLE) s
        WHERE NUM <= 100

        UNION ALL

        SELECT
          -1 PERC_COL
        , SUM(CNT) OVER (ORDER BY SPEED ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) RANK_COL
        , SPEED
        FROM #X_SPEED_TABLE
    ) t
) t2
WHERE t2.PERC_COL <> -1;

To check my results, I'm going to calculate all of the percentiles using the built-in PERCENTILE_DISC. They match for your sample data and a few other random data sets that I created.

-- this is code to test solutions against PERCENTILE_DISC for percentiles 1, 2, ... 100
-- it's not meant to be a well performing solution
DECLARE
@perc INT,
@speed INT;
BEGIN
    DECLARE @results_table TABLE (PERC INT, SPEED INT);

    SET @perc = 1;
    WHILE @perc <= 100
    BEGIN
        INSERT INTO @results_table (PERC, SPEED)
        SELECT TOP 1 @perc, PERCENTILE_DISC(0.01 * @perc) WITHIN GROUP (ORDER BY SPEED) OVER ()
        FROM 
        (
            SELECT SPEED
            FROM #X_SPEED_TABLE xst
            INNER JOIN #X_NUMBERS n ON xst.CNT >= n.NUM
        ) t;

        SET @perc = @perc + 1;
    END;

    SELECT * FROM @results_table;
END;