SQL Server – Calculating Percentile Using Counts of Occurrences

Tags: sql-server, sql-server-2016

Example table:

Speed       Count
102         3
201         2
205         9
208         4
301         1
303         2
307         6

Count is the number of times a specific speed was measured.

I need to calculate percentiles for the Speed column, not over the distinct Speed values in the table but over each value repeated Count times.

Written out the data looks like this:

102,102,102,201,201,205,205,205,205,205,205,205,205,205,208,208,208,208,301,303,303,307,307,307,307,307,307

I know how to calculate the percentile on this set of data. But is there a way to use the table data without first transforming it into one huge set of individual numbers? The sum of Count in the real data runs into the billions, so expanding the data into individual values would produce a (temporary) dataset of billions of rows.

If transformation is nevertheless necessary, what is the easiest way to do this?

I'm using SQL Server 2016

Best Answer

I'll walk you through one solution to calculate PERCENTILE_DISC for percentiles 1, 2, ... 100. There does not appear to be a single agreed-upon definition of percentile. PERCENTILE_DISC appears to mostly implement the nearest-rank method, except that low percentiles are not undefined (see the last bullet point in the wiki).

One definition of percentile, often given in texts, is that the P-th percentile of a list of N ordered values (sorted from least to greatest) is the smallest value in the list such that P percent of the data is less than or equal to that value.

Per Wikipedia, the percentile P (0 < P <= 100) is the value located at ordinal rank n = CEILING(N * P / 100), where N is the number of ordered values. For your problem, N = SUM(Count) from the table. For a given percentile P you need to find the nth value in the expanded list. You can find that without exploding all of the values into separate rows by calculating a running total of Count ordered by Speed. The row with the smallest speed whose running total is greater than or equal to n holds the percentile value. One implementation of that is below.
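Before the full walk-through, the idea can be sketched for a single percentile. This is a minimal sketch, not the query used later in this answer: the table variable @speeds and the variable @p are hypothetical names, and the BIGINT casts are an assumption added so that the totals would not overflow INT on data whose counts sum to billions.

```sql
-- Hypothetical table variable holding the sample data from the question.
DECLARE @speeds TABLE (SPEED INT PRIMARY KEY, CNT INT NOT NULL);
INSERT INTO @speeds (SPEED, CNT)
VALUES (102, 3), (201, 2), (205, 9), (208, 4), (301, 1), (303, 2), (307, 6);

DECLARE @p INT = 20;  -- percentile to look up, 0 < @p <= 100

WITH running AS
(
    SELECT
        SPEED,
        -- running total of occurrences, ordered by speed
        SUM(CAST(CNT AS BIGINT)) OVER (ORDER BY SPEED ROWS UNBOUNDED PRECEDING) AS RUNNING_CNT,
        -- N: total number of measurements
        SUM(CAST(CNT AS BIGINT)) OVER () AS TOTAL_CNT
    FROM @speeds
)
-- smallest speed whose running total reaches the rank n = CEILING(N * P / 100)
SELECT TOP (1) SPEED
FROM running
WHERE RUNNING_CNT >= CEILING(TOTAL_CNT * 0.01 * @p)
ORDER BY SPEED;
```

For the sample data N = 27, so @p = 20 gives n = CEILING(5.4) = 6. The first running total that reaches 6 is 14, at speed 205.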

First we'll need a numbers table. These are invaluable for many SQL problems. In this example, I'll only put 100 integers in my numbers table because that's all I need to solve this one, but if you create one on your database you should add more than that.

CREATE TABLE #X_NUMBERS (NUM INTEGER NOT NULL, PRIMARY KEY (NUM));

-- put 100 integers into #X_NUMBERS table
WITH e1(n) AS
(
    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
), -- 10
e2(n) AS (SELECT 1 FROM e1 CROSS JOIN e1 AS b)
INSERT INTO #X_NUMBERS
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM e2;

Here is the sample data from the question:

CREATE TABLE #X_SPEED_TABLE (SPEED INT, CNT INT, PRIMARY KEY (SPEED));

-- this is your test data
INSERT INTO #X_SPEED_TABLE
VALUES 
(102, 3),
(201, 2),
(205, 9),
(208, 4),
(301, 1),
(303, 2),
(307, 6);

To make the algorithm easier to present let's just start with percentiles 10, 20, ... 100. Suppose I take the sum of the CNT column from my table and multiply it by 0.1, 0.2, ... 1.0. I need to take the CEILING of the values (this is RANK_COL):

╔══════════╦══════════╦═══════╗
║ PERC_COL ║ RANK_COL ║ SPEED ║
╠══════════╬══════════╬═══════╣
║       10 ║        3 ║ NULL  ║
║       20 ║        6 ║ NULL  ║
║       30 ║        9 ║ NULL  ║
║       40 ║       11 ║ NULL  ║
║       50 ║       14 ║ NULL  ║
║       60 ║       17 ║ NULL  ║
║       70 ║       19 ║ NULL  ║
║       80 ║       22 ║ NULL  ║
║       90 ║       25 ║ NULL  ║
║      100 ║       27 ║ NULL  ║
╚══════════╩══════════╩═══════╝
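The RANK_COL values in this table can be reproduced with a few lines of T-SQL. This is only a sketch: the total of 27 is hard-coded from the sample data, @total is a hypothetical variable name, and the inline VALUES list stands in for the numbers table.

```sql
DECLARE @total BIGINT = 27;  -- SUM of the Count column in the sample data

SELECT
    p.PERC_COL,
    CEILING(@total * 0.01 * p.PERC_COL) AS RANK_COL  -- n = CEILING(N * P / 100)
FROM (VALUES (10), (20), (30), (40), (50), (60), (70), (80), (90), (100)) AS p(PERC_COL)
ORDER BY p.PERC_COL;
```

For example, the 40th percentile maps to CEILING(10.8) = 11 and the 50th to CEILING(13.5) = 14.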

Now I'll look at my data and calculate a running total of CNT ordered by speed. PERC_COL is set to -1 to mark these rows as data rows rather than percentile rows. For your sample data:

╔══════════╦══════════╦═══════╗
║ PERC_COL ║ RANK_COL ║ SPEED ║
╠══════════╬══════════╬═══════╣
║       -1 ║        3 ║   102 ║
║       -1 ║        5 ║   201 ║
║       -1 ║       14 ║   205 ║
║       -1 ║       18 ║   208 ║
║       -1 ║       19 ║   301 ║
║       -1 ║       21 ║   303 ║
║       -1 ║       27 ║   307 ║
╚══════════╩══════════╩═══════╝

Let's combine that data and order it by RANK_COL descending and PERC_COL ascending:

╔══════════╦══════════╦═══════╗
║ PERC_COL ║ RANK_COL ║ SPEED ║
╠══════════╬══════════╬═══════╣
║       -1 ║       27 ║ 307   ║
║      100 ║       27 ║ NULL  ║
║       90 ║       25 ║ NULL  ║
║       80 ║       22 ║ NULL  ║
║       -1 ║       21 ║ 303   ║
║       -1 ║       19 ║ 301   ║
║       70 ║       19 ║ NULL  ║
║       -1 ║       18 ║ 208   ║
║       60 ║       17 ║ NULL  ║
║       -1 ║       14 ║ 205   ║
║       50 ║       14 ║ NULL  ║
║       40 ║       11 ║ NULL  ║
║       30 ║        9 ║ NULL  ║
║       20 ║        6 ║ NULL  ║
║       -1 ║        5 ║ 201   ║
║       -1 ║        3 ║ 102   ║
║       10 ║        3 ║ NULL  ║
╚══════════╩══════════╩═══════╝

Now let's find the running minimum of SPEED as I move down through the rows (the NULL values are ignored):

╔══════════╦═══════╗
║ PERC_COL ║ SPEED ║
╠══════════╬═══════╣
║       -1 ║   307 ║
║      100 ║   307 ║
║       90 ║   307 ║
║       80 ║   307 ║
║       -1 ║   303 ║
║       -1 ║   301 ║
║       70 ║   301 ║
║       -1 ║   208 ║
║       60 ║   208 ║
║       -1 ║   205 ║
║       50 ║   205 ║
║       40 ║   205 ║
║       30 ║   205 ║
║       20 ║   205 ║
║       -1 ║   201 ║
║       -1 ║   102 ║
║       10 ║   102 ║
╚══════════╩═══════╝

Finally, filter out the rows with a -1 value for PERC_COL. You are left with the percentile calculations for percentiles 10, 20, ... 100:

╔══════════╦═══════╗
║ PERC_COL ║ SPEED ║
╠══════════╬═══════╣
║      100 ║   307 ║
║       90 ║   307 ║
║       80 ║   307 ║
║       70 ║   301 ║
║       60 ║   208 ║
║       50 ║   205 ║
║       40 ║   205 ║
║       30 ║   205 ║
║       20 ║   205 ║
║       10 ║   102 ║
╚══════════╩═══════╝

The above algorithm can be implemented in SQL using window functions. Here is one such implementation for percentiles P = 1, 2, ... 100:

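SELECT
  PERC_COL
, SPEED
FROM
(
    SELECT
      PERC_COL
      -- running minimum of SPEED from the highest rank downwards; MIN ignores the NULL placeholders
    , MIN(SPEED) OVER (ORDER BY RANK_COL DESC, PERC_COL ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS SPEED
    FROM
    (
        -- one row per percentile: the rank n = CEILING(N * P / 100) it maps to
        SELECT
          NUM AS PERC_COL
        , CEILING(s.SUM_CNT * 0.01 * NUM) AS RANK_COL
        , NULL AS SPEED
        FROM #X_NUMBERS
        -- BIGINT avoids INT overflow when the counts sum to billions
        CROSS JOIN (SELECT SUM(CAST(CNT AS BIGINT)) AS SUM_CNT FROM #X_SPEED_TABLE) s
        WHERE NUM <= 100

        UNION ALL

        -- one row per speed: the running total of CNT, flagged with PERC_COL = -1
        SELECT
          -1 AS PERC_COL
        , SUM(CAST(CNT AS BIGINT)) OVER (ORDER BY SPEED ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS RANK_COL
        , SPEED
        FROM #X_SPEED_TABLE
    ) t
) t2
WHERE t2.PERC_COL <> -1;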
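A different formulation, not the one used in this answer, joins each percentile directly to the running totals and takes the smallest qualifying speed. It is a sketch under stated assumptions: @speeds is a hypothetical table variable with the sample data, only three percentiles are listed inline, and the join compares every percentile with every distinct speed, so it may not scale as well as the single-pass window query when there are many distinct speeds.

```sql
-- Hypothetical table variable holding the sample data from the question.
DECLARE @speeds TABLE (SPEED INT PRIMARY KEY, CNT INT NOT NULL);
INSERT INTO @speeds (SPEED, CNT)
VALUES (102, 3), (201, 2), (205, 9), (208, 4), (301, 1), (303, 2), (307, 6);

WITH running AS
(
    SELECT
        SPEED,
        SUM(CAST(CNT AS BIGINT)) OVER (ORDER BY SPEED ROWS UNBOUNDED PRECEDING) AS RUNNING_CNT,
        SUM(CAST(CNT AS BIGINT)) OVER () AS TOTAL_CNT
    FROM @speeds
)
SELECT
    p.PERC_COL,
    MIN(r.SPEED) AS SPEED  -- smallest speed whose running total reaches the rank
FROM (VALUES (10), (50), (90)) AS p(PERC_COL)
INNER JOIN running AS r
    ON r.RUNNING_CNT >= CEILING(r.TOTAL_CNT * 0.01 * p.PERC_COL)
GROUP BY p.PERC_COL
ORDER BY p.PERC_COL;
```

On the sample data this returns 102, 205 and 307 for percentiles 10, 50 and 90, matching the table above.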

To check my results, I'm going to calculate all of the percentiles using the built-in PERCENTILE_DISC. They match for your sample data and a few other random data sets that I created.

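-- this is code to test solutions against PERCENTILE_DISC for percentiles 1, 2, ... 100
-- it's not meant to be a well performing solution
-- note: the join to #X_NUMBERS expands each speed into at most 100 rows, so it is only valid while every CNT <= 100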
DECLARE
@perc INT,
@speed INT;
BEGIN
    DECLARE @results_table TABLE (PERC INT, SPEED INT);

    SET @perc = 1;
    WHILE @perc <= 100
    BEGIN
        INSERT INTO @results_table (PERC, SPEED)
        SELECT TOP 1 @perc, PERCENTILE_DISC(0.01 * @perc) WITHIN GROUP (ORDER BY SPEED) OVER ()
        FROM 
        (
            SELECT SPEED
            FROM #X_SPEED_TABLE xst
            INNER JOIN #X_NUMBERS n ON xst.CNT >= n.NUM
        ) t;

        SET @perc = @perc + 1;
    END;

    SELECT * FROM @results_table;
END;

Related Solutions

SQL Server – Updating Computed Column When Table Value Changes

Given the additional information that Aaron Bertrand didn't have access to when he posted his answer I would suggest a different tack.

Instead of putting logic/business significance in table names, I would use general table names and put the logic/business significance in attributes/data in the tables. This should make it easier to expand functionality and maintain your data. Furthermore, you can extract useful information much more easily.

The following is a rough schema that captures the direction I recommend and will probably need to be adapted to your exact needs:

CREATE TABLE dbo.WeatherStation
(
    WeatherStationId INT NOT NULL PRIMARY KEY IDENTITY(1,1),
    Name NVARCHAR(50) NOT NULL -- This is where you put the name of the station instead of in the table.
)

CREATE TABLE dbo.SensorReading
(
    SensorReadingId INT NOT NULL PRIMARY KEY IDENTITY(1,1),
    WeatherStationId INT NOT NULL FOREIGN KEY REFERENCES dbo.WeatherStation(WeatherStationId), -- Match a reading to the station
    ReportedTime DATETIME2(2) NOT NULL DEFAULT SYSUTCDATETIME(), -- When the time was reported to the database
    <Other columns like temp, pressure, etc.>
)

CREATE TABLE dbo.SensorOffset
(
    SensorOffsetId INT NOT NULL PRIMARY KEY IDENTITY(1,1),
    WeatherStationId INT NOT NULL FOREIGN KEY REFERENCES dbo.WeatherStation(WeatherStationId), -- Match a reading to the station like you do now
    Offset DECIMAL(20, 10) NOT NULL, -- Adjust precision/datatype as needed
    Comment NVARCHAR(500) NULL,
    Created DATETIME2(2) NOT NULL DEFAULT SYSUTCDATETIME() -- This would need to be unique per weather station
)

Now you can add a new station without duplicating table schema, you can easily compare data from related stations, etc.

Even if you don't want to, or can't, change your schema, I would recommend putting the calculation in a view. In my opinion that is more obvious than a trigger, and I would find it easier to troubleshoot. Something like the following should work with my schema above:

;WITH CurrentOffset_CTE AS
(
    SELECT
        WeatherStationId
        , MAX(Created) AS Created
    FROM dbo.SensorOffset
    GROUP BY
        WeatherStationId
)
SELECT
    WS.Name
    , SR.ReportedTime
    , CASE WHEN SR.<reading> IS NOT NULL THEN SR.<reading> + SO.Offset ELSE NULL END AS <reading>
    , <repeat same pattern as above for the various readings>
FROM dbo.WeatherStation WS
    INNER JOIN dbo.SensorReading SR ON SR.WeatherStationId = WS.WeatherStationId
    INNER JOIN CurrentOffset_CTE CO ON CO.WeatherStationId = WS.WeatherStationId
    INNER JOIN dbo.SensorOffset SO ON SO.WeatherStationId = CO.WeatherStationId AND SO.Created = CO.Created

This will be easy to troubleshoot, hard to miss, and obvious to future maintainers. You could modify this code to work for your current schema too, but would have to duplicate it for each station. In that case I would still recommend this approach for the above stated reasons.

Sql-server – Calculating Running Average using Over clause in SQL server

You can apply your final filter on @dekade after computing the running averages.

In order to reduce the number of rows that need to be processed for the running averages, you can apply an earlier filter, [dekade] IN (@dekade, (@dekade+1)%36, (@dekade+2)%36). This processes the minimum number of rows while still including every row needed to cover the following 11 rows in the running average. (The only reason for the % 36 is to handle values of @dekade that fall at the end of the year.)
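To see what the wrap-around filter matches near the end of the year, the three expressions can be evaluated on their own. This is only a sketch: the question does not say whether dekade is numbered 0 to 35 or 1 to 36, so the second SELECT is a hypothetical 1-based variant, not part of the original answer.

```sql
DECLARE @dekade TINYINT = 35;

-- The expressions from the filter: they wrap to 0, which suits 0-based numbering (0..35).
SELECT
    @dekade            AS FIRST_DEKADE,   -- 35
    (@dekade + 1) % 36 AS SECOND_DEKADE,  -- 36 % 36 = 0
    (@dekade + 2) % 36 AS THIRD_DEKADE;   -- 37 % 36 = 1

-- Assumption: if dekade runs from 1 to 36 instead, a 1-based wrap would be needed.
SELECT
    @dekade                  AS FIRST_DEKADE,   -- 35
    (@dekade % 36) + 1       AS SECOND_DEKADE,  -- 36
    ((@dekade + 1) % 36) + 1 AS THIRD_DEKADE;   -- 1
```

With 0-based numbering the original expressions already do the right thing. With 1-based numbering they would look for dekade 0, which never exists, and skip dekade 36.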

This will still result in a table scan given your current table structure, but at least the rows can be filtered out earlier on in the query plan.

DECLARE @dekade TINYINT = 1
SELECT *
FROM (
    SELECT 
        LOCID,
        [Year],
        [Date],
        [dekade],
        AVG(cast(RTT as float)) OVER 
            (PARTITION BY LOCID ORDER BY Date
                ROWS BETWEEN CURRENT ROW AND 11 FOLLOWING) AS avgRTT
    FROM TimeSeries
    -- If you want to limit the rows that are used when computing the running average,
    -- you can make sure that only the desired @dekade plus the following two dekades
    -- (which may be needed to get the following 11 rows) are used for each year
    WHERE [dekade] IN (@dekade, (@dekade+1)%36, (@dekade+2)%36)
) x
-- Filter your results by @dekade after computing the running average
WHERE x.[dekade] = @dekade

It would be helpful to post some sample data so that we can actually run the query though.
