PostgreSQL – How to Use Row Number with Reset

Tags: cte, postgresql, postgresql-9.4, window-functions

I have this data set in PostgreSQL:

    User_id    |            date         | is_cleared |
---------------+-------------------------+------------+
             C | 2016-08-03 18:28:38.37  |          1 |
             A | 2016-08-01 19:59:36.548 |          0 |
             A | 2016-08-01 23:38:29.912 |          0 |
             A | 2016-08-04 19:03:30.585 |          0 |
             A | 2016-08-08 19:53:20.251 |          0 |
             A | 2016-08-11 21:00:16.075 |          0 |
             A | 2016-08-13 21:07:01.336 |          0 |
             A | 2016-08-15 10:17:32.746 |          0 |
             A | 2016-08-17 11:54:16.75  |          0 |
             A | 2016-08-18 21:12:22.717 |          0 |
             A | 2016-08-21 19:59:26.441 |          0 |
             A | 2016-08-22 17:19:42.215 |          0 |
             A | 2016-08-24 20:18:45.516 |          1 |
             A | 2016-08-27 13:22:33.98  |          1 |
             A | 2016-08-31 16:09:25.756 |          0 |
             B | 2016-08-03 21:10:32.945 |          1 |
             B | 2016-08-05 20:40:55.644 |          1 |
             B | 2016-08-06 10:59:00.796 |          1 |
             B | 2016-08-09 22:36:50.674 |          0 |
             B | 2016-08-11 09:37:21.29  |          1 |
             B | 2016-08-13 10:15:15.858 |          1 |
             B | 2016-08-26 09:56:48.937 |          1 |
             B | 2016-08-29 09:34:13.96  |          1 |

And I need an output like this:

    User_id    |            date         | is_cleared | custom_rank
---------------+-------------------------+------------+-------------
             C | 2016-08-03 18:28:38.37  |          1 |  1
             A | 2016-08-01 19:59:36.548 |          0 |  1
             A | 2016-08-01 23:38:29.912 |          0 |  2
             A | 2016-08-04 19:03:30.585 |          0 |  3
             A | 2016-08-08 19:53:20.251 |          0 |  4
             A | 2016-08-11 21:00:16.075 |          0 |  5
             A | 2016-08-13 21:07:01.336 |          0 |  6
             A | 2016-08-15 10:17:32.746 |          0 |  7
             A | 2016-08-17 11:54:16.75  |          0 |  8
             A | 2016-08-18 21:12:22.717 |          0 |  9
             A | 2016-08-21 19:59:26.441 |          0 |  10
             A | 2016-08-22 17:19:42.215 |          0 |  11
             A | 2016-08-24 20:18:45.516 |          1 |  12
             A | 2016-08-27 13:22:33.98  |          1 |  1
             A | 2016-08-31 16:09:25.756 |          0 |  1
             B | 2016-08-03 21:10:32.945 |          1 |  1
             B | 2016-08-05 20:40:55.644 |          1 |  1
             B | 2016-08-06 10:59:00.796 |          1 |  1
             B | 2016-08-09 22:36:50.674 |          0 |  1
             B | 2016-08-11 09:37:21.29  |          1 |  2
             B | 2016-08-13 10:15:15.858 |          1 |  1
             B | 2016-08-26 09:56:48.937 |          1 |  1
             B | 2016-08-29 09:34:13.96  |          1 |  1

Basically, my goal is to number the rows in each run of zeros in the is_cleared flag, including the first "1" that follows the run, and to restart the numbering after that "1".

The data is sorted by date and should also be partitioned by user_id.

Best Answer

Window functions and a single subquery:

SELECT user_id, date, is_cleared
     , count(*) OVER (PARTITION BY user_id, grp ORDER BY date) AS custom_rank
FROM  (
   SELECT user_id, date, is_cleared
        , count(is_cleared = 1 OR NULL) OVER (PARTITION BY user_id ORDER BY date DESC) AS grp
   FROM   tbl
   ) t
ORDER  BY user_id, date;

The special difficulty here: each partition ends with a 1, whereas in most similar cases each partition starts with one. The simple solution is to keep a running count of 1s in descending date order. This way, each row shares its grp value with the next 1 that follows it chronologically, if one exists, so each partition includes that 1. Voilà.

This assumes all involved columns are NOT NULL and (user_id, date) is unique.
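To make the intermediate step visible, here is a minimal sketch that runs the same two-level query on a few of user A's rows from the question and also outputs grp. The inline VALUES list stands in for the real table and is only an illustration.

```sql
-- Illustrative stand-in for the real table: the last five rows of user A.
WITH tbl(user_id, date, is_cleared) AS (
   VALUES
     ('A', timestamp '2016-08-21 19:59:26.441', 0)
   , ('A', timestamp '2016-08-22 17:19:42.215', 0)
   , ('A', timestamp '2016-08-24 20:18:45.516', 1)
   , ('A', timestamp '2016-08-27 13:22:33.98' , 1)
   , ('A', timestamp '2016-08-31 16:09:25.756', 0)
)
SELECT user_id, date, is_cleared, grp
     , count(*) OVER (PARTITION BY user_id, grp ORDER BY date) AS custom_rank
FROM  (
   SELECT user_id, date, is_cleared
          -- running count of 1s, walking backwards in time
        , count(is_cleared = 1 OR NULL) OVER (PARTITION BY user_id ORDER BY date DESC) AS grp
   FROM   tbl
   ) t
ORDER  BY user_id, date;

-- date (abbreviated) | is_cleared | grp | custom_rank
-- 2016-08-21         |          0 |   2 |           1
-- 2016-08-22         |          0 |   2 |           2
-- 2016-08-24         |          1 |   2 |           3
-- 2016-08-27         |          1 |   1 |           1
-- 2016-08-31         |          0 |   0 |           1
```

The two zeros and the 1 that follows them share grp = 2, so the outer count numbers them 1 to 3. The trailing zero has no later 1 and ends up alone in grp = 0.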

Basics:

Select longest continuous sequence

Asides:

The first row of B gets custom_rank = 1. Not 2.
The column is_cleared should really be boolean.
A timestamp column shouldn't be called "date".
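As a sketch of what the last two asides could look like in practice, the query below assumes a hypothetical schema where is_cleared is boolean and the timestamp column is named ts. Neither is the original schema. With a boolean flag, the aggregate FILTER clause (available since Postgres 9.4) can replace the `= 1 OR NULL` expression.

```sql
-- Hypothetical sample rows; the column name ts and the boolean type are assumptions.
WITH tbl(user_id, ts, is_cleared) AS (
   VALUES
     ('B', timestamp '2016-08-06 10:59:00.796', true)
   , ('B', timestamp '2016-08-09 22:36:50.674', false)
   , ('B', timestamp '2016-08-11 09:37:21.29' , true)
   , ('B', timestamp '2016-08-13 10:15:15.858', true)
)
SELECT user_id, ts, is_cleared
     , count(*) OVER (PARTITION BY user_id, grp ORDER BY ts) AS custom_rank
FROM  (
   SELECT user_id, ts, is_cleared
          -- count only the cleared rows, again in descending time order
        , count(*) FILTER (WHERE is_cleared)
                   OVER (PARTITION BY user_id ORDER BY ts DESC) AS grp
   FROM   tbl
   ) t
ORDER  BY user_id, ts;

-- custom_rank for the four rows, in ts order: 1, 1, 2, 1
```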

Related Solutions

PostgreSQL – Select Row at the Nth Percentile

Use the window function ntile() in a subquery (requires Postgres 8.4 or later).
Then select the segments you are interested in (corresponding to percentiles) and pick the row with the lowest value from it:

SELECT DISTINCT ON (segment)
       the_date, to_char((segment - 1)/ 10.0, '99.9') AS percentile, ans
FROM  (
    SELECT t1.the_date 
          ,ntile(1000) OVER (ORDER BY (t2.latency - t1.latency)) AS segment
          ,(t2.latency - t1.latency) AS ans
    FROM   table1 t1
    JOIN   table2 t2 ON t1.id = t2.id
   ) sub
WHERE  segment IN (601, 901, 991, 1000)
ORDER  BY segment, ans;

The Postgres-specific DISTINCT ON comes in handy for the last step. Detailed explanation in this related answer on SO:
Select first row in each GROUP BY group?

To get the 90th, 99th and 99.9th percentiles, I picked the matching granularity with ntile(1000), and added the 60th percentile as requested in a comment.

This algorithm picks the row at or above the exact value. You can add a line to the subquery with percent_rank() to get the exact relative rank of the selected row as well:

 percent_rank() OVER (ORDER BY (t2.latency - t1.latency)) AS pct_rank
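To see how the ntile() segment and percent_rank() relate, here is a small self-contained sketch. It uses generate_series(1, 100) as a stand-in for the latency differences and ntile(10) instead of ntile(1000). Both are assumptions made to keep the numbers easy to follow.

```sql
-- 100 illustrative values (1..100) split into 10 segments of 10 rows each.
SELECT DISTINCT ON (segment)
       val, segment, pct_rank
FROM  (
   SELECT val
        , ntile(10)      OVER (ORDER BY val) AS segment
        , percent_rank() OVER (ORDER BY val) AS pct_rank
   FROM   generate_series(1, 100) AS g(val)
   ) sub
WHERE  segment IN (7, 10)       -- first rows at or above the 60th and 90th percentile
ORDER  BY segment, val;

-- val | segment | pct_rank
--  61 |       7 | approx. 0.606  (60/99)
--  91 |      10 | approx. 0.909  (90/99)
```

Segment 7 starts after 60 of the 100 rows, so its lowest row is the first one at or above the 60th percentile. percent_rank() reports (rank - 1) / (rows - 1), which is why it lands slightly above 0.6.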

Aside: I replaced the column name date with the_date since I am in the habit of avoiding reserved SQL key words as identifiers, even if Postgres would permit them.

ORDER BY vs ROW_NUMBER for Top N Queries in PostgreSQL

First, to answer the question implied in the comments, namely that assigning row numbers with the ROW_NUMBER() window function seems inefficient because we already have the COUNT(violation) numbers:

This is needed because the COUNT numbers can be different for each partition (class). Since we want the top 2 for each class, no fixed condition on the counts themselves would work. With row numbers, we can use WHERE v < 3, which gives us the top 2.
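The ROW_NUMBER query under discussion is not reproduced here. A minimal sketch of that approach, assuming the people and discipline tables used further down and a row-number alias v, might look like the following. The sample rows are hypothetical.

```sql
-- Hypothetical sample data standing in for the real tables.
WITH people(id, name, class) AS (
   VALUES (1, 'Ann', 'x'), (2, 'Bob', 'x'), (3, 'Cid', 'x'), (4, 'Dee', 'y')
)
, discipline(people_id, violation) AS (
   VALUES (1, 'late'), (1, 'noise'), (1, 'late')
        , (2, 'late'), (2, 'noise')
        , (3, 'late')
        , (4, 'late')
)
SELECT id, name, class, violations
FROM  (
   SELECT p.id, p.name, p.class
        , count(d.violation) AS violations
          -- number people within each class, highest count first
        , row_number() OVER (PARTITION BY p.class
                             ORDER BY count(d.violation) DESC) AS v
   FROM   people p
   JOIN   discipline d ON d.people_id = p.id
   GROUP  BY p.id, p.name, p.class
   ) sub
WHERE  v < 3                    -- keep the top 2 per class
ORDER  BY class, violations DESC;

-- id | name | class | violations
--  1 | Ann  | x     |          3
--  2 | Bob  | x     |          2
--  4 | Dee  | y     |          1
```

A fixed threshold on violations could not express this: class x needs a cutoff of 2, while class y's only row has a count of 1.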

LATERAL joins were added in Postgres 9.3. They are similar to CROSS APPLY and OUTER APPLY in SQL Server. With this kind of join, you can write a query that uses the COUNT numbers and a top 2 for each partition. Whether it is more or less efficient is something you can test:

WITH classes AS
  ( SELECT DISTINCT class
    FROM people
  ) 
                             -- if you have a "classes" table, skip the above lines
SELECT 
    v.id, v.name, c.class,
    v.violations
FROM
    classes AS c,
  LATERAL 
    ( SELECT p.id, p.name,
             COUNT(d.violation) AS violations
      FROM people AS p
        INNER JOIN discipline AS d
          ON d.people_id = p.id
      WHERE p.class = c.class 
      GROUP BY p.id, p.name
      ORDER BY violations DESC
      LIMIT 2
  ) AS v
ORDER BY
    c.class, v.violations DESC ;

Tested at SQL-Fiddle
