PostgreSQL – return continuous set of IDs based on two other columns

gaps-and-islands postgresql

Given a table such as the one below, how can I ask PostgreSQL to return the first set of n rows in which one column has a certain value (code = 0), the values of a second column (named segment) are all the same, and the sequence of IDs is not broken?

For n = 4, the result should be IDs 12, 13, 14, 15 (i.e. the first four IDs where code = 0 and the values of segment are all the same, in this case 2).

CREATE TABLE foo
AS
  SELECT *
  FROM ( VALUES
    (  1, 0, 0 ),
    (  2, 0, 0 ),
    (  3, 0, 0 ),
    (  4, 0, 1 ),
    (  5, 0, 1 ),
    (  6, 0, 1 ),
    (  7, 1, 1 ),
    (  8, 0, 1 ),
    (  9, 1, 2 ),
    ( 10, 0, 2 ),   
    ( 11, 1, 2 ),
    ( 12, 0, 2 ), -- HERE
    ( 13, 0, 2 ), -- HERE
    ( 14, 0, 2 ), -- HERE
    ( 15, 0, 2 ), -- HERE
    ( 16, 0, 3 )
  ) AS t(id,code,segment);

Best Answer

You can do something like this:

SELECT
  array_agg(id),
  code,
  segment,
  grp,
  count(*)
FROM (
  SELECT
    id,
    code,
    segment,
    count(is_reset) OVER (ORDER BY id) AS grp
  FROM (
    SELECT
      id,
      code,
      segment,
      CASE
        WHEN
          code <> lag(code) OVER (ORDER BY id)
          OR segment <> lag(segment) OVER (ORDER BY id)
        THEN 1
      END AS is_reset
    FROM foo
  ) AS t
) AS g
GROUP BY code, segment, grp;

   array_agg   | code | segment | grp | count 
---------------+------+---------+-----+-------
 {11}          |    1 |       2 |   6 |     1
 {10}          |    0 |       2 |   5 |     1
 {1,2,3}       |    0 |       0 |   0 |     3
 {7}           |    1 |       1 |   2 |     1
 {4,5,6}       |    0 |       1 |   1 |     3
 {16}          |    0 |       3 |   8 |     1
 {8}           |    0 |       1 |   3 |     1
 {12,13,14,15} |    0 |       2 |   7 |     4
 {9}           |    1 |       2 |   4 |     1
(9 rows)

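To answer the question, filter this result for code = 0 AND count >= 4 (for example with WHERE and HAVING in the same query, or by wrapping it in an outer query) and take the matching group with the lowest id.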
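As a minimal sketch of that filter, assuming n = 4 is hard-coded and that only the first n IDs of the first qualifying group are wanted. The array slice, the ORDER BY inside array_agg and the ids alias are additions for illustration, not part of the query above:

```sql
SELECT
  (array_agg(id ORDER BY id))[1:4] AS ids,  -- keep only the first n = 4 ids
  code,
  segment
FROM (
  SELECT
    id,
    code,
    segment,
    count(is_reset) OVER (ORDER BY id) AS grp
  FROM (
    SELECT
      id,
      code,
      segment,
      CASE
        WHEN
          code <> lag(code) OVER (ORDER BY id)
          OR segment <> lag(segment) OVER (ORDER BY id)
        THEN 1
      END AS is_reset
    FROM foo
  ) AS t
) AS g
WHERE code = 0              -- applied after grp is computed, so groups stay intact
GROUP BY code, segment, grp
HAVING count(*) >= 4        -- at least n rows in the run
ORDER BY min(id)            -- earliest qualifying run first
LIMIT 1;

--       ids      | code | segment
-- ---------------+------+---------
--  {12,13,14,15} |    0 |       2
```

The WHERE clause sits in the outer query on purpose: filtering on code inside the innermost subquery would remove the code = 1 rows before lag() runs and would merge runs that those rows are supposed to separate.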

Explanation

First, we flag every row where code or segment differs from the previous row (a reset):

SELECT
  id,
  code,
  segment,
  CASE
    WHEN
      code <> lag(code) OVER (ORDER BY id)
      OR segment <> lag(segment) OVER (ORDER BY id)
    THEN 1
  END AS is_reset
FROM foo

 id | code | segment | is_reset 
----+------+---------+----------
  1 |    0 |       0 |         
  2 |    0 |       0 |         
  3 |    0 |       0 |         
  4 |    0 |       1 |        1
  5 |    0 |       1 |         
  6 |    0 |       1 |         
  7 |    1 |       1 |        1
  8 |    0 |       1 |        1
  9 |    1 |       2 |        1
 10 |    0 |       2 |        1
 11 |    1 |       2 |        1
 12 |    0 |       2 |        1
 13 |    0 |       2 |         
 14 |    0 |       2 |         
 15 |    0 |       2 |         
 16 |    0 |       3 |        1
(16 rows)

Then we create groups by taking a running count of the resets:

  SELECT
    id,
    code,
    segment,
    count(is_reset) OVER (ORDER BY id) AS grp
  FROM (
    SELECT
      id,
      code,
      segment,
      CASE
        WHEN
          code <> lag(code) OVER (ORDER BY id)
          OR segment <> lag(segment) OVER (ORDER BY id)
        THEN 1
      END AS is_reset
    FROM foo
  ) AS t;

 id | code | segment | grp 
----+------+---------+-----
  1 |    0 |       0 |   0
  2 |    0 |       0 |   0
  3 |    0 |       0 |   0
  4 |    0 |       1 |   1
  5 |    0 |       1 |   1
  6 |    0 |       1 |   1
  7 |    1 |       1 |   2
  8 |    0 |       1 |   3
  9 |    1 |       2 |   4
 10 |    0 |       2 |   5
 11 |    1 |       2 |   6
 12 |    0 |       2 |   7
 13 |    0 |       2 |   7
 14 |    0 |       2 |   7
 15 |    0 |       2 |   7
 16 |    0 |       3 |   8
(16 rows)

Notice that grp is now effectively a function of (code, segment): it increments whenever either value changes. Finally, we GROUP BY code, segment, grp and take count(*); because you want the IDs themselves, we also run them through array_agg to collect them into an array.

Note: you say "and the sequence of IDs is not broken". I'm working on the assumption that you don't mean this literally, i.e. that the id values themselves have no gaps. PostgreSQL sequences don't guarantee gapless values, so relying on that isn't a good idea anyway. I assume you wanted what I've done: rows that are adjacent when ordered by id.
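If gaps in id should in fact break a run, one possible variation adds a third reset condition that compares each id with the previous id plus one. This is a sketch under that assumption, not part of the approach above; the ids and cnt aliases are illustrative:

```sql
SELECT
  array_agg(id ORDER BY id) AS ids,
  code,
  segment,
  count(*) AS cnt
FROM (
  SELECT
    id,
    code,
    segment,
    count(is_reset) OVER (ORDER BY id) AS grp
  FROM (
    SELECT
      id,
      code,
      segment,
      CASE
        WHEN
          code <> lag(code) OVER (ORDER BY id)
          OR segment <> lag(segment) OVER (ORDER BY id)
          OR id <> lag(id) OVER (ORDER BY id) + 1  -- added: a gap in id also starts a new group
        THEN 1
      END AS is_reset
    FROM foo
  ) AS t
) AS g
GROUP BY code, segment, grp
ORDER BY min(id);
```

With the sample data the result is unchanged, because the ids 1 to 16 have no gaps. For the first row lag() returns NULL, so all three comparisons are NULL and no reset is flagged, exactly as in the original query.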

General solution for this class of problems

To get the longest sequence (1 result, longest of all, arbitrary pick if there are ties):

SELECT race_id, car_type, count(*) AS seq_len
FROM  (
   SELECT *, count(*) FILTER (WHERE step)
                      OVER (ORDER BY race_id, car_type, lap_no) AS grp
   FROM  (
      SELECT *, (lag(lap_no) OVER (PARTITION BY race_id, car_type ORDER BY lap_no) + 1)
                 IS DISTINCT FROM lap_no AS step
      FROM   tbl
      ) x
   ) y
GROUP  BY race_id, car_type, grp
ORDER  BY seq_len DESC
LIMIT  1;

count(*) FILTER (WHERE step) counts only rows where step is TRUE (= a step to the next group), so the running count produces a new number for every new group.
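To see how step and grp behave, here is a small self-contained sketch. The five sample rows are hypothetical (the original tbl is not shown); lap 3 is missing, so a new group should start at lap 4:

```sql
-- Hypothetical sample data standing in for tbl
WITH tbl(race_id, car_type, lap_no) AS (
   VALUES (1, 'red', 1), (1, 'red', 2), (1, 'red', 4), (1, 'red', 5), (1, 'red', 6)
)
SELECT lap_no, step
     , count(*) FILTER (WHERE step)
                OVER (ORDER BY race_id, car_type, lap_no) AS grp  -- running count of steps
FROM  (
   SELECT *, (lag(lap_no) OVER (PARTITION BY race_id, car_type ORDER BY lap_no) + 1)
              IS DISTINCT FROM lap_no AS step  -- TRUE for the first row and after every gap
   FROM   tbl
   ) x
ORDER  BY lap_no;

--  lap_no | step | grp
-- --------+------+-----
--       1 | t    |   1
--       2 | f    |   1
--       4 | t    |   2
--       5 | f    |   2
--       6 | f    |   2
```

IS DISTINCT FROM is what makes the first row of each partition count as a step: lag() returns NULL there, and NULL + 1 is treated as different from any lap_no instead of yielding NULL.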

Related question on Stack Overflow, with one answer featuring a procedural solution in PL/pgSQL:

GROUP BY and aggregate sequential numeric values

If the top requirement is performance, the plpgsql function is typically faster in this particular case because it can calculate the result in a single scan.

Faster for consecutive numbers

We can capitalize on the fact that consecutive lap_no define a sequence, for a much simpler and faster version:

SELECT race_id, car_type, count(*) AS seq_len
FROM  (
   SELECT race_id, car_type
        , row_number() OVER (PARTITION BY race_id, car_type ORDER BY lap_no) - lap_no AS grp
   FROM   tbl
   ) x
GROUP  BY race_id, car_type, grp
ORDER  BY seq_len DESC
LIMIT  1;

Consecutive laps end up in the same grp. Every missing lap results in a lower grp per partition.

This relies on (race_id, car_type, lap_no) being UNIQUE NOT NULL. NULL values or duplicates could break the logic.
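A short worked example of the row_number() trick, using hypothetical sample rows in place of tbl (lap 3 is missing):

```sql
-- Hypothetical sample data standing in for tbl
WITH tbl(race_id, car_type, lap_no) AS (
   VALUES (1, 'red', 1), (1, 'red', 2), (1, 'red', 4), (1, 'red', 5), (1, 'red', 6)
)
SELECT lap_no
     , row_number() OVER (PARTITION BY race_id, car_type ORDER BY lap_no) AS rn
     , row_number() OVER (PARTITION BY race_id, car_type ORDER BY lap_no) - lap_no AS grp
FROM   tbl
ORDER  BY lap_no;

--  lap_no | rn | grp
-- --------+----+-----
--       1 |  1 |   0
--       2 |  2 |   0
--       4 |  3 |  -1
--       5 |  4 |  -1
--       6 |  5 |  -1
```

Within an unbroken run, rn and lap_no both grow by 1 per row, so their difference stays constant. Each missing lap makes lap_no jump ahead of rn, which lowers grp by one. A duplicate lap_no would advance rn without advancing lap_no and split a run incorrectly, which is why the uniqueness requirement matters.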

Discussion of Jack's simpler alternative

@Jack's version effectively counts all laps (rows) where the previous lap_no in this race_id had the same car_type. That's simpler and faster and correct, as long as each car_type can only have one sequence per race_id.

But for a task that simple, the query could be simpler still. It would follow logically that all lap_no per (car_type, race_id) must be in sequence, and we could just count the laps:

SELECT race_id, car_type, count(*) AS seq_len
FROM   tbl
GROUP  BY race_id, car_type
ORDER  BY seq_len DESC
LIMIT  1;

If, on the other hand, one car_type can have multiple separate sequences per race_id (and the question does not specify otherwise), Jack's version will fail.

Faster for a given race / car type

In reply to the comment / clarifications in the question: restricting the query to one given (race_id, car_type) will make it much faster, of course:

SELECT count(*) AS seq_len
FROM  (
   SELECT row_number() OVER (ORDER BY lap_no) - lap_no AS grp
   FROM   tbl
   WHERE  race_id = 1
   AND    car_type = 'red'
   ) x
GROUP  BY grp
ORDER  BY seq_len DESC
LIMIT  1;

Index

Key to top performance is a fitting index (except for the mentioned procedural solution working with a single sequential scan). A multicolumn index like this serves best:

CREATE INDEX tbl_mult_idx ON tbl (race_id, car_type, lap_no);

If your table has the UNIQUE constraint I assumed at the top, that is implemented with just this (unique) index internally, and you do not need to create another index.

PostgreSQL – How to Use Column with Hash Based on Other Columns as Unique Index

You could use a unique index instead of the UNIQUE constraint, with coalesce() in the index expressions so that NULL is treated as a regular value for the purpose of uniqueness:

CREATE UNIQUE INDEX ON offer (
       product_id,
       coalesce(price_old,-1),
       coalesce(price,-1),
       coalesce(price_alt,''),
       valid_from,
       valid_to);

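This will enforce your uniqueness as you described, provided the substitute values (-1 and '') can never occur as real data in those columns.

For details on the difference between a UNIQUE constraint and a unique index, see: https://stackoverflow.com/questions/23542794/postgres-unique-constraint-vs-index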
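To illustrate the effect, here is a minimal sketch on a hypothetical two-column table. The name offer_demo and its columns are invented for the example, since the real offer table definition is not shown:

```sql
-- Hypothetical, cut-down stand-in for the offer table
CREATE TABLE offer_demo (
   product_id int NOT NULL,
   price      numeric            -- nullable on purpose
);

-- NULL prices are mapped to -1 inside the index, so they compare as equal
CREATE UNIQUE INDEX ON offer_demo (product_id, coalesce(price, -1));

INSERT INTO offer_demo (product_id, price) VALUES (1, NULL);  -- succeeds
INSERT INTO offer_demo (product_id, price) VALUES (1, 9.99);  -- succeeds: different price
INSERT INTO offer_demo (product_id, price) VALUES (1, NULL);  -- fails with a unique violation
```

With a plain UNIQUE constraint on (product_id, price), the third insert would succeed, because by default NULL values are not considered equal to each other.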