PostgreSQL – How to Select Longest Continuous Sequence?

gaps-and-islandspostgresqlpostgresql-9.0window functions

I am trying to construct a query in PostgreSQL 9.0 that gets the longest sequence of continuous rows for a specific column.

Consider the following table:

lap_id (serial), lap_no (int), car_type (enum), race_id (int FK)

Where lap_no is unique for each (race_id, car_type).

I would like the query to produce the longest sequence for a given race_id and car_type, so it would return an int (or long) that is highest.

With the following data:

1, 1, red, 1
2, 2, red, 1
3, 3, red, 1
4, 4, red, 1
5, 1, blue, 1
6, 5, red, 1
7, 2, blue, 1
8, 1, green, 1

For car_type = red and race_id = 1 the query would return 5 as the longest sequence of the lap_no field.

I found a similar question here however my situation is a bit more straightforward.

(I would also like to know the longest sequence for a given car_type for all races, but was planning to work that out myself.)

Best Answer

Your description results in a table definition like this:

CREATE TABLE tbl (
   lap_id   serial PRIMARY KEY
 , lap_no   int NOT NULL
 , car_type enum NOT NULL
 , race_id  int NOT NULL  -- REFERENCES ...
 , UNIQUE(race_id, car_type, lap_no)
);

General solution for this class of problems

To get the longest sequence (1 result, longest of all, arbitrary pick if there are ties):

SELECT race_id, car_type, count(*) AS seq_len
FROM  (
   SELECT *, count(*) FILTER (WHERE step)
                      OVER (ORDER BY race_id, car_type, lap_no) AS grp
   FROM  (
      SELECT *, (lag(lap_no) OVER (PARTITION BY race_id, car_type ORDER BY lap_no) + 1)
                 IS DISTINCT FROM lap_no AS step
      FROM   tbl
      ) x
   ) y
GROUP  BY race_id, car_type, grp
ORDER  BY seq_len DESC
LIMIT  1;

count(*) FILTER (WHERE step) only counts TRUE (= step to next group), which results in a new number for every new group.

Related question on SO, one answer featuring a procedural solution with plpgsql:

If the top requirement is performance, the plpgsql function is typically faster in this particular case because it can calculate the result in a single scan.

Faster for consecutive numbers

We can capitalize on the fact that consecutive lap_no define a sequence, for a much simpler and faster version:

SELECT race_id, car_type, count(*) AS seq_len
FROM  (
   SELECT race_id, car_type
        , row_number() OVER (PARTITION BY race_id, car_type ORDER BY lap_no) - lap_no AS grp
   FROM   tbl
   ) x
GROUP  BY race_id, car_type, grp
ORDER  BY seq_len DESC
LIMIT  1;

Consecutive laps end up in the same grp. Every missing lap results in a lower grp per partition.

This relies on (race_id, car_type, lap_no) being UNIQUE NOT NULL. NULL values or duplicates could break the logic.

Discussion of Jack's simpler alternative

@Jack's version effectively counts all laps (rows) where the previous lap_no in this race_id had the same car_type. That's simpler and faster and correct - as long as each car_type can only have one sequence per race_id.

But for a task that simple the query could be simpler, yet. It would follow logically that all lap_no per (car_type, race_id) must be in sequence, and we could just count the laps:

SELECT race_id, car_type, count(*) AS seq_len
FROM   tbl
GROUP  BY race_id, car_type
ORDER  BY seq_len DESC
LIMIT  1;

If, on the other hand, one car_type can have multiple separate sequences per race_id (and the question does not specify otherwise), Jack's version will fail.

Faster for a given race / car type

In reply to the comment / clarifications in the question: restricting the query to one given (race_id, car_type) will make it much faster, of course:

SELECT count(*) AS seq_len
FROM  (
   SELECT row_number() OVER (ORDER BY lap_no) - lap_no AS grp
   FROM   tbl
   WHERE  race_id = 1
   AND    car_type = 'red'
   ) x
GROUP  BY grp
ORDER  BY seq_len DESC
LIMIT  1;

db<>fiddle here
Old SQL Fiddle

Index

Key to top performance is a fitting index (except for the mentioned procedural solution working with a single sequential scan). A multicolumn index like this serves best:

CREATE INDEX tbl_mult_idx ON tbl (race_id, car_type, lap_no);

If your table has the UNIQUE constraint I assumed at the top, that is implemented with just this (unique) index internally, and you do not need to create another index.