PostgreSQL – How to Select Longest Continuous Sequence?

gaps-and-islandspostgresqlpostgresql-9.0window functions

I am trying to construct a query in PostgreSQL 9.0 that gets the longest sequence of continuous rows for a specific column.

Consider the following table:

lap_id (serial), lap_no (int), car_type (enum), race_id (int FK)

Where lap_no is unique for each (race_id, car_type).

I would like the query to produce the longest sequence for a given race_id and car_type, so it would return an int (or long) that is highest.

With the following data:

1, 1, red, 1
2, 2, red, 1
3, 3, red, 1
4, 4, red, 1
5, 1, blue, 1
6, 5, red, 1
7, 2, blue, 1
8, 1, green, 1

For car_type = red and race_id = 1 the query would return 5 as the longest sequence of the lap_no field.

I found a similar question here however my situation is a bit more straightforward.

_{(I would also like to know the longest sequence for a given car_type for all races, but was planning to work that out myself.)}

Best Answer

Your description results in a table definition like this:

CREATE TABLE tbl (
   lap_id   serial PRIMARY KEY
 , lap_no   int NOT NULL
 , car_type enum NOT NULL
 , race_id  int NOT NULL  -- REFERENCES ...
 , UNIQUE(race_id, car_type, lap_no)
);

General solution for this class of problems

To get the longest sequence (1 result, longest of all, arbitrary pick if there are ties):

SELECT race_id, car_type, count(*) AS seq_len
FROM  (
   SELECT *, count(*) FILTER (WHERE step)
                      OVER (ORDER BY race_id, car_type, lap_no) AS grp
   FROM  (
      SELECT *, (lag(lap_no) OVER (PARTITION BY race_id, car_type ORDER BY lap_no) + 1)
                 IS DISTINCT FROM lap_no AS step
      FROM   tbl
      ) x
   ) y
GROUP  BY race_id, car_type, grp
ORDER  BY seq_len DESC
LIMIT  1;

count(*) FILTER (WHERE step) only counts TRUE (= step to next group), which results in a new number for every new group.

Related question on SO, one answer featuring a procedural solution with plpgsql:

GROUP BY and aggregate sequential numeric values

If the top requirement is performance, the plpgsql function is typically faster in this particular case because it can calculate the result in a single scan.

Faster for consecutive numbers

We can capitalize on the fact that consecutive lap_no define a sequence, for a much simpler and faster version:

SELECT race_id, car_type, count(*) AS seq_len
FROM  (
   SELECT race_id, car_type
        , row_number() OVER (PARTITION BY race_id, car_type ORDER BY lap_no) - lap_no AS grp
   FROM   tbl
   ) x
GROUP  BY race_id, car_type, grp
ORDER  BY seq_len DESC
LIMIT  1;

Consecutive laps end up in the same grp. Every missing lap results in a lower grp per partition.

This relies on (race_id, car_type, lap_no) being UNIQUE NOT NULL. NULL values or duplicates could break the logic.

Discussion of Jack's simpler alternative

@Jack's version effectively counts all laps (rows) where the previous lap_no in this race_id had the same car_type. That's simpler and faster and correct - as long as each car_type can only have one sequence per race_id.

But for a task that simple the query could be simpler, yet. It would follow logically that all lap_no per (car_type, race_id) must be in sequence, and we could just count the laps:

SELECT race_id, car_type, count(*) AS seq_len
FROM   tbl
GROUP  BY race_id, car_type
ORDER  BY seq_len DESC
LIMIT  1;

If, on the other hand, one car_type can have multiple separate sequences per race_id (and the question does not specify otherwise), Jack's version will fail.

Faster for a given race / car type

In reply to the comment / clarifications in the question: restricting the query to one given (race_id, car_type) will make it much faster, of course:

SELECT count(*) AS seq_len
FROM  (
   SELECT row_number() OVER (ORDER BY lap_no) - lap_no AS grp
   FROM   tbl
   WHERE  race_id = 1
   AND    car_type = 'red'
   ) x
GROUP  BY grp
ORDER  BY seq_len DESC
LIMIT  1;

db<>fiddle here
Old SQL Fiddle

Index

Key to top performance is a fitting index (except for the mentioned procedural solution working with a single sequential scan). A multicolumn index like this serves best:

CREATE INDEX tbl_mult_idx ON tbl (race_id, car_type, lap_no);

If your table has the UNIQUE constraint I assumed at the top, that is implemented with just this (unique) index internally, and you do not need to create another index.

Related Solutions

Postgresql – Efficiently select beginning and end of multiple contiguous ranges in Postgresql query

How about using with recursive

test view:

create view v as 
select *
from ( values ('foo', 2),
              ('foo', 3),
              ('foo', 4),
              ('foo', 10),
              ('foo', 11),
              ('foo', 13),
              ('bar', 1),
              ('bar', 2),
              ('bar', 3)
     ) as baz ("name", "int");

query:

with recursive t("name", "int") as ( select "name", "int", 1 as span from v
                                     union all
                                     select "name", v."int", t.span+1 as span
                                     from v join t using ("name")
                                     where v."int"=t."int"+1 )
select "name", "start", "start"+span-1 as "end", span
from( select "name", ("int"-span+1) as "start", max(span) as span
      from ( select "name", "int", max(span) as span 
             from t
             group by "name", "int" ) z
      group by "name", ("int"-span+1) ) z;

result:

 name | start | end | span
------+-------+-----+------
 foo  |     2 |   4 |    3
 foo  |    13 |  13 |    1
 bar  |     1 |   3 |    3
 foo  |    10 |  11 |    2
(4 rows)

I'd be interested to know how that performs on your billion row table.

Postgresql – How to keep an unique counter per row with PostgreSQL

Assuming you store all revisions of the document in a table, an approach would be to not store the revision number but calculate it based on the number of revisions stored in the table.

It is, essentially, a derived value, not something that you need to store.

A window function can be used to calculate the revision number, something like

row_number() over (partition by document_id order by <change_date>)

and you'll need a column something like change_date to keep track of the order of the revisions.

On the other hand, if you just have revision as a property of the document and it indicates "how many times the document has changed", then I would go for the optimistic locking approach, something like:

update documents
set revision = revision + 1
where document_id = <id> and revision = <old_revision>;

If this updates 0 rows, then there has been intermediate update and you need to inform the user of this.

In general, try to keep your solution as simple as possible. In this case by

avoiding the use of explicit locking functions unless absolutely necessary
having fewer database objects (no per document sequences) and storing fewer attributes (don't store the revision if it can be calculated)
using a single update statement rather than a select followed by an insert or update