I am trying to construct a query in PostgreSQL 9.0 that gets the longest sequence of continuous rows for a specific column.
Consider the following table:
lap_id (serial), lap_no (int), car_type (enum), race_id (int FK)
Where lap_no
is unique for each (race_id, car_type)
.
I would like the query to produce the longest sequence for a given race_id
and car_type
, so it would return an int
(or long) that is highest.
With the following data:
1, 1, red, 1
2, 2, red, 1
3, 3, red, 1
4, 4, red, 1
5, 1, blue, 1
6, 5, red, 1
7, 2, blue, 1
8, 1, green, 1
For car_type = red and race_id = 1
the query would return 5
as the longest sequence of the lap_no
field.
I found a similar question here however my situation is a bit more straightforward.
(I would also like to know the longest sequence for a given car_type
for all races, but was planning to work that out myself.)
Best Answer
Your description results in a table definition like this:
General solution for this class of problems
To get the longest sequence (1 result, longest of all, arbitrary pick if there are ties):
count(*) FILTER (WHERE step)
only countsTRUE
(= step to next group), which results in a new number for every new group.Related question on SO, one answer featuring a procedural solution with plpgsql:
If the top requirement is performance, the plpgsql function is typically faster in this particular case because it can calculate the result in a single scan.
Faster for consecutive numbers
We can capitalize on the fact that consecutive
lap_no
define a sequence, for a much simpler and faster version:Consecutive laps end up in the same
grp
. Every missing lap results in a lowergrp
per partition.This relies on
(race_id, car_type, lap_no)
beingUNIQUE NOT NULL
. NULL values or duplicates could break the logic.Discussion of Jack's simpler alternative
@Jack's version effectively counts all laps (rows) where the previous
lap_no
in thisrace_id
had the samecar_type
. That's simpler and faster and correct - as long as eachcar_type
can only have one sequence perrace_id
.But for a task that simple the query could be simpler, yet. It would follow logically that all
lap_no
per(car_type, race_id)
must be in sequence, and we could just count the laps:If, on the other hand, one
car_type
can have multiple separate sequences per race_id (and the question does not specify otherwise), Jack's version will fail.Faster for a given race / car type
In reply to the comment / clarifications in the question: restricting the query to one given
(race_id, car_type)
will make it much faster, of course:db<>fiddle here
Old SQL Fiddle
Index
Key to top performance is a fitting index (except for the mentioned procedural solution working with a single sequential scan). A multicolumn index like this serves best:
If your table has the
UNIQUE
constraint I assumed at the top, that is implemented with just this (unique) index internally, and you do not need to create another index.