Postgresql – Efficiently select beginning and end of multiple contiguous ranges in Postgresql query

postgresqlquery

I've got about a billion rows of data in a table with a name and an integer in the range 1-288. For a given name, every int is unique, and not every possible integer in the range is present–so there are gaps.

This query generates an example case:

--what I have:
SELECT *
FROM ( VALUES ('foo', 2),
              ('foo', 3),
              ('foo', 4),
              ('foo', 10),
              ('foo', 11),
              ('foo', 13),
              ('bar', 1),
              ('bar', 2),
              ('bar', 3)
     ) AS baz ("name", "int")

I'd like to generate a lookup table with a row for each name and sequence of contiguous integers. Each such row would contain:

name — the value of the name column
start — the first integer in the contiguous sequence
end — the final value in the contiguous sequence
spanend – start + 1

This query generates example output for the above example:

--what I need:
SELECT * 
FROM ( VALUES ('foo', 2, 4, 3),
              ('foo', 10, 11, 2),
              ('foo', 13, 13, 1),
              ('bar', 1, 3, 3)
     ) AS contiguous_ranges ("name", "start", "end", span)

Because I have so many rows, more efficient is better. That said, I only have to run this query once, so it isn't an absolute requirement.

Thanks in advance!

Edit:

I should add that PL/pgSQL solutions are welcome (please explain any Fancy Tricks–I'm still new to PL/pgSQL).

Best Answer

How about using with recursive

test view:

create view v as 
select *
from ( values ('foo', 2),
              ('foo', 3),
              ('foo', 4),
              ('foo', 10),
              ('foo', 11),
              ('foo', 13),
              ('bar', 1),
              ('bar', 2),
              ('bar', 3)
     ) as baz ("name", "int");

query:

with recursive t("name", "int") as ( select "name", "int", 1 as span from v
                                     union all
                                     select "name", v."int", t.span+1 as span
                                     from v join t using ("name")
                                     where v."int"=t."int"+1 )
select "name", "start", "start"+span-1 as "end", span
from( select "name", ("int"-span+1) as "start", max(span) as span
      from ( select "name", "int", max(span) as span 
             from t
             group by "name", "int" ) z
      group by "name", ("int"-span+1) ) z;

result:

 name | start | end | span
------+-------+-----+------
 foo  |     2 |   4 |    3
 foo  |    13 |  13 |    1
 bar  |     1 |   3 |    3
 foo  |    10 |  11 |    2
(4 rows)

I'd be interested to know how that performs on your billion row table.