PostgreSQL – Efficiently select the beginning and end of multiple contiguous ranges in a PostgreSQL query

postgresql query

I've got about a billion rows of data in a table with a name and an integer in the range 1-288. For a given name, every int is unique, but not every integer in the range is present, so there are gaps.

This query generates an example case:

--what I have:
SELECT *
FROM ( VALUES ('foo', 2),
              ('foo', 3),
              ('foo', 4),
              ('foo', 10),
              ('foo', 11),
              ('foo', 13),
              ('bar', 1),
              ('bar', 2),
              ('bar', 3)
     ) AS baz ("name", "int")

I'd like to generate a lookup table with a row for each name and sequence of contiguous integers. Each such row would contain:

name: the value of the name column
start: the first integer in the contiguous sequence
end: the final integer in the contiguous sequence
span: end - start + 1

This query generates example output for the above example:

--what I need:
SELECT * 
FROM ( VALUES ('foo', 2, 4, 3),
              ('foo', 10, 11, 2),
              ('foo', 13, 13, 1),
              ('bar', 1, 3, 3)
     ) AS contiguous_ranges ("name", "start", "end", span)

Because I have so many rows, more efficient is better. That said, I only have to run this query once, so it isn't an absolute requirement.

Thanks in advance!

Edit:

I should add that PL/pgSQL solutions are welcome (please explain any fancy tricks; I'm still new to PL/pgSQL).

Best Answer

How about using WITH RECURSIVE?

test view:

create view v as 
select *
from ( values ('foo', 2),
              ('foo', 3),
              ('foo', 4),
              ('foo', 10),
              ('foo', 11),
              ('foo', 13),
              ('bar', 1),
              ('bar', 2),
              ('bar', 3)
     ) as baz ("name", "int");

query:

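with recursive t("name", "int", span) as (
    -- base case: every row starts as a run of length 1
    select "name", "int", 1 from v
    union all
    -- recursive step: extend a run ending at t."int" by the next integer
    select "name", v."int", t.span + 1
    from v join t using ("name")
    where v."int" = t."int" + 1 )
select "name", "start", "start" + span - 1 as "end", span
from ( -- one row per run: its first integer and its full length
       select "name", ("int" - span + 1) as "start", max(span) as span
       from ( -- longest run ending at each integer
              select "name", "int", max(span) as span
              from t
              group by "name", "int" ) z
       group by "name", ("int" - span + 1) ) z;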

result:

 name | start | end | span
------+-------+-----+------
 foo  |     2 |   4 |    3
 foo  |    13 |  13 |    1
 bar  |     1 |   3 |    3
 foo  |    10 |  11 |    2
(4 rows)

I'd be interested to know how that performs on your billion row table.
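For comparison, here is a sketch of a different technique, often called "gaps and islands", which is not the recursive approach above. Within one name, subtracting row_number() from each integer gives the same value for every member of a contiguous run, so grouping by that difference yields one row per run. The SQL uses only row_number() and plain aggregates, which PostgreSQL also supports. SQLite (3.25 or later) is used here purely as a stand-in so the example runs on its own, and the helper name contiguous_ranges is made up for illustration. I have not benchmarked it against a billion rows.

```python
import sqlite3

# Same sample data as the test view v above.
ROWS = [("foo", 2), ("foo", 3), ("foo", 4), ("foo", 10), ("foo", 11),
        ("foo", 13), ("bar", 1), ("bar", 2), ("bar", 3)]

# Alternative to the recursive CTE: group rows by "int" - row_number().
QUERY = """
select "name",
       min("int") as "start",
       max("int") as "end",
       count(*)   as span
from ( select "name", "int",
              -- consecutive ints within one name share the same difference
              "int" - row_number() over (partition by "name" order by "int") as grp
       from v ) s
group by "name", grp
order by "name", "start"
"""


def contiguous_ranges(rows):
    """Load rows into an in-memory table named v and return one tuple per run."""
    con = sqlite3.connect(":memory:")  # SQLite is only a stand-in for PostgreSQL
    con.execute('create table v ("name" text, "int" integer)')
    con.executemany("insert into v values (?, ?)", rows)
    result = con.execute(QUERY).fetchall()
    con.close()
    return result


if __name__ == "__main__":
    for row in contiguous_ranges(ROWS):
        print(row)
```

Each row is read once and sorted per name, which might scale better than a recursive self-join on a large table, but that is an assumption to verify with EXPLAIN ANALYZE on real data.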

Related Solutions

MySQL – How to select the latest record having one state where no later records exist with any other state

SELECT widget, MAX(`timestamp`) AS ts
FROM tableX AS t
WHERE state = 'down'
GROUP BY widget
HAVING NOT EXISTS
       ( SELECT *
         FROM tableX AS tt
         WHERE tt.widget = t.widget
           AND tt.state <> 'down'
           AND tt.`timestamp` > MAX(t.`timestamp`)
       ) ;

I think that you'll need two indices, one on (widget, state, timestamp) and one on (widget, timestamp, state) for efficiency.

This also works, and it needs only one index, on (widget, timestamp, state):

SELECT t.widget, t.`timestamp`
FROM 
        tableX AS t
    JOIN
        ( SELECT widget, MAX(`timestamp`) AS ts
          FROM tableX
          GROUP BY widget
        ) AS tm
            ON  tm.widget = t.widget
            AND tm.ts = t.`timestamp`
WHERE t.state = 'down' ;

Both queries were tested at SQL-Fiddle.
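To see what the second query returns, here is a small sketch that runs it against made-up data. The sample rows, the index name tableX_wts, and the helper widgets_currently_down are assumptions for illustration. SQLite stands in for MySQL only so the example is self-contained (it accepts the backtick quoting used above).

```python
import sqlite3

# The join-based query from above, unchanged apart from layout.
LATEST_DOWN = """
SELECT t.widget, t.`timestamp`
FROM tableX AS t
JOIN ( SELECT widget, MAX(`timestamp`) AS ts
       FROM tableX
       GROUP BY widget ) AS tm
  ON  tm.widget = t.widget
  AND tm.ts = t.`timestamp`
WHERE t.state = 'down'
"""

# Hypothetical sample: A ends 'down', B recovers, C is 'down' twice in a row.
SAMPLE = [("A", "up", 1), ("A", "down", 2),
          ("B", "down", 1), ("B", "up", 2),
          ("C", "down", 1), ("C", "down", 3)]


def widgets_currently_down(rows):
    """Return (widget, timestamp) for widgets whose latest record is 'down'."""
    con = sqlite3.connect(":memory:")  # stand-in for MySQL
    con.execute("CREATE TABLE tableX (widget TEXT, state TEXT, `timestamp` INTEGER)")
    # The single index suggested above.
    con.execute("CREATE INDEX tableX_wts ON tableX (widget, `timestamp`, state)")
    con.executemany("INSERT INTO tableX VALUES (?, ?, ?)", rows)
    result = sorted(con.execute(LATEST_DOWN).fetchall())
    con.close()
    return result


if __name__ == "__main__":
    print(widgets_currently_down(SAMPLE))
```

Widget B is excluded because its latest record has a different state, which is the condition the question asks for.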

PostgreSQL – How to implement a sequence for each foreign key value

How about adding a column called max_customer_id to the account table and updating it every time a new customer is added? That is the simplest solution I can think of.

With this solution you have to take care of concurrency. Take a look at the document PostgreSQL Concurrency issues:

If two concurrent transactions try to insert rows having the same key value, then the second one will block until the first one finishes. If the first transaction commits, the second one must abort because of the uniqueness constraint; but if the first one aborts the second one can proceed.

If you want to assign a serial number to each row in a table, you might try
INSERT INTO mytable (id, ...) VALUES( (SELECT MAX(id) + 1 FROM mytable), ...);
This will not work safely unless you take an explicit lock on the whole table, which will prevent concurrent insertions. (It’ll also be quite slow, because MAX scans the whole table in PostgreSQL.)

Alternatively, you could keep the value in a separate table containing the account_id and the maximum account_specific_id in use for that account. That way you could handle concurrency more easily with UPDATEs. From the same source:

A variant is to use a single-row table to hold the next ID number to assign:
SELECT next FROM mytable_counter FOR UPDATE;
UPDATE mytable_counter SET next = $next + 1;
INSERT INTO mytable (id, ...) VALUES($next, ...);
This works (as long as you use FOR UPDATE), but you still have the problem that only one insertion transaction can proceed at a time. The implicit write lock on the counter row is the bottleneck.
[The author goes on recommending sequences.]

The sequence option, however, is not recommended in this case, because new sequences are not meant to be created routinely in a production environment.
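A minimal sketch of the separate counter table idea, assuming a hypothetical table account_counter(account_id, max_id) and a made-up helper next_customer_id. SQLite stands in for PostgreSQL so the example runs on its own. In PostgreSQL the UPDATE takes a row lock that is held until the transaction ends, so concurrent callers for the same account queue up behind it, and a single UPDATE ... RETURNING could replace the UPDATE plus SELECT shown here.

```python
import sqlite3


def open_counter_db():
    """Create an in-memory database holding the hypothetical counter table."""
    con = sqlite3.connect(":memory:")  # stand-in for PostgreSQL
    con.execute(
        "CREATE TABLE account_counter ("
        " account_id INTEGER PRIMARY KEY,"
        " max_id INTEGER NOT NULL)"
    )
    return con


def next_customer_id(con, account_id):
    """Reserve and return the next per-account id inside one transaction."""
    with con:  # commits on success, rolls back if an exception is raised
        cur = con.execute(
            "UPDATE account_counter SET max_id = max_id + 1 WHERE account_id = ?",
            (account_id,),
        )
        if cur.rowcount == 0:
            # First id for this account: start its counter at 1.
            con.execute(
                "INSERT INTO account_counter (account_id, max_id) VALUES (?, 1)",
                (account_id,),
            )
        (value,) = con.execute(
            "SELECT max_id FROM account_counter WHERE account_id = ?",
            (account_id,),
        ).fetchone()
    return value


if __name__ == "__main__":
    db = open_counter_db()
    print([next_customer_id(db, 1) for _ in range(3)])
    print(next_customer_id(db, 2))
```

The row that uses the reserved id should be inserted in the same transaction as the counter update. Otherwise a rollback of the insert leaves a gap in the numbering.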
