Chaining Together Entries into Groups in SQLite

sqlite

I have a table of items pairs, which links one item with another. I'd like to divide the items into groups, so that any items which are linked together, even indirectly, are in the same group. So:

person1| person2
-------+--------
john   | george
george | paul
george | ringo
ringo  | paul
brian  | mick
brian  | keith
keith  | bill
mick   | charlie

For the purposes of this, assume the table schema is just:

create table pairs 
  ( person1 varchar not null,
    person2 varchar not null,
    primary key (item1, item2)
  ) ;
create table p2g (person varchar not null, groupid int);
create table group (groupid integer primary key)

It's obviously more complex than that.

So john is linked to george, and george is linked to paul and ringo, and ringo is linked to paul; these are a self-contained group. (Note that the ringo<->paul link is actually redundant here; we don't need it to make the group, but it's present anyway.) Similarly, brian/mick/keith/bill/charlie are a group. People are unique (we're using varchar names here to make the explanation clearer, but people are actually identified by a unique ID). So what I want is some way of pulling all these together, so I'm able to create output that looks like

person  | group
--------+------
john    | 1
george  | 1
paul    | 1
ringo   | 1
brian   | 2
charlie | 2
keith   | 2
mick    | 2
bill    | 2

It is obviously possible to walk through the pairs table line-by-line (and that's what I'm doing), but this is pretty slow with a lot of pairs. The algorithm I'm using is this:

 for person1, person2 in pairs:
    group1 = db.exec("select groupid from p2g where person=:person1")
    group2 = db.exec("select groupid from p2g where person=:person2")
    if not group1 and not group2:
        # invent a new group; add person1 and person2 to it
        newgroupid = db.exec("insert into group (null)")
        db.exec("insert into p2g (:newgroupid, :person1)")
        db.exec("insert into p2g (:newgroupid, :person2)")
    else if not group1:
        # add person1 to group2
        db.exec("insert into p2g (:group2, :person1)")
    else if not group2:
        # add person2 to group1
        db.exec("insert into p2g (:group1, :person2)")
    else if group1 == group2:
        nothing to do here, everything's already OK
    else:
        # they're in two different groups. 
        # Put everyone in group2 into group1
        # and delete group2
        db.exec("update p2g set groupid=:group1 where groupid=:group2")
        db.exec("delete from group where groupid=:group2")

but I feel like there's a more SQLish way to do this in batches or all in one go, rather than once per row. Is there? (I'm using SQLite, if that makes a difference.)

I've tried using recursive CTEs for this, which sorta works but sorta doesn't. This single query:

with bands (member, leader) as (
    select item1, item1 from pairs 
    union all 
    select p.item2, b.leader 
        from pairs p 
        inner join bands b on b.member=p.item1
) select distinct 'group:'||leader, member from bands order by leader

gives as output

group:john  | john
group:john  | george
group:john  | paul
group:john  | ringo
group:mick  | mick
group:mick  | charlie
group:mick  | keith
group:paul  | paul
group:paul  | ringo
group:paul  | george
group:ringo | ringo
group:ringo | george

which is almost right: the john and mick groups are correct! But it also contains other "groups", started from other members, and I don't know how to not include those. This seems like the right approach, but it isn't quite there. (Additionally, if the pairs table also contains "ringo -> john", i.e., the chain of people is circular, then the above query runs forever, which is also not good.)

Best Answer

The output of your CTE is inconsistent: john, paul, and ringo have different numbers of band members. You should go through the pairs in both directions (and then use UNION to prevent duplicates and infinite loops):

with bands (member, leader) as (
    select item1, item2 from pairs
    union
    select item2, item1 from pairs
    union
    select ... -- recursion
) ...

Now you have all possible combinations of leaders and band members, but you want only one (random but unique) leader per band. So for all rows with the same band member, choose the one with the smallest leader:

with bands as (...)
select 'group:'||min(leader), member
from bands
group by member
order by min(leader);