PostgreSQL – Joining Multiple Subqueries

join;postgresqlself-join

I need a query that returns something in this format:

pk  id  v1 v2 v3 v4 v5
----------------------
...
pk1 id1 A1 A2 A3 A4 A5
pk6 id2 B1 B2 B3 B4 B5
...

My current data looks like this

CREATE TABLE foo(pk,id,value)
AS VALUES
    ( 'pk1' , 'id1',  'A1' ),
    ( 'pk2' , 'id1',  'A2' ),
    ( 'pk3' , 'id1',  'A3' ),
    ( 'pk4' , 'id1',  'A4' ),
    ( 'pk5' , 'id1',  'A5' ),
    ( 'pk6' , 'id2',  'B1' ),
    ( 'pk7' , 'id2',  'B2' ),
    ( 'pk8' , 'id2',  'B3' ),
    ( 'pk9' , 'id2',  'B4' ),
    ( 'pk10', 'id2',  'B5' )
;

The source table has ~160,000,000 rows. All columns are indexed (btree).

Currently I'm doing 5 (the real number is dynamic and maybe around 30, but for example above it's 5) different queries to extract:

1:

P1 id1 A1
P6 id2 B1

2:

P2 id1 A2
P7 id2 B2

3:

P3 id1 A3
P8 id2 B4

etc

and then join these queries. Something like this:

SELECT q1.pk,q1.id,q1.v1,q2.v2,q3.v3,q4.v4,q5.v5 FROM (SELECT pk, id, value FROM table WHERE id=1) AS q1, (SELECT pk, id, value FROM table WHERE id=2) AS q2, (SELECT pk, id, value FROM table WHERE id=3) AS q3, (SELECT pk, id, value FROM table WHERE id=4) AS q4, (SELECT pk, id, value FROM table WHERE id=5) AS q5 WHERE q1.id=q2.id and q2.id=q3.id and q3.id=q4.id and q4.id=q5.id;

But as I mentioned above number of joins are much more (~30) and it runs really slow on Postgres. I examined the planning and disabled nestloop and it got much faster but it's still too slow. If I fetch all these queries to memory and join them programmatically (using python for example) it runs in ~1 second. But Postgres takes too long (~30 seconds for joining 110 rows).
Any ideas? I think there must be a better way than making 30 subqueries and joining them. Another approach would be grouping rows based on id column and do some magic (for example with RowNumber()) to generate the desired table.

P.S. It's not applicable to pre-generate the result table. Everything must happen runtime.

Best Answer

For your example, all you want is

SELECT
  min(pk),
  id,
  array_agg(value ORDER BY value)
FROM foo
GROUP BY id;

 min  | id  |    array_agg     
------+-----+------------------
 pk1  | id1 | {A1,A2,A3,A4,A5}
 pk10 | id2 | {B1,B2,B3,B4,B5}
(2 rows)

If you need the array_agg to be unwrapped, you can do,

SELECT pk, id, a[1] AS v1, a[2] AS v2, a[3] AS v3, a[4] AS v4, a[5] AS v5
FROM (
  SELECT
    min(pk),
    id,
    array_agg(value ORDER BY value)
  FROM foo
  GROUP BY id
) AS t(pk, id, a);

  pk  | id  | v1 | v2 | v3 | v4 | v5 
------+-----+----+----+----+----+----
 pk1  | id1 | A1 | A2 | A3 | A4 | A5
 pk10 | id2 | B1 | B2 | B3 | B4 | B5

Note, in all future examples please don't use id1 and pk1. You're sample values should preferably be just 1 and 2. As you can see here, we're struggling with getting pk10 instead of pk6 and that's a problem only because of the sort order.