Postgresql – Selecting union of results of dataset-producing function when applied element-wise to an array

ctemany-to-manyoptimizationpostgresqlpsql

This problem is somewhat an academic curiosity. I'm attempting to implement a SQL (i.e. not plpgSQL) function which takes an array of input data, transforms each entry to a set of zero or more rows using some other black-box function, then returns the concatenation or the union of those resulting datasets.

I have a function which takes a value and returns zero or more records:

f: x -> y[]

I have another function which should take an array of x values as its input, and should apply f to each element, returning the union of all the returned record sets:

g: x[] -> y[], returning union or concatenation of { f(x) for each x in x[] }

Initially, I don't care whether the resulting set contains duplicates although it would be preferable if it did not.

I considered using a "recursive" CTE for this to iterate over the array. Using the typical CTE example with tree structures:

create table node(id int primary key, parent_id int);

insert into node values                                                                                                                                                          
    (0, null),                                                                                                                                                                   
    (1, 0),                                                                                                                                                                      
    (2, 0),                                                                                                                                                                      
    (3, 2),                                                                                                                                                                      
    (4, 0),                                                                                                                                                                      
    (5, 1),                                                                                                                                                                      
    (6, 1),                                                                                                                                                                      
    (7, 1),                                                                                                                                                                      
    (8, 2),                                                                                                                                                                      
    (9, 3),                                                                                                                                                                      
    (10, 3);                                                                                                                                                                     

create function f(p_id int)                                                                                                                                                      
returns table (id int, parent_id int)                                                                                                                                            
as $$                                                                                                                                                                            
    select * from node where node.parent_id = p_id;                                                                                                                              
$$ language sql stable;                                                                                                                                                          

create function g(p_ids int[])                                                                                                                                                   
returns table (id int, parent_id int, x int)                                                                                                                                     
as $$                                                                                                                                                                            
    with recursive res(id, parent_id, i) as (                                                                                                                                    
        select null::int, null::int, array_lower(p_ids, 1)                                                                                                                       
        union all                                                                                                                                                                
        select tmp.id, tmp.parent_id, res.i + 1                                                                                                                                  
        from res, f(p_ids[res.i]) as tmp                                                                                                                                         
        where res.i <= array_upper(p_ids, 1)                                                                                                                                     
    ) select res.* from res where id is not null;                                                                                                                                
$$ language sql stable;                                                                                                                                                          

select * from f(2);                                                                                                                                                              

select * from g(ARRAY[2, 1]);  

This works, and I can control whether or not I want duplicates via union/union all, but I'd assume that the explicit iteration would present an optimisation barrier, which could be quite bad if data was large, the array was long, and function f(x) was something stupidly simple such as:

select a, b, c from node where node.id = x

In which case the entire query could be optimised to be equivalent to a simple:

select a, b, c from node where node.id = any (xs)

But presumably couldn't be due to the planner being unable to realise that our iterative CTE part could be translated to a set operation.

I decided to not be lazy and to test this:

create function f2(p_ids int[])                                                                                                                                                  
returns table (id int, parent_id int)                                                                                                                                            
as $$                                                                                                                                                                            
    select * from node where node.parent_id = ANY ( p_ids );                                                                                                                     
$$ language sql stable;

Then with possibly my favourite feature of Postgres:

explain analyze select * from f(2);

 Seq Scan on node  (cost=0.00..38.25 rows=11 width=8) (actual time=0.003..0.004 rows=2 loops=1)
   Filter: (parent_id = 2)
   Rows Removed by Filter: 9
 Planning time: 0.051 ms
 Execution time: 0.012 ms
(5 rows)

explain analyze select * from f2(ARRAY[2, 1]);

 Seq Scan on node  (cost=0.00..38.25 rows=23 width=8) (actual time=0.004..0.005 rows=5 loops=1)
   Filter: (parent_id = ANY ('{2,1}'::integer[]))
   Rows Removed by Filter: 6
 Planning time: 0.055 ms
 Execution time: 0.013 ms
(5 rows)

explain analyze select * from g(ARRAY[2, 1]);

 CTE Scan on res  (cost=424.46..431.28 rows=339 width=12) (actual time=0.026..0.046 rows=8 loops=1)
   Filter: (id IS NOT NULL)
   Rows Removed by Filter: 1
   CTE res
     ->  Recursive Union  (cost=0.00..424.46 rows=341 width=12) (actual time=0.002..0.042 rows=9 loops=1)
           ->  Result  (cost=0.00..0.01 rows=1 width=12) (actual time=0.000..0.000 rows=1 loops=1)
           ->  Hash Join  (cost=0.26..41.76 rows=34 width=12) (actual time=0.010..0.011 rows=3 loops=3)
                 Hash Cond: (node.parent_id = ('{2,1}'::integer[])[res_1.i])
                 ->  Seq Scan on node  (cost=0.00..32.60 rows=2260 width=8) (actual time=0.002..0.003 rows=11 loops=2)
                 ->  Hash  (cost=0.22..0.22 rows=3 width=4) (actual time=0.002..0.002 rows=1 loops=3)
                       Buckets: 1024  Batches: 1  Memory Usage: 8kB
                       ->  WorkTable Scan on res res_1  (cost=0.00..0.22 rows=3 width=4) (actual time=0.001..0.001 rows=1 loops=3)
                             Filter: (i <= 2)
                             Rows Removed by Filter: 2
 Planning time: 0.255 ms
 Execution time: 0.083 ms
(16 rows)

Is there a better way to implement this select-many, ideally a purely set-based method which would not present optimisation barriers?

I'm aware that an index on parent_id would help if the table were much larger however I'm more interested in how the iterative query could be expressed differently to improve performance.

Best Answer

This can be solved via unnest or by modifying the original array-returning function to return a table/setof instead.

The function application can then be done within a join.