Postgresql – Allow for dynamic result sets from plpgsql function by filtering based on an array passed in

aggregatearrayplpgsqlpostgresqlpostgresql-10

I think I've avoided the XY problem here, as I'm laying out my solutions to the real underlying problem (summarizing multiple tables in a dynamic fashion) and I'm only asking about the one final part where I got stuck. Therefore, there is a fair bit of background first. I've provided a minimal example dataset, and working code for summarizing the data in the fashion I describe.

Consider a setup as follows:

create temp table tbl1 (id int primary key, category text, passing boolean);

insert into tbl1 values
(1, 'A', 't'),
(2, 'A', 't'),
(3, 'A', 't'),
(4, 'A', 'f'),
(5, 'B', 't'),
(6, 'B', 'f'),
(7, 'C', 't'),
(8, 'C', 't'),
(9, 'C', 'f'),
(10, 'C', 'f'),
(11, 'C', 'f'),
(12, 'C', 'f'),
(13, 'B', 't'),
(14, 'B', 'f'),
(15, 'B', 't'),
(16, 'B', 'f'),
(17, 'B', 't'),
(18, 'B', 'f'),
(19, 'B', 't'),
(20, 'B', 'f');

Then I can produce the following summary:

postgres=> select category, passing, count(*) from tbl1 group by category, passing order by category, passing;
 category | passing | count
----------+---------+-------
 A        | f       |     1
 A        | t       |     3
 B        | f       |     5
 B        | t       |     5
 C        | f       |     4
 C        | t       |     2
(6 rows)

However, I have multiple such tables (all using the same categories A, B, C) that I want to summarize, so the final result I want to have displayed needs to be just a single row to summarize one table, like so:

 Table Name | Overall passing rate | A passing rate | B passing rate | C passing rate
------------+----------------------+----------------+----------------+----------------
 tbl1       | 50% (10/20)          | 75% (3/4)      | 50% (5/10)     | 33% (2/6)

I also need to be able to filter sometimes, such as only returning information about categories A and B and ignoring C, like so:

 Table Name | Overall passing rate | A passing rate | B passing rate
------------+----------------------+----------------+----------------
 tbl1       | 57% (8/14)           | 75% (3/4)      | 50% (5/10)

I can produce the first output shown above with a query using the count(*) filter (where...) syntax in a somewhat kludgy CTE, like so:

with tallies as (
select
count(*) filter (where category in ('A', 'B', 'C') and passing) as abc_pass,
count(*) filter (where category in ('A', 'B', 'C')) as abc_all,
count(*) filter (where category = 'A' and passing) as a_pass,
count(*) filter (where category = 'A') as a_all,
count(*) filter (where category = 'B' and passing) as b_pass,
count(*) filter (where category = 'B') as b_all,
count(*) filter (where category = 'C' and passing) as c_pass,
count(*) filter (where category = 'C') as c_all
from tbl1
)
select 'tbl1' as "Table Name",
format('%s%% (%s/%s)', 100*abc_pass/abc_all, abc_pass, abc_all) as "Overall passing rate",
format('%s%% (%s/%s)', 100*a_pass/a_all, a_pass, a_all) as "A passing rate",
format('%s%% (%s/%s)', 100*b_pass/b_all, b_pass, b_all) as "B passing rate",
format('%s%% (%s/%s)', 100*c_pass/c_all, c_pass, c_all) as "C passing rate"
from tallies;

And I can modify this to omit category C without much difficulty, to produce the second example output above. (Not showing that here as it would be mostly repetitive.)

The trouble is, with so many tables to summarize (actually views, not tables, but that doesn't matter) and with the requirement that I be able to easily summarize any group of the tables ad hoc, and include or omit categories at will (e.g. "summarize tbl1, tbl2 and tbl3, but only categories B and C", or "summarize just category B for all tables") the above SQL isn't flexible enough.

I can accomplish the "summarize any group of tables ad hoc" requirement with a plpgsql function that accepts an arbitrary number of arguments of type name, and feed it the names of all the tables I want to summarize, like so:

create function summarize_tables(variadic tbls name[])
returns table ("Table Name" text, "Overall pass rate" text, "A passing rate" text, "B passing rate" text, "C passing rate" text)
language plpgsql
as $funcdef$
declare
  tbl name;
begin
  foreach tbl in array tbls
  loop
    return query execute
      format(
        $query$
          with tallies as (
            select
              count(*) filter (where category in ('A', 'B', 'C') and passing) as abc_pass,
              count(*) filter (where category in ('A', 'B', 'C')) as abc_all,
              count(*) filter (where category = 'A' and passing) as a_pass,
              count(*) filter (where category = 'A') as a_all,
              count(*) filter (where category = 'B' and passing) as b_pass,
              count(*) filter (where category = 'B') as b_all,
              count(*) filter (where category = 'C' and passing) as c_pass,
              count(*) filter (where category = 'C') as c_all
            from %I
          )
          select
            %L as "Table Name",
            format('%%s%%%% (%%s/%%s)', 100*abc_pass/abc_all, abc_pass, abc_all) as "Overall passing rate",
            format('%%s%%%% (%%s/%%s)', 100*a_pass/a_all, a_pass, a_all) as "A passing rate",
            format('%%s%%%% (%%s/%%s)', 100*b_pass/b_all, b_pass, b_all) as "B passing rate",
            format('%%s%%%% (%%s/%%s)', 100*c_pass/c_all, c_pass, c_all) as "C passing rate"
          from tallies;
        $query$,
        tbl,
        tbl
      );
  end loop;
  return;
end
$funcdef$
;

This can be called with select * from summarize_tables('tbl1'); to summarize the example dataset above, or select * from summarize_tables('tbl1', 'tbl2'); to summarize additional tables.

However, this doesn't accomplish the second requirement at all – that I be able to compute different result columns to include or exclude A, B or C arbitrarily.

I thought perhaps there would be a way to do this with a function signature that would look like so:

create function summarize_tables(categories text[], variadic tbls name[])

And then call it like so:

select * from summarize_tables('{A,B}', 'tbl1', 'tbl2');

But I can't figure out how I could make use of the "categories" array from within my SQL. Is this even possible, to summarize the results in a filtered fashion like this according to the categories passed in?

On a related note, I found https://stackoverflow.com/a/11751557/5419599 so I'm aware that if I want truly dynamic columns to be returned, I'll have to use returns setof record and I'll have to specify the full names and types of the columns to be returned every time I call the function. I would be interested in a workaround for that if there is one.

Possibly the combination of these two factors means that I should just accept that I'll have to have a separate function for each combination of categories A, B, and C that I want to summarize – seven functions total.

But in that case, woe betide me if a category D and category E are added later!

That combinatorial possibility makes me think it could be worth it to have to specify the return column names and types each time I call the function, as a price to pay for only needing to have a single function. In other words, change the returns table (...) in the function definition to returns setof record and then change the call from select * from summarize_tables(...); to:

select * from summarize_tables('{A,C,D}', ...)
as x ("Table Name" text, "Overall pass rate" text, "A passing rate" text, "C passing rate" text, "D passing rate" text)
;

However, this trade-off won't even be possible unless there is a way to make the filtering more dynamic than it is in the current CTE – i.e. a way to make use of a categories text[] parameter passed in. That's what my question is.

(Any pointers regarding the above design would also be welcome, though.)

_{For purposes of this question, I have omitted handling for null "passing" values, which would be handled by changing "where passing" to "where passing is true" – and have omitted case switches to avoid divide by zero errors in case some particular table doesn't contain a particular category.}

Best Answer

I worked out how to do it using unnest(...) with ordinality and returning an array as part of the results.

Here is the function definition:

create function summarize_tables(categories text[], variadic tbls name[])
returns table (tbl name, overall text, by_category text[])
language plpgsql
as $funcdef$
  begin
    foreach tbl in array tbls
    loop
      return query execute format(
        $query$
          with tallies as (
            select
              category,
              count(*) filter (where passing) as passcount,
              count(*) as allcount from %I group by category
          ),
          categories_passed as (
            select * from unnest(%L::text[]) with ordinality as x(category, rn)
          )
          select
            %1$L::name as tbl,
            format('%%s%%%% (%%s/%%s)', (sum(passcount)*100/sum(allcount))::int, sum(passcount), sum(allcount)) as overall,
            array_agg(format('%%s%%%% (%%s/%%s)', passcount*100/allcount, passcount, allcount) order by rn) as by_category
          from categories_passed natural left join tallies;
        $query$,
        tbl,
        categories
      );
    end loop;
    return;
  end
$funcdef$
;

The raw results (with select *) look like:

postgres=> select * from summarize_tables('{A,B,C}', 'tbl1');
 tbl  |   overall   |              by_category
------+-------------+----------------------------------------
 tbl1 | 50% (10/20) | {"75% (3/4)","50% (5/10)","33% (2/6)"}
(1 row)

postgres=> select * from summarize_tables('{A,B}', 'tbl1');
 tbl  |  overall   |        by_category
------+------------+----------------------------
 tbl1 | 57% (8/14) | {"75% (3/4)","50% (5/10)"}
(1 row)

Note that the results that are broken down by category are not sorted alphabetically, but kept in the same order passed in:

postgres=> select * from summarize_tables('{B,A}', 'tbl1');
 tbl  |  overall   |        by_category
------+------------+----------------------------
 tbl1 | 57% (8/14) | {"50% (5/10)","75% (3/4)"}
(1 row)

To get the exact results shown in the question requires column names to be given and the results to be pulled out of the array:

select
  tbl as "Table Name",
  overall as "Overall passing rate",
  by_category[1] as "A passing rate",
  by_category[2] as "B passing rate",
  by_category[3] as "C passing rate"
from summarize_tables('{A,B,C}', 'tbl1');

 Table Name | Overall passing rate | A passing rate | B passing rate | C passing rate
------------+----------------------+----------------+----------------+----------------
 tbl1       | 50% (10/20)          | 75% (3/4)      | 50% (5/10)     | 33% (2/6)
(1 row)

select
  tbl as "Table Name",
  overall as "Overall passing rate",
  by_category[1] as "A passing rate",
  by_category[2] as "B passing rate"
from summarize_tables('{A,B}', 'tbl1');

 Table Name | Overall passing rate | A passing rate | B passing rate
------------+----------------------+----------------+----------------
 tbl1       | 57% (8/14)           | 75% (3/4)      | 50% (5/10)
(1 row)

Not called for in the question, but if it's wanted for the caller of the function to have access to the raw numbers (e.g. for sorting or numeric conditions), then nested arrays can be returned instead, and the formatting can be done by the caller:

create function summarize_tables(categories text[], variadic tbls name[])
returns table (tbl name, overall numeric[], by_category numeric[][])
language plpgsql
as $funcdef$
  begin
    foreach tbl in array tbls
    loop
      return query execute format(
        $query$
          with tallies as (
            select
              category,
              count(*) filter (where passing)::numeric as passcount,
              count(*)::numeric as allcount from %I group by category
          ),
          categories_passed as (
            select * from unnest(%L::text[]) with ordinality as x(category, rn)
          )
          select
            %1$L::name as tbl,
            array[sum(passcount), sum(allcount)] as overall,
            array_agg(array[passcount, allcount] order by rn) as by_category
          from categories_passed natural left join tallies;
        $query$,
        tbl,
        categories
      );
    end loop;
    return;
  end
$funcdef$
;

The SQL to pull out the values from the nested arrays isn't pretty, but it works:

with x as (
  select
    tbl,
    overall as o,
    by_category[1:1] as a,
    by_category[2:2] as b,
    by_category[3:3] as c
  from summarize_tables('{A,B,C}', 'tbl1')
)
select
  tbl,
  format('%s%% (%s/%s)', (100*x.o[1]/x.o[2])::int, x.o[1], x.o[2]) as "Overall passing rate",
  format('%s%% (%s/%s)', (100*x.a[1][1]/x.a[1][2])::int, x.a[1][1], x.a[1][2]) as "A passing rate",
  format('%s%% (%s/%s)', (100*x.b[1][1]/x.b[1][2])::int, x.b[1][1], x.b[1][2]) as "B passing rate",
  format('%s%% (%s/%s)', (100*x.c[1][1]/x.c[1][2])::int, x.c[1][1], x.c[1][2]) as "C passing rate"
from x
;

Related reading about how Postgres arrays work: https://stackoverflow.com/a/34408253/5419599

Best Answer

Related Solutions

Postgresql – Make IN clause behave like AND clause

Postgresql – Returning setof derived from for loop in plpgsql

Related Question