Postgresql – Create 2D truth table from “cross product” of two tables via PostgreSQL view or function

pivotpostgresqlpostgresql-9.6

I have a working Excel-based method for creating a truth table from two vectors exported from a PostgreSQL database. The process takes about 4 hours to complete due to a large number of VLOOKUP and COUNTIFS operations, so I am looking for a method of implementing this as a view directly in the database.

The input vectors are produced from two existing views in my database, which have no foreign keys.

To make this question and solution as generic as possible, I've created a parallel problem using two simple tables with sample data to cover all possible cases:

CREATE TABLE group_membership
(
  member character varying(6) NOT NULL,
  group_name character varying(64) NOT NULL
);

INSERT INTO group_membership VALUES ('000001','A');
INSERT INTO group_membership VALUES ('000001','B');
INSERT INTO group_membership VALUES ('000001','B'); -- A value may occur more than once.
INSERT INTO group_membership VALUES ('000001','D'); -- A value may not necessarily have a corresponding row in the group table.
INSERT INTO group_membership VALUES ('000001','D');

INSERT INTO group_membership VALUES ('000002','B');
INSERT INTO group_membership VALUES ('000002','C');
INSERT INTO group_membership VALUES ('000002','E');

INSERT INTO group_membership VALUES ('000003','A');
INSERT INTO group_membership VALUES ('000003','C');

INSERT INTO group_membership VALUES ('000004','D');
INSERT INTO group_membership VALUES ('000004','E');

CREATE TABLE groups
(
  name character varying(64) NOT NULL
);

INSERT INTO groups VALUES ('A');
INSERT INTO groups VALUES ('B');
INSERT INTO groups VALUES ('C');
INSERT INTO groups VALUES ('C'); -- A value may occur more than once.
INSERT INTO groups VALUES ('Z');
-- 'D' and 'E' not present in this table

There are no relations between these two tables.

I am trying to construct a view that will create a binary truth table (matrix) like this:

member A B C Z
000001 t t f f
000002 f t t f
000003 t f t f
000004 f f f f

Where the first column are the distinct members from the group_membership table, and the subsequent columns show the presence or absence of the member in only the groups defined in the group table. The resulting table should be boolean only (TRUE if the member appears in a tuple with the group at least once, FALSE otherwise).

For instance, some specific "cells" in the above table would conform to the following:

SELECT COUNT(*) > 0 AS value FROM group_membership WHERE group_name='A' AND member='000001';
 value
-------
 t
(1 row)

SELECT COUNT(*) > 0 AS value FROM group_membership WHERE group_name='Z' AND member='000001';
 value
-------
 f
(1 row)

And to create the second column (the 'A' column):

SELECT COUNT(*) > 0 AS A FROM group_membership WHERE group_name='A' AND member='000001'
 UNION ALL
SELECT COUNT(*) > 0 AS A FROM group_membership WHERE group_name='A' AND member='000002'
 UNION ALL
SELECT COUNT(*) > 0 AS A FROM group_membership WHERE group_name='A' AND member='000003'
 UNION ALL
SELECT COUNT(*) > 0 AS A FROM group_membership WHERE group_name='A' AND member='000004'
;

Even better would be something like this (1 and 0 instead of TRUE and FALSE):

member A B C Z
000001 1 1 0 0
000002 0 1 1 0
000003 1 0 1 0
000004 0 0 0 0

Where the query for each of the individual "cells" could be of the form:

SELECT CASE WHEN COUNT(*) > 0 THEN 1 ELSE 0 END FROM group_membership WHERE group_name='A' AND member='000001';

My group_membership table has about 50,000 rows, and my group table has about 200 rows.


Note: If you do something like the following to ignore groups that are not in common among the two tables, you will end up eliminating rows like 000004 in the above sample result sets, which is not what I am looking for (member 000004 and group Z should be present in the result set):

SELECT * FROM group_membership WHERE group_name IN (SELECT DISTINCT(name) FROM groups);

As a first stab at solving this problem, I am looking into creating a FUNCTION that relies on a recursive JOIN over the group table to construct the result table.

Update: A FUNCTION requires a RETURNS TABLE definition, which looks like it not be a workable solution, given the variable number of columns in the result set. Some additional ideas I have are to create a function that performs a series of UNIONs over one dimension, and is then wrapped with a view that performs a UNION over a crosstab() of the results over SELECT DISTINCT(name) FROM groups ORDER BY name ASC;

Best Answer

It seems you basically want this, without writing this:

SELECT member
      ,bool_or(group_name='A')::int as "A"
      ,bool_or(group_name='B')::int as "B"
      ,bool_or(group_name='C')::int as "C"
      ,bool_or(group_name='Z')::int as "Z" 
  FROM group_membership
  GROUP BY member
  ORDER BY member;

Postgres isn't structured to make dynamic pivot tables easy.

 CREATE or replace FUNCTION prepare() returns void language plpgsql as $F$
 BEGIN
     execute (
          WITH g AS ( select name from groups group by name order by name)
          SELECT
          $$
          create or replace function pg_temp.pivot () 
              RETURNS TABLE ( member text,
          $$ || string_agg( quote_ident( name )||' INT',',') || $$ 
          ) LANGUAGE SQL AS $X$
              SELECT member
    $$ || string_agg( ',bool_or(group_name=' || quote_literal( name ) ||
    ')::int AS '|| quote_ident( name ),e'\n') || $$
                 FROM group_membership
                 GROUP BY member
                 ORDER BY member; $X$; $$
    FROM g ) ;
END;
$F$;


select prepare();
select * from pg_temp.pivot();

 member | A | B | C | Z
--------+---+---+---+---
 000001 | 1 | 1 | 0 | 0
 000002 | 0 | 1 | 1 | 0
 000003 | 1 | 0 | 1 | 0
 000004 | 0 | 0 | 0 | 0
(4 rows)

here I use SQL to form the above query as a temporary function and then pull the result fows from that

in the main function I use a subselect which allows me to use a CTE which means I can order the column names.

I could have created a temporary view instead inside the main function but didn't think of that until just now.

I am making the assumption that the values in group_name are no longer than 64 octets, varchar(64) does not enforce this - the type name does, and is probably a better fit for this task.