Postgresql – Create 2D truth table from “cross product” of two tables via PostgreSQL view or function

pivotpostgresqlpostgresql-9.6

I have a working Excel-based method for creating a truth table from two vectors exported from a PostgreSQL database. The process takes about 4 hours to complete due to a large number of VLOOKUP and COUNTIFS operations, so I am looking for a method of implementing this as a view directly in the database.

The input vectors are produced from two existing views in my database, which have no foreign keys.

To make this question and solution as generic as possible, I've created a parallel problem using two simple tables with sample data to cover all possible cases:

CREATE TABLE group_membership
(
  member character varying(6) NOT NULL,
  group_name character varying(64) NOT NULL
);

INSERT INTO group_membership VALUES ('000001','A');
INSERT INTO group_membership VALUES ('000001','B');
INSERT INTO group_membership VALUES ('000001','B'); -- A value may occur more than once.
INSERT INTO group_membership VALUES ('000001','D'); -- A value may not necessarily have a corresponding row in the group table.
INSERT INTO group_membership VALUES ('000001','D');

INSERT INTO group_membership VALUES ('000002','B');
INSERT INTO group_membership VALUES ('000002','C');
INSERT INTO group_membership VALUES ('000002','E');

INSERT INTO group_membership VALUES ('000003','A');
INSERT INTO group_membership VALUES ('000003','C');

INSERT INTO group_membership VALUES ('000004','D');
INSERT INTO group_membership VALUES ('000004','E');

CREATE TABLE groups
(
  name character varying(64) NOT NULL
);

INSERT INTO groups VALUES ('A');
INSERT INTO groups VALUES ('B');
INSERT INTO groups VALUES ('C');
INSERT INTO groups VALUES ('C'); -- A value may occur more than once.
INSERT INTO groups VALUES ('Z');
-- 'D' and 'E' not present in this table

There are no relations between these two tables.

I am trying to construct a view that will create a binary truth table (matrix) like this:

member A B C Z
000001 t t f f
000002 f t t f
000003 t f t f
000004 f f f f

Where the first column are the distinct members from the group_membership table, and the subsequent columns show the presence or absence of the member in only the groups defined in the group table. The resulting table should be boolean only (TRUE if the member appears in a tuple with the group at least once, FALSE otherwise).

For instance, some specific "cells" in the above table would conform to the following:

SELECT COUNT(*) > 0 AS value FROM group_membership WHERE group_name='A' AND member='000001';
 value
-------
 t
(1 row)

SELECT COUNT(*) > 0 AS value FROM group_membership WHERE group_name='Z' AND member='000001';
 value
-------
 f
(1 row)

And to create the second column (the 'A' column):

SELECT COUNT(*) > 0 AS A FROM group_membership WHERE group_name='A' AND member='000001'
 UNION ALL
SELECT COUNT(*) > 0 AS A FROM group_membership WHERE group_name='A' AND member='000002'
 UNION ALL
SELECT COUNT(*) > 0 AS A FROM group_membership WHERE group_name='A' AND member='000003'
 UNION ALL
SELECT COUNT(*) > 0 AS A FROM group_membership WHERE group_name='A' AND member='000004'
;

Even better would be something like this (1 and 0 instead of TRUE and FALSE):

member A B C Z
000001 1 1 0 0
000002 0 1 1 0
000003 1 0 1 0
000004 0 0 0 0

Where the query for each of the individual "cells" could be of the form:

SELECT CASE WHEN COUNT(*) > 0 THEN 1 ELSE 0 END FROM group_membership WHERE group_name='A' AND member='000001';

My group_membership table has about 50,000 rows, and my group table has about 200 rows.

Note: If you do something like the following to ignore groups that are not in common among the two tables, you will end up eliminating rows like 000004 in the above sample result sets, which is not what I am looking for (member 000004 and group Z should be present in the result set):

SELECT * FROM group_membership WHERE group_name IN (SELECT DISTINCT(name) FROM groups);

As a first stab at solving this problem, I am looking into creating a FUNCTION that relies on a recursive JOIN over the group table to construct the result table.

Update: A FUNCTION requires a RETURNS TABLE definition, which looks like it not be a workable solution, given the variable number of columns in the result set. Some additional ideas I have are to create a function that performs a series of UNIONs over one dimension, and is then wrapped with a view that performs a UNION over a crosstab() of the results over SELECT DISTINCT(name) FROM groups ORDER BY name ASC;

Best Answer

It seems you basically want this, without writing this:

SELECT member
      ,bool_or(group_name='A')::int as "A"
      ,bool_or(group_name='B')::int as "B"
      ,bool_or(group_name='C')::int as "C"
      ,bool_or(group_name='Z')::int as "Z" 
  FROM group_membership
  GROUP BY member
  ORDER BY member;

Postgres isn't structured to make dynamic pivot tables easy.

 CREATE or replace FUNCTION prepare() returns void language plpgsql as $F$
 BEGIN
     execute (
          WITH g AS ( select name from groups group by name order by name)
          SELECT
          $$
          create or replace function pg_temp.pivot () 
              RETURNS TABLE ( member text,
          $$ || string_agg( quote_ident( name )||' INT',',') || $$ 
          ) LANGUAGE SQL AS $X$
              SELECT member
    $$ || string_agg( ',bool_or(group_name=' || quote_literal( name ) ||
    ')::int AS '|| quote_ident( name ),e'\n') || $$
                 FROM group_membership
                 GROUP BY member
                 ORDER BY member; $X$; $$
    FROM g ) ;
END;
$F$;


select prepare();
select * from pg_temp.pivot();

 member | A | B | C | Z
--------+---+---+---+---
 000001 | 1 | 1 | 0 | 0
 000002 | 0 | 1 | 1 | 0
 000003 | 1 | 0 | 1 | 0
 000004 | 0 | 0 | 0 | 0
(4 rows)

here I use SQL to form the above query as a temporary function and then pull the result fows from that

in the main function I use a subselect which allows me to use a CTE which means I can order the column names.

I could have created a temporary view instead inside the main function but didn't think of that until just now.

I am making the assumption that the values in group_name are no longer than 64 octets, varchar(64) does not enforce this - the type name does, and is probably a better fit for this task.

Related Solutions

Postgresql – Find “n” consecutive free numbers from table

This is a gaps-and-islands problem. Assuming there are no gaps or duplicates in the same id_set set:

WITH partitioned AS (
  SELECT
    *,
    number - ROW_NUMBER() OVER (PARTITION BY id_set) AS grp
  FROM atable
  WHERE status = 'FREE'
),
counted AS (
  SELECT
    *,
    COUNT(*) OVER (PARTITION BY id_set, grp) AS cnt
  FROM partitioned
)
SELECT
  id_set,
  number
FROM counted
WHERE cnt >= 3
;

Here's a SQL Fiddle demo^* link for this query: http://sqlfiddle.com/#!1/a2633/1.

UPDATE

To return only one set, you could add in one more round of ranking:

WITH partitioned AS (
  SELECT
    *,
    number - ROW_NUMBER() OVER (PARTITION BY id_set) AS grp
  FROM atable
  WHERE status = 'FREE'
),
counted AS (
  SELECT
    *,
    COUNT(*) OVER (PARTITION BY id_set, grp) AS cnt
  FROM partitioned
),
ranked AS (
  SELECT
    *,
    RANK() OVER (ORDER BY id_set, grp) AS rnk
  FROM counted
  WHERE cnt >= 3
)
SELECT
  id_set,
  number
FROM ranked
WHERE rnk = 1
;

Here's a demo for this one too: http://sqlfiddle.com/#!1/a2633/2.

If you ever need to make it one set per id_set, change the RANK() call like this:

RANK() OVER (PARTITION BY id_set ORDER BY grp) AS rnk

Additionally, you could make the query return the smallest matching set (i.e. first try to return the first set of exactly three consecutive numbers if it exists, otherwise four, five etc.), like this:

RANK() OVER (ORDER BY cnt, id_set, grp) AS rnk

or like this (one per id_set):

RANK() OVER (PARTITION BY id_set ORDER BY cnt, grp) AS rnk

_{* The SQL Fiddle demos linked in this answer use the 9.1.8 instance as the 9.2.1 one doesn't appear to be working at the moment.}

Mysql – Pivot table with MySQL

Give this a try,tested on my machine

First make sure GROUP_CONCAT has plenty of space:

SET SESSION group_concat_max_len = 1000000;

SET @sql = NULL;
SELECT
GROUP_CONCAT(DISTINCT
    CONCAT(
      'COUNT(DISTINCT CASE WHEN buyer_country= ''',
      buyer_country,
      ''' THEN transactions.buyer_id END) AS ',
      buyer_country
    )
   )INTO @sql
FROM  transactions 
LEFT JOIN buyers on transactions.buyer_id=buyers.buyer_id;


SET @sql=CONCAT('SELECT transactions.seller_country,',@sql,' FROM  transactions 
LEFT JOIN buyers on transactions.buyer_id=buyers.buyer_id GROUP BY transactions.seller_country');



PREPARE stmt FROM @sql;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;

Link

Edit

ALTER TABLE transactions ADD key trans_idx (buyer_id)
ALTER TABLE buyers ADD key buyers_idx (buyer_id)

Best Answer

Related Solutions

Postgresql – Find “n” consecutive free numbers from table

Mysql – Pivot table with MySQL

Related Question