Postgresql – Find changed columns for near-identical rows across tables

dynamic-sql postgresql

I run PostgreSQL 9.3.5 on Windows 7, 64-bit.

My data arrives quarterly, in multiple tables (table1, ..., tableN) that are linked, intra-period, by cross-table constraints based on key identifiers. Among other columns, each table has identifiers that persist over time: pfi – persistent feature identifier and ufi – universal feature identifier.

pfi is unique per table (it's exceedingly rare that table1.pfi = table2.pfi).
ufi is unique across all tables and across all time. It's not a hash of the row data, but you could think of it as such.

Each period, in each table, some new pfi are brought into being and some old pfi are retired. Some pfi change attributes. ufi tracks any change to any attribute for a given pfi (row), so to fetch changed (and new) rows for table1 it's simply a matter of:

-- 1st query
select a.*
into vm201512.property_d
from vm201512.property a
where not exists (select 1 from vm201412.property where ufi = a.ufi);

This selects all rows which are either new (new pfi) or changed in at least one column.

About 96% of each table remains unchanged in every respect. Accordingly, in analysing the cross-period changes I build a table that only includes changed and new data. This reduces the table size from ~3.5m rows to ~225k rows: that's a BIG reduction if you subsequently do spatial comparisons with relatively-complex polygons and multiple (spatial and non-spatial) JOINs.

The property table has relatively few columns, so I can identify which elements of the data have changes as follows:

-- 2nd query
create table vm201512.property_d_changes as 
select pfi, 
   case when a.view_pfi=b.view_pfi then 0::int else 1::INT end as view_pfi,
   case when a.status=b.status then 0::int else 1::INT end as status,
   case when a.property_type=b.property_type then 0::int else 1::INT end as property_type,
   -- ... more columns
from vm201512.property_d a -- table created with first query
join vm201412.property b using (pfi);

This gives me a nice table where I can determine precisely what changes happened to a changed (not new) row. I can figure out that pfi 123456 had changes to its propnum and its status; I can figure out how many pfi had changes to their view_pfi – that sort of thing.

Several of the other tables have >50 columns, which makes the case statement unwieldy (I realise it only has to be coded once, but what if the data structure changes?)

Question

With two rows in 2 different tables new.table1, old.table1 where new.table1.pfi = old.table1.pfi and one or more columns different, is there a parsimonious, elegant PostgreSQL statement to figure out the changed columns? Or am I stuck with CASE?

I realise I could write a dynamic function to loop through all columns for a given table, and build the query with CASE statements.

Best Answer

Clarifications

Your comment needs addressing first:

"numeric data almost always takes 0 (and text types take '')"

The key word here is "almost". As long as it's not "never" (like in "never ever!"), you need to take NULL into account anyway.

"no risk of testing NULL=NULL, which would return 1 inappropriately"

No, it wouldn't. Anything compared to NULL with = yields NULL, even NULL = NULL. Try it. You need to understand NULL comparison.
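A quick way to see this for yourself is to run a few literal comparisons. This is only an illustrative sketch; the column aliases are made up for readability:

```sql
-- Plain "=" yields NULL (unknown) whenever either side is NULL;
-- the IS [NOT] DISTINCT FROM forms treat NULL as a comparable value.
SELECT NULL = NULL                    AS plain_equals       -- NULL, not true
     , NULL IS NOT DISTINCT FROM NULL AS null_safe_equals   -- true
     , 0 = NULL                       AS zero_vs_null       -- NULL
     , 0 IS DISTINCT FROM NULL        AS null_safe_differs; -- true
```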

"I think I just need to change sum(col1) to sum(col1::int) to get the number of rows where col1 changed."

If you want to count every case of a.col1 IS DISTINCT FROM b.col1, then you need to work with NULL-safe comparison to begin with. Apart from that, your expression would work. There are many alternatives, depending on the situation:

For absolute performance, is SUM faster or COUNT?
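As a rough sketch of two such alternatives, both NULL-safe and both usable on versions without the aggregate FILTER clause, here is a count of rows where one column changed. It uses the status column and the tables from the question; the output aliases are made up:

```sql
SELECT count(a.status IS DISTINCT FROM b.status OR NULL) AS status_changed_count
       -- (true OR NULL) is true, (false OR NULL) is NULL, and count() ignores NULL
     , sum((a.status IS DISTINCT FROM b.status)::int)    AS status_changed_sum
       -- boolean cast to int: true -> 1, false -> 0
FROM   vm201512.property_d a
JOIN   vm201412.property   b USING (pfi);
```

Note that sum() returns NULL rather than 0 when the join produces no rows, while count() returns 0.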

You use select a.* into vm201512 ... in your 1st query. Don't. SELECT INTO .. is discouraged. Use the superior CREATE TABLE AS ... like in your 2nd query.

Creating temporary tables in SQL
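For illustration, the 1st query from the question could be rewritten with CREATE TABLE AS roughly like this (same tables and columns as in the question; only the alias b is added):

```sql
CREATE TABLE vm201512.property_d AS
SELECT a.*
FROM   vm201512.property a
WHERE  NOT EXISTS (
   SELECT 1
   FROM   vm201412.property b
   WHERE  b.ufi = a.ufi   -- same ufi in the old period means the row is unchanged
   );
```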

Also, Postgres provides pivot functionality in the tablefunc module, but this is not a "pivot" problem at all. Nothing is pivoted here.

The core problem is the dynamic nature of the query due to varying input tables.

Solution

Assuming no NULL values. Where NULL values are possible, use IS NOT DISTINCT FROM instead of =.
Tested in Postgres 9.5. Should work for Postgres 9.1 or later.

You can build your queries like this:

CREATE OR REPLACE FUNCTION f_build_query(_t1 regclass
                                       , _t2 regclass
                                       , _join_col text = 'pfi')
  RETURNS text AS
$func$
SELECT format('SELECT %I, %s FROM %s a JOIN %s b USING (%1$I);'
            , _join_col
            , string_agg(format ('a.%1$I = b.%1$I AS %1$I', attname), ', ' ORDER BY attnum)
            , _t1, _t2)
FROM   pg_attribute
WHERE  attrelid = _t1        -- compare all columns from 1st table
AND    NOT attisdropped      -- no dropped (dead) columns
AND    attnum > 0            -- no system columns
AND    attname <> _join_col  -- exclude "pfi"
$func$  LANGUAGE sql;

Call:

SELECT f_build_query('vm201512.property_d', 'vm201412.property');

Returns a query like this (which you can execute in turn):

SELECT pfi, a.a = b.a AS a, a."weird NaMe" = b."weird NaMe" AS "weird NaMe" -- more ...
FROM vm201512.property_d a JOIN vm201412.property b USING (pfi);

Result:

 pfi | a | b | weird NaMe
-----+---+---+------------
   1 | t | f | t
   2 | f | t | f
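If you would rather materialize the result than paste the generated text back into a client, one possible approach is to execute it from a DO block. This is a sketch only: the target table name vm201512.property_d_diff is hypothetical, and it relies on f_build_query() as defined above.

```sql
DO
$do$
BEGIN
   -- Prepend CREATE TABLE ... AS to the generated SELECT and run it
   EXECUTE 'CREATE TABLE vm201512.property_d_diff AS '   -- hypothetical table name
        || f_build_query('vm201512.property_d', 'vm201412.property');
END
$do$;
```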

Works for arbitrary input tables, and deals with identifiers safely. You can provide table names schema-qualified or not, as you like.

Table name as a PostgreSQL function parameter
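Where NULL values are possible, a NULL-safe variant might look like the following sketch. The function name f_build_query_changed is made up here, and note that the polarity is flipped compared to the function above: with IS DISTINCT FROM, true means the column changed.

```sql
CREATE OR REPLACE FUNCTION f_build_query_changed(_t1 regclass
                                               , _t2 regclass
                                               , _join_col text = 'pfi')
  RETURNS text AS
$func$
SELECT format('SELECT %I, %s FROM %s a JOIN %s b USING (%1$I);'
            , _join_col
              -- NULL-safe comparison: true = changed, false = unchanged, never NULL
            , string_agg(format('a.%1$I IS DISTINCT FROM b.%1$I AS %1$I', attname)
                       , ', ' ORDER BY attnum)
            , _t1, _t2)
FROM   pg_attribute
WHERE  attrelid = _t1        -- compare all columns from 1st table
AND    NOT attisdropped      -- no dropped (dead) columns
AND    attnum > 0            -- no system columns
AND    attname <> _join_col  -- exclude the join column
$func$  LANGUAGE sql;
```

It is called the same way as f_build_query().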

Simple dynamic solution

The difficulty is to return varying row types. SQL demands to know the return type at call time. To avoid difficulties, you could return a simple array instead. You get values in the original order of columns, but you don't get column names like in the first query:

CREATE OR REPLACE FUNCTION f_diff_matrix(_t1 regclass
                                       , _t2 regclass
                                       , _join_col text = 'pfi')
  RETURNS TABLE (pfi int, change_matrix bool[]) AS   -- Adapt data type of pfi to your needs!
$func$
BEGIN
   RETURN QUERY EXECUTE (
   SELECT format('SELECT %I, ARRAY[%s] FROM %s a JOIN %s b USING (%1$I)'
               , _join_col
               , string_agg(format ('a.%1$I = b.%1$I', attname), ', ' ORDER BY attnum)
               , _t1, _t2)
   FROM   pg_attribute
   WHERE  attrelid = _t1        -- compare all columns from 1st table
   AND    NOT attisdropped      -- no dropped (dead) columns
   AND    attnum > 0            -- no system columns
   AND    attname <> _join_col  -- exclude "pfi"
   );
END
$func$  LANGUAGE plpgsql;

Call (note the difference!):

SELECT * FROM f_diff_matrix('vm201512.property_d', 'vm201412.property');

Result:

 pfi | change_matrix
-----+---------------
   1 | {t,f,t}  -- one element per column
   2 | {f,t,f}
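Since the array carries no column names, a small helper query can list them in matching order, and unnest() can summarize each row. Both statements below are sketches that assume the function and tables above; the output aliases are made up.

```sql
-- Column names in the same order as the elements of change_matrix
SELECT array_agg(attname::text ORDER BY attnum) AS column_names
FROM   pg_attribute
WHERE  attrelid = 'vm201512.property_d'::regclass
AND    NOT attisdropped
AND    attnum > 0
AND    attname <> 'pfi';

-- Number of differing columns per row
-- (an element is f where a.col = b.col is false, i.e. the values differ)
SELECT pfi
     , (SELECT count(*)
        FROM   unnest(change_matrix) AS m(same)
        WHERE  NOT m.same) AS cols_differing
FROM   f_diff_matrix('vm201512.property_d', 'vm201412.property');
```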

SQL Fiddle.

You could even make the same function return a dynamic result set for various tables, but I doubt it's worth the complication:

Refactor a PL/pgSQL function to return the output of various SELECT queries

If you really need dynamic pivot functionality (not in this case):

Dynamic alternative to pivot with CASE and GROUP BY

Related Solutions

Postgresql – Schema for lists of relationship differences

If I understand your question correctly you are looking to store, let's say, a Person entity with Name and DateOfBirth attributes. You will also have an Employee entity which is defined as everything a Person has plus EmploymentNumber plus DeskNumber but doesn't have DateOfBirth. I'm guessing you want to be able to add a column to a base type and automatically see it showing up in the sub-type(s), and sub-sub-type(s). In short you are looking for an object-oriented data store. RDBMSs in general are not well suited to deliver this feature.

If you know what these differences are and they are stable, at least between schema releases, you can define a type/sub-type model. The base table has the columns which are common to all types. It has a surrogate primary key. From there you define further tables, strictly adding columns as you go. The same ID is carried through this hierarchy. The case where you need to remove columns from one type to a descendant is handled by defining an abstract common ancestor and placing the removed column in one branch but not the other.

For my example above we would end up with these tables

Human
    ID     Primary key
    Name

Person
    ID     Primary key and also foreign key to Human.ID
    DateOfBirth

Employee
    ID     Primary key and also foreign key to Human.ID
    EmploymentNumber
    DeskNumber
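As a sketch, the outline above might translate into DDL like this. The snake_case names, the data types, and the use of serial are assumptions for illustration, not part of the original design:

```sql
CREATE TABLE human (
   id   serial PRIMARY KEY          -- surrogate key shared by the whole hierarchy
 , name text NOT NULL
);

CREATE TABLE person (
   id            int PRIMARY KEY REFERENCES human (id)  -- PK and FK at once
 , date_of_birth date
);

CREATE TABLE employee (
   id                int PRIMARY KEY REFERENCES human (id)
 , employment_number text           -- assumed type
 , desk_number       text           -- assumed type
);
```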

You could add a column to the base table (Human) to indicate which of the sub-types any particular occurrence writes to. I don't think that is necessary because the process which writes to this database has to know which type it is dealing with in order to capture the correct values. Even if that process dynamically builds its list of attributes by examining the DB, it has to know which ultimate sub-type it is looking for to bootstrap the process.

It is tempting to have, say, EmployeeID and a separate foreign key HumanID. This is unnecessary.

When retrieving values you will either be interested in a known occurrence of a sub-type or will be looking for all available values for an instance, i.e. "Human 99 as Employee" or "All about occurrence 99". The former can be had by INNER JOINing Employee up to its ultimate ancestor using the defined keys. The latter by OUTER JOINing all tables in the model.
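The two retrieval patterns could be sketched as follows, assuming tables named human, person and employee that follow the Human / Person / Employee outline, with snake_case column names (all of these names are assumptions):

```sql
-- "Human 99 as Employee": inner join from the sub-type up to its ancestor
SELECT h.id, h.name, e.employment_number, e.desk_number
FROM   employee e
JOIN   human    h ON h.id = e.id
WHERE  e.id = 99;

-- "All about occurrence 99": outer join every table in the model
SELECT h.id, h.name, p.date_of_birth, e.employment_number, e.desk_number
FROM   human h
LEFT   JOIN person   p ON p.id = h.id
LEFT   JOIN employee e ON e.id = h.id
WHERE  h.id = 99;
```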

If your attributes can vary at run time you will be forced into using some variant of an entity-attribute-value (EAV) model. This has been well documented in many, many blogs, papers, forums and SO questions. While being fractious, conceptually challenging and non-performant these models can be made to work (for an appropriate definition of "work"). Your tables will be a lot like this:

ItemType
    ItemTypeID    primary key
    Name
    ParentTypeID  fk to ItemTypeID

Attribute
    AttributeID   primary key
    Name
    AddedOrDeletedInd
    ItemTypeID    fk to ItemTypeID

Item
    ItemID
    ItemTypeID    fk to ItemTypeID
    ParentItemID  fk to ItemID

ItemValue
    AttributeID   fk to Attribute
    ItemID        fk to Item
    Value

ItemValue.Value could be a catch-all varchar() column, or you could have one column for each datatype you want, i.e. ValueInt, ValueDate, ValueChar etc., and an indicator in Attribute to say which is populated. As you can see, reading any one item's full definition will require recursion through a tree. Good luck.
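To give an idea of what that recursion looks like, here is a sketch of a recursive CTE that walks from one item type up through its ancestors and collects the attributes defined along the way. It assumes the tables exist as outlined above with unquoted (lower-cased) identifiers; the type id 3 is a made-up example value.

```sql
WITH RECURSIVE type_chain AS (
   SELECT itemtypeid, parenttypeid
   FROM   itemtype
   WHERE  itemtypeid = 3                    -- hypothetical starting type

   UNION ALL
   SELECT t.itemtypeid, t.parenttypeid      -- step up to the parent type
   FROM   type_chain c
   JOIN   itemtype   t ON t.itemtypeid = c.parenttypeid
   )
SELECT a.attributeid, a.name, a.addedordeletedind, c.itemtypeid
FROM   type_chain c
JOIN   attribute  a ON a.itemtypeid = c.itemtypeid;
```

The application still has to interpret AddedOrDeletedInd for each level to arrive at the final attribute list.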

Postgresql – Optimize PostgreSQL server setting for extremely wide tables

How many of those columns do you use for grouping? If it's relatively few, then I would recommend restructuring the data into a long format, where each grouping (category) column is maintained, and each grouped-by (metric) column is instead folded into two columns, variable and value, similar to how R's reshape2::melt function works. For instance, a table:

id|cat1|cat2|metric1|metric2|

Would become:

id|cat1|cat2|variable|value
id|cat1|cat2|metric1|value of metric 1 column
...
id|cat1|cat2|metric2|value of metric 2 column

The table would become K times longer, with K being the number of metrics you'd like to melt. This can actually improve query performance if you add indexes on your category columns.
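In plain SQL, one simple way to sketch this reshaping is a UNION ALL with one branch per metric. The table name wide_table is hypothetical, and the metric columns are assumed to share a data type (otherwise cast them to a common one):

```sql
SELECT id, cat1, cat2, 'metric1' AS variable, metric1 AS value
FROM   wide_table                      -- hypothetical table name
UNION ALL
SELECT id, cat1, cat2, 'metric2', metric2
FROM   wide_table;
```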

If that doesn't speed up performance, then I'd recommend using a different tool than Postgres, such as Apache Spark.