bytea
will be optimal for storing the hash.
It'll be transferred in and out of the database as a hex string anyway, unless you use PostgreSQL's binary wire protocol (supported by libpq and partly by PgJDBC) to transfer it.
For best results, store as bytea and have the client application use a PQexecParams
call that requests binary results.
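As an illustration, here is a minimal sketch of the bytea approach. It assumes the pgcrypto extension for digest(); the table and column names are made up:

```sql
-- digest() from pgcrypto returns bytea directly, so the 32-byte SHA-256
-- hash is stored as-is instead of a 64-character hex string.
CREATE EXTENSION IF NOT EXISTS pgcrypto;

CREATE TABLE hashed_data (
    hash bytea PRIMARY KEY   -- the PK provides the unique b-tree index
);

-- Existence check: hash the external input inside the database,
-- so only the original text crosses the wire.
SELECT EXISTS (
   SELECT 1 FROM hashed_data
   WHERE  hash = digest('external input', 'sha256')
   );
```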
Though on re-reading, this is confusing:
For my implementation, the hash never needs to leave the database, but the hashed data must be compared with external data for existence frequently
Do you mean that the hash isn't transferred for comparison, the original unhashed text data is? If so, the above is irrelevant, as the binary protocol offers no benefits for text-form data.
Also: "tens of billions" of rows is a lot. PostgreSQL has quite a large per-row overhead of about 28 bytes, so you're going to be losing a lot of space, especially once you factor in index overhead too. Is PostgreSQL the right tool for this job?
A final thought: With that many rows, you're getting up into hash-collision territory. Do you care if it's possible - though unlikely - for two different strings to have the same hash, so an incorrect unique violation is reported? If that's a problem then a unique b-tree index on the hash probably isn't the right tool for the job.
Clarifications
Your comment needs addressing first:
numeric data almost always takes 0 (and text types take '')
The key word here is "almost". As long as it's not "never" (like in "never ever!"), you need to take NULL into account anyway.
no risk of testing NULL=NULL, which would return 1 inappropriately
No, it wouldn't. Anything compared to NULL with = yields NULL, even NULL = NULL. Try it. You need to understand NULL comparison.
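A quick demonstration in psql:

```sql
SELECT NULL = NULL                    AS simple_eq     -- NULL, not true!
     , NULL IS NOT DISTINCT FROM NULL AS null_safe_eq; -- true
```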
I think I just need to change sum(col1) to sum(col1::int) to get the number of rows where col1 changed.
If you want to count every case of a.col1 IS DISTINCT FROM b.col1, then you need to work with NULL-safe comparison to begin with. Apart from that, your expression would work. There are many alternatives, depending on the situation.
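For instance, a NULL-safe count per column could look like this (table names taken from the question; the FILTER clause requires Postgres 9.4+, so this is only a sketch):

```sql
SELECT count(*) FILTER (WHERE a.col1 IS DISTINCT FROM b.col1) AS col1_changes
     -- Postgres 9.3 or older: sum((a.col1 IS DISTINCT FROM b.col1)::int)
FROM   vm201512.property_d a
JOIN   vm201412.property   b USING (pfi);
```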
You use select a.* into vm201512 ... in your 1st query. Don't: SELECT INTO is discouraged. Use the superior CREATE TABLE AS like in your 2nd query.
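A sketch of the difference, with generic table names:

```sql
-- Discouraged:
-- SELECT a.* INTO new_tbl FROM old_tbl a;

-- Preferred: same result, standard-conforming and unambiguous
CREATE TABLE new_tbl AS
SELECT a.*
FROM   old_tbl a;
```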
Also, Postgres provides pivot functionality in the tablefunc module, but this is not a "pivot" problem at all. Nothing is pivoted here.
The core problem is the dynamic nature of the query due to varying input tables.
Solution
Assuming no NULL values. Where NULL values are possible, use IS NOT DISTINCT FROM instead of =.
Tested in Postgres 9.5. Should work for Postgres 9.1 or later.
You can build your queries like this:
CREATE OR REPLACE FUNCTION f_build_query(_t1 regclass
, _t2 regclass
, _join_col text = 'pfi')
RETURNS text AS
$func$
SELECT format('SELECT %I, %s FROM %s a JOIN %s b USING (%1$I);'
, _join_col
, string_agg(format ('a.%1$I = b.%1$I AS %1$I', attname), ', ' ORDER BY attnum)
, _t1, _t2)
FROM pg_attribute
WHERE attrelid = _t1 -- compare all columns from 1st table
AND NOT attisdropped -- no dropped (dead) columns
AND attnum > 0 -- no system columns
AND attname <> _join_col -- exclude "pfi"
$func$ LANGUAGE sql;
Call:
SELECT f_build_query('vm201512.property_d', 'vm201412.property');
Returns a query like this (which you can execute in turn):
SELECT pfi, a.a = b.a AS a, a."weird NaMe" = b."weird NaMe" AS "weird NaMe" -- more ...
FROM vm201512.property_d a JOIN vm201412.property b USING (pfi);
Result:
pfi | a | b | weird NaMe
-----+---+---+------------
1 | t | f | t
2 | f | t | f
Works for arbitrary input tables, and deals with identifiers safely. You can provide table names schema-qualified or not, as you like.
Simple dynamic solution
The difficulty is to return varying row types: SQL requires the return type to be known at call time. To avoid this, you could return a simple array instead. You get values in the original order of columns, but you don't get column names like in the first query:
CREATE OR REPLACE FUNCTION f_diff_matrix(_t1 regclass
, _t2 regclass
, _join_col text = 'pfi')
RETURNS TABLE (pfi int, change_matrix bool[]) AS -- Adapt data type of pfi to your needs!
$func$
BEGIN
RETURN QUERY EXECUTE (
SELECT format('SELECT %I, ARRAY[%s] FROM %s a JOIN %s b USING (%1$I)'
, _join_col
, string_agg(format ('a.%1$I = b.%1$I', attname), ', ' ORDER BY attnum)
, _t1, _t2)
FROM pg_attribute
WHERE attrelid = _t1 -- compare all columns from 1st table
AND NOT attisdropped -- no dropped (dead) columns
AND attnum > 0 -- no system columns
AND attname <> _join_col -- exclude "pfi"
);
END
$func$ LANGUAGE plpgsql;
Call (note the difference!):
SELECT * FROM f_diff_matrix('vm201512.property_d', 'vm201412.property');
Result:
pfi | change_matrix
-----+---------------
1 | {t,f,t} -- one element per column
2 | {f,t,f}
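If you need to know which columns changed for each row, you could unnest the array. This is just a sketch on top of the function above (WITH ORDINALITY requires Postgres 9.4+):

```sql
-- The matrix holds a = b per column, so false means "changed"
SELECT t.pfi, u.ordinality AS column_position
FROM   f_diff_matrix('vm201512.property_d', 'vm201412.property') t
     , unnest(t.change_matrix) WITH ORDINALITY AS u(changed, ordinality)
WHERE  NOT u.changed;
```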
You could even make the same function return a dynamic result set for varying tables, but I doubt it's worth the complication. If you really need dynamic pivot functionality (not the case here), look into the crosstab() functions of the tablefunc module mentioned above.