bytea
will be optimal for storing the hash.
It'll be transferred in and out of the database as a hex string anyway, unless you use PostgreSQL's binary wire protocol (supported by libpq and partly by PgJDBC) to transfer it.
For best results, store as bytea and have the client application use a PQexecParams
call that requests binary results.
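As an illustration, here is a minimal sketch of the bytea approach. It assumes the pgcrypto extension for digest(); the table and column names are made up:

```sql
-- digest() from pgcrypto returns bytea directly, so the 32-byte SHA-256
-- hash is stored as-is instead of a 64-character hex string.
CREATE EXTENSION IF NOT EXISTS pgcrypto;

CREATE TABLE hashed_data (
    hash bytea PRIMARY KEY   -- the PK provides the unique b-tree index
);

-- Existence check: hash the external input inside the database,
-- so only the original text crosses the wire.
SELECT EXISTS (
   SELECT 1 FROM hashed_data
   WHERE  hash = digest('external input', 'sha256')
   );
```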
Though on re-reading, this is confusing:
For my implementation, the hash never needs to leave the database, but the hashed data must be compared with external data for existence frequently
Do you mean that the hash isn't transferred for comparison, the original unhashed text data is? If so, the above is irrelevant, as the binary protocol offers no benefits for text-form data.
Also: "tens of billions" of rows is a lot. PostgreSQL has quite a large per-row overhead of about 28 bytes, so you're going to be losing a lot of space, especially once you factor in index overhead too. Is PostgreSQL the right tool for this job?
A final thought: With that many rows, you're getting up into hash-collision territory. Do you care if it's possible - though unlikely - for two different strings to have the same hash, so an incorrect unique violation is reported? If that's a problem then a unique b-tree index on the hash probably isn't the right tool for the job.
Clarifications
Your comment needs addressing first:
numeric data almost always takes 0 (and text types take '')
The key word here is "almost". As long as it's not "never" (like in "never ever!"), you need to take NULL into account anyway.
no risk of testing NULL=NULL, which would return 1 inappropriately
No, it wouldn't. Anything compared to NULL with = yields NULL, even NULL = NULL. Try it. You need to understand NULL comparison.
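A quick demonstration in psql:

```sql
SELECT NULL = NULL                    AS simple_eq     -- NULL, not true!
     , NULL IS NOT DISTINCT FROM NULL AS null_safe_eq; -- true
```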
I think I just need to change sum(col1) to sum(col1::int) to get the number of rows where col1 changed.
If you want to count every case of a.col1 IS DISTINCT FROM b.col1, then you need to work with NULL-safe comparison to begin with. Apart from that, your expression would work. There are many alternatives, depending on the situation.
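For instance, a NULL-safe count per column could look like this (table names taken from the question; the FILTER clause requires Postgres 9.4+, so this is only a sketch):

```sql
SELECT count(*) FILTER (WHERE a.col1 IS DISTINCT FROM b.col1) AS col1_changes
     -- Postgres 9.3 or older: sum((a.col1 IS DISTINCT FROM b.col1)::int)
FROM   vm201512.property_d a
JOIN   vm201412.property   b USING (pfi);
```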
You use select a.* into vm201512 ... in your 1st query. Don't: SELECT INTO is discouraged. Use the superior CREATE TABLE AS like in your 2nd query.
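A sketch of the difference, with generic table names:

```sql
-- Discouraged:
-- SELECT a.* INTO new_tbl FROM old_tbl a;

-- Preferred: same result, standard-conforming and unambiguous
CREATE TABLE new_tbl AS
SELECT a.*
FROM   old_tbl a;
```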
Also, Postgres provides pivot functionality in the tablefunc module, but this is not a "pivot" problem at all. Nothing is pivoted here.
The core problem is the dynamic nature of the query due to varying input tables.
Solution
Assuming no NULL values. Where NULL values are possible, use IS NOT DISTINCT FROM instead of =.
Tested in Postgres 9.5. Should work for Postgres 9.1 or later.
You can build your queries like this:
CREATE OR REPLACE FUNCTION f_build_query(_t1 regclass
, _t2 regclass
, _join_col text = 'pfi')
RETURNS text AS
$func$
SELECT format('SELECT %I, %s FROM %s a JOIN %s b USING (%1$I);'
, _join_col
, string_agg(format ('a.%1$I = b.%1$I AS %1$I', attname), ', ' ORDER BY attnum)
, _t1, _t2)
FROM pg_attribute
WHERE attrelid = _t1 -- compare all columns from 1st table
AND NOT attisdropped -- no dropped (dead) columns
AND attnum > 0 -- no system columns
AND attname <> _join_col -- exclude "pfi"
$func$ LANGUAGE sql;
Call:
SELECT f_build_query('vm201512.property_d', 'vm201412.property');
Returns a query like this (which you can execute in turn):
SELECT pfi, a.a = b.a AS a, a."weird NaMe" = b."weird NaMe" AS "weird NaMe" -- more ...
FROM vm201512.property_d a JOIN vm201412.property b USING (pfi);
Result:
pfi | a | b | weird NaMe
-----+---+---+------------
1 | t | f | t
2 | f | t | f
Works for arbitrary input tables, and deals with identifiers safely. You can provide table names schema-qualified or not, as you like.
Simple dynamic solution
The difficulty is to return varying row types: SQL requires the return type to be known at call time. To avoid this, you could return a simple array instead. You get values in the original order of columns, but you don't get column names like in the first query:
CREATE OR REPLACE FUNCTION f_diff_matrix(_t1 regclass
, _t2 regclass
, _join_col text = 'pfi')
RETURNS TABLE (pfi int, change_matrix bool[]) AS -- Adapt data type of pfi to your needs!
$func$
BEGIN
RETURN QUERY EXECUTE (
SELECT format('SELECT %I, ARRAY[%s] FROM %s a JOIN %s b USING (%1$I)'
, _join_col
, string_agg(format ('a.%1$I = b.%1$I', attname), ', ' ORDER BY attnum)
, _t1, _t2)
FROM pg_attribute
WHERE attrelid = _t1 -- compare all columns from 1st table
AND NOT attisdropped -- no dropped (dead) columns
AND attnum > 0 -- no system columns
AND attname <> _join_col -- exclude "pfi"
);
END
$func$ LANGUAGE plpgsql;
Call (note the difference!):
SELECT * FROM f_diff_matrix('vm201512.property_d', 'vm201412.property');
Result:
pfi | change_matrix
-----+---------------
1 | {t,f,t} -- one element per column
2 | {f,t,f}
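If you need to know which columns changed for each row, you could unnest the array. This is just a sketch on top of the function above (WITH ORDINALITY requires Postgres 9.4+):

```sql
-- The matrix holds a = b per column, so false means "changed"
SELECT t.pfi, u.ordinality AS column_position
FROM   f_diff_matrix('vm201512.property_d', 'vm201412.property') t
     , unnest(t.change_matrix) WITH ORDINALITY AS u(changed, ordinality)
WHERE  NOT u.changed;
```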
You could even make the same function return a dynamic result set for varying tables, but I doubt it's worth the complication. If you really need dynamic pivot functionality (not the case here), look into the crosstab() functions of the tablefunc module mentioned above.