Postgresql – Find unmatched rows between two tables dynamically

dynamic-sqlexceptjoin;plpgsqlpostgresql

I have a function here that is supposed to take two tables as arguments and check if they are the same.

create or replace function testing.equal_tables(
    varchar,
    varchar)
    returns void as
$$
begin

    execute 'select *
     from
     (select * from ' || $1 ||'
     except
     select * from ' || $2 || ') a
     union
     (select * from ' || $2 || '
     except
     select * from ' || $1 || ');'
    ;

end;
$$ language plpgsql;

When I call it with these two tables, one with 20 rows and one with 10 rows, I get only the empty set, which is not the correct result:

select testing.equal_tables('ee1', 'ee2');

When I modified the function to return a string of the statement, it came back correctly, but that still doesn't help because I'd like to execute the return string using a function, prepared statement, or something. Is there any way to make this function work?

Best Answer

I derive from your example that you only want to match tables with the same structure (compatible row types).

Base query

First of all, your base query is needlessly complex and possibly incorrect. Consider instead:

(TABLE a EXCEPT ALL TABLE b)  -- ALL keeps duplicate rows and is faster
UNION ALL
(TABLE b EXCEPT ALL TABLE a);

I doubt you want to fold duplicates. Typically, tables do not hold complete duplicates to begin with, and omitting the ALL key word would instruct Postgres to try and fold duplicates in every step, which is just a waste of time. Details:

Select rows which are not present in other table

Even if completely duplicate rows are possible, it might be best to return all of them to avoid misleading answers. If you want to fold duplicates in the result anyway, a single UNION does the job:

(TABLE a EXCEPT ALL TABLE b)
UNION
(TABLE b EXCEPT ALL TABLE a);

However, a query with NATURAL FULL OUTER JOIN is more efficient, doing almost the same:

SELECT *
FROM   a NATURAL FULL OUTER JOIN b
WHERE  a IS NULL OR b IS NULL;

There are two subtle, exotic corner cases:

This returns rows that are all NULL to begin with. It would be an odd table design that allows such rows, though.
This does not return duplicate rows that are not matched with the same number of duplicate rows in the other table: a single match is enough to eliminate all - which is subtly different from folding duplicates in the result! Have a look at the fiddle below. Again, it would be an odd table design that allows completely duplicate rows.

Function

What you have so far does not work for several reasons.

To return actual table rows dynamically (and not just a count or a text representation), you need to use a polymorphic type.

Since the second table is bound to have a compatible row type (as per my assumption) it's enough to hand in just the table name for that.

CREATE OR REPLACE FUNCTION f_tbl_diff(_tbl1_type ANYELEMENT, _tbl2 text)
  RETURNS SETOF ANYELEMENT AS
$func$
BEGIN
   RETURN QUERY EXECUTE format('
      SELECT *
      FROM   %1$s NATURAL FULL OUTER JOIN %2$I
      WHERE  %1$s IS NULL OR %2$I IS NULL'
    , pg_typeof(_tbl1_type), _tbl2);
END
$func$  LANGUAGE plpgsql;

Call:

SELECT * FROM f_tbl_diff(NULL::a, 'b');

Note the special syntax for the first argument! We hand in an actual row defining the return type, not just a table name. Related answer on SO with ample details (scroll to the last chapter):

Refactor a PL/pgSQL function to return the output of various SELECT queries

This returns all rows that have no exact, complete match in the other table

SQL Fiddle also demonstrating exotic corner cases.

About the dynamic SQL:

SQL injection in Postgres functions vs prepared queries

Related Solutions

Sql-server – Easily show rows that are different between two tables or queries

You don't need 30 join conditions for a FULL OUTER JOIN here.

You can just Full Outer Join on the PK, preserve rows with at least one difference with WHERE EXISTS (SELECT A.* EXCEPT SELECT B.*) and use CROSS APPLY (SELECT A.* UNION ALL SELECT B.*) to unpivot out both sides of the JOINed rows into individual rows.

WITH TableA(Col1, Col2, Col3) 
     AS (SELECT 'Dog',1,1     UNION ALL 
         SELECT 'Cat',27,86   UNION ALL 
         SELECT 'Cat',128,92), 
     TableB(Col1, Col2, Col3) 
     AS (SELECT 'Dog',1,1     UNION ALL 
         SELECT 'Cat',27,105  UNION ALL 
         SELECT 'Lizard',83,NULL) 
SELECT CA.*
FROM   TableA A 
       FULL OUTER JOIN TableB B 
         ON A.Col1 = B.Col1 
            AND A.Col2 = B.Col2 
/*Unpivot the joined rows*/
CROSS APPLY (SELECT 'TableA' AS what, A.* UNION ALL
             SELECT 'TableB' AS what, B.*) AS CA     
/*Exclude identical rows*/
WHERE  EXISTS (SELECT A.* 
               EXCEPT 
               SELECT B.*) 
/*Discard NULL extended row*/
AND CA.Col1 IS NOT NULL      
ORDER BY CA.Col1, CA.Col2

Gives

what   Col1   Col2        Col3
------ ------ ----------- -----------
TableA Cat    27          86
TableB Cat    27          105
TableA Cat    128         92
TableB Lizard 83          NULL

Or a version dealing with the moved goalposts.

SELECT DISTINCT CA.*
FROM   TableA A 
       FULL OUTER JOIN TableB B 
         ON EXISTS (SELECT A.*  INTERSECT  SELECT B.*) 
CROSS APPLY (SELECT 'TableA' AS what, A.* UNION ALL
             SELECT 'TableB' AS what, B.*) AS CA     
WHERE NOT EXISTS (SELECT A.*  INTERSECT  SELECT B.*) 
AND CA.Col1 IS NOT NULL
ORDER BY CA.Col1, CA.Col2

For tables with many columns it can still be difficult to identify the specific column(s) that differ. For that you can potentially use the below.

(though just on relatively small tables as otherwise this method likely won't have adequate performance)

SELECT t1.primary_key,
       y1.c,
       y1.v,
       y2.v
FROM   t1
       JOIN t2
         ON t1.primary_key = t2.primary_key
       CROSS APPLY (SELECT t1.*
                    FOR xml path('row'), elements xsinil, type) x1(x)
       CROSS APPLY (SELECT t2.*
                    FOR xml path('row'), elements xsinil, type) x2(x)
       CROSS APPLY (SELECT n.n.value('local-name(.)', 'sysname'),
                           n.n.value('.', 'nvarchar(max)')
                    FROM   x1.x.nodes('row/*') AS n(n)) y1(c, v)
       CROSS APPLY (SELECT n.n.value('local-name(.)', 'sysname'),
                           n.n.value('.', 'nvarchar(max)')
                    FROM   x2.x.nodes('row/*') AS n(n)) y2(c, v)
WHERE  y1.c = y2.c
       AND EXISTS(SELECT y1.v
                  EXCEPT
                  SELECT y2.v)

PostgreSQL Return Joined Results of Select

Use RETURN QUERY SELECT ...;.

PostgreSQL is complaining because you aren't specifying what to return. Also, since you aren't actually passing parameters to the function this could also be accomplished by a view.