PostgreSQL identifier cross-check returns different results for qualitatively-identical query

Setup: PostgreSQL 9.3.5

I have two tables with a common identifier (propid) that is unique in one table (property) and not unique in the second table (addressstring). There are no NULL propid in either table [UPDATE: not true, there were NULL in property.propid], and datatype is INTEGER in both tables.

The following queries – which I assert are functionally identical – return different results:

Query 1:

select distinct propid
  from addressstring
 where propid not in (select distinct propid 
                        from property);
-- returns zero rows

whereas

Query 2:

select distinct propid from addressstring
except 
select distinct propid from property;
-- returns 1478 rows, all non-NULL

The problem persisted after VACUUM ANALYZE (I did that just in case, although the tables have not been altered since import, except for the creation of a GIST index on a geometry column in property).

When I feed Query 2 as a CTE and look for the 1478 identified propid in the property table, sure enough they ain't there –

select propid 
  from property
 where propid in
              (select distinct propid from addressstring
               except 
               select distinct propid from property);
-- returns zero rows

That's as expected, but frankly I can't trust that result given that Query 1 returns zero rows when it should return 1478.

Three part question:

am I wrong in assuming that the two queries are functionally
identical?
If I'm not wrong, is there any sensible, defensible explanation for the discrepancy?

And finally

do I have to re-write every select ... where ... not in (select...) in the EXCEPT structure just to be safe?

That last one, if 'Y', is a gigantic PITA given that select... EXCEPT ... select is not the 'usual' way that most folks get 'A not in B', because often the NOT IN clause is a subset of the set of variables in the primary SELECT – e.g.,

 SELECT a.objectid AS addressoid, 
        a.propid, 
        b.objectid AS propertyoid
   FROM address a
   JOIN property b ON a.propid=b.propid
  WHERE b.propid NOT IN (SELECT propid 
                          WHERE [some condition]);

And yes, the WHERE clause could be written as WHERE NOT EXISTS (SELECT 1 [etc]), but the discrepancy in results is a BIG problem given that the query using NOT IN took a third of a second and the query structure is canonical by the definition in TFD (it is a 'condition that evaluates to a boolean').

Bear in mind – this is not an efficiency issue in the sense that one formulation takes longer than another, valid formulation. This is the fact that two formulations that are set-theoretically identical, produce very different result sets.

UPDATE: the Query Plans produced by EXPLAIN ANALYZE are as follows –

Query 1.

HashAggregate  (cost=21999.81..22497.30 rows=49749 width=4) (actual time=256.699..256.699 rows=0 loops=1)"
  ->  Seq Scan on nclm_addressstring_2015  (cost=11521.03..21738.10 rows=104683 width=4) (actual time=256.646..256.646 rows=0 loops=1)
        Filter: (NOT (hashed SubPlan 1))"
        Rows Removed by Filter: 209366 -- <-- Removed 100% of rows... WHY?????????
        SubPlan 1
          ->  HashAggregate  (cost=9427.01..11102.22 rows=167521 width=4) (actual time=91.192..123.690 rows=166742 loops=1)
                ->  Seq Scan on nclm_property_2015  (cost=0.00..9008.21 rows=167521 width=4) (actual time=0.006..40.914 rows=167521 loops=1)
Total runtime: 261.303 ms

Query 2

HashSetOp Except  (cost=10217.08..25651.99 rows=99497 width=4) (actual time=334.126..336.933 rows=1478 loops=1)
  ->  Append  (cost=10217.08..24984.45 rows=267018 width=4) (actual time=133.840..282.611 rows=334668 loops=1)
        ->  Subquery Scan on *SELECT* 1  (cost=10217.08..12207.01 rows=99497 width=4) (actual time=133.840..174.463 rows=167926 loops=1)
              ->  HashAggregate  (cost=10217.08..11212.05 rows=99497 width=4) (actual time=133.838..166.212 rows=167926 loops=1)
                    ->  Seq Scan on nclm_addressstring_2015  (cost=0.00..9693.66 rows=209366 width=4) (actual time=0.013..78.953 rows=209366 loops=1)
        ->  Subquery Scan on *SELECT* 2  (cost=9427.01..12777.43 rows=167521 width=4) (actual time=59.971..100.664 rows=166742 loops=1)
              ->  HashAggregate  (cost=9427.01..11102.22 rows=167521 width=4) (actual time=59.969..92.796 rows=166742 loops=1)
                    ->  Seq Scan on nclm_property_2015  (cost=0.00..9008.21 rows=167521 width=4) (actual time=0.005..26.279 rows=167521 loops=1)
Total runtime: 343.926 ms

Best Answer

In your 1st query there's a NULL returned by the subquery and you're a victim of Three-valued-logic.

A simple workaround is to add WHERE propid IS NOT NULL. But better switch to NOT EXISTS instead, which treats UNKNOWN as FALSE (i.e. Two-valued-logic):

select distinct propid
  from addressstring AS addr
 where NOT EXISTS
  (
    select * 
    from property AS prop
    where addr.propid = prop.propid
  );

Btw, property.propid sounds like the Primary Key and a PK must be defined as NOT NULL :)

Best Answer

Related Solutions

Postgresql – How to speed up a Postgres query containing lots of Joins with an ILIKE condition

Postgresql – postgres explain plan with giant gaps between operations

Related Question