PostgreSQL Performance – Bad Performance with SELECT on View Checking IS NULL

performancepostgresqlpostgresql-9.3postgresql-performanceview

I have two quite simple tables, and a view to join them on bigint fields:

CREATE TABLE foo.ao (
  id serial NOT NULL,
  ao_id bigint NOT NULL,
  ao_plz text,
  ao_community text,
  ao_street text,
  ao_ms_id text,
  ao_status text DEFAULT 'Standard'::character varying,
  ao_has_eear boolean,
  ao_last_update timestamp without time zone,
  CONSTRAINT ao_pkey PRIMARY KEY (id),
  CONSTRAINT ao_id_unique UNIQUE (ao_id)
);

CREATE TABLE foo.ms (
  id serial NOT NULL,
  ms_nis_number text NOT NULL,
  ms_plz text,
  ms_community text,
  ms_street text,
  ms_status text DEFAULT 'Standard'::character varying,
  ms_coord_x integer,
  ms_coord_y integer,
  ms_ao_id bigint,
  CONSTRAINT ms_pkey PRIMARY KEY (id),
  CONSTRAINT nis_number_unique UNIQUE (ms_nis_number)
)

CREATE OR REPLACE VIEW foo.ao_ms AS 
 SELECT ao.ao_id, 
    ao.ao_plz, 
    ao.ao_community, 
    ao.ao_street, 
    ao.ao_last_update, 
    ao.ao_ms_id, 
    ao.ao_status, 
    ao.ao_has_eear, 
    ms.ms_nis_number, 
    ms.ms_plz, 
    ms.ms_community, 
    ms.ms_street, 
    ms.ms_status, 
    ms.ms_coord_x, 
    ms.ms_coord_y, 
    ms.ms_ao_id
   FROM foo.ao
   FULL JOIN foo.ms ON ao.ao_id = ms.ms_ao_id;

Queries on the view are quite fast – until I check specific fields for is null, like:

select count(*) from foo.ao_ms where ao_ms_id is null

In fact, the query never comes to an end, and I have to kill the Postgres process because it consumes my CPU and memory and never releases resources.

This is the query plan:

Aggregate  (cost=15699.31..15699.32 rows=1 width=0)
  ->  Hash Full Join  (cost=6467.13..15697.24 rows=829 width=0)
        Hash Cond: (ao.ao_id = ms.ms_ao_id)
        Filter: (ao.ao_ms_id IS NULL)
        ->  Seq Scan on ao  (cost=0.00..3664.65 rows=162465 width=40)
        ->  Hash  (cost=3747.39..3747.39 rows=165739 width=8)
              ->  Seq Scan on ms  (cost=0.00..3747.39 rows=165739 width=8)

The fun fact is; two of my friends have imported my database. On one machine it was exactly the same behavior, but on the third machine, execution completed within 200ms. There, the query plan looks like this:

Aggregate  (cost=15916.96..15916.97 rows=1 width=0)"
  ->  Hash Full Join  (cost=6647.46..15751.46 rows=66202 width=0)
        Hash Cond: (ms.ms_ao_id = ao.ao_id)
        Filter: (ao.ao_ms_id IS NULL)
        ->  Seq Scan on ms  (cost=0.00..3748.39 rows=165739 width=8)
        ->  Hash  (cost=3664.65..3664.65 rows=162465 width=17)
              ->  Seq Scan on ao  (cost=0.00..3664.65 rows=162465 width=17)

I re-wrote my view then to change order of the two tables for the join:

...
FROM foo.ms
   FULL JOIN foo.ao ON ms.ms_ao_id = ao.ao_id;

but this had no effect. I already implemented some indices, changed data types of text fields to character varying, and used select count(1) instead of (*), but performance still sucks. By the way, the ao_ms_id field has values like XYZ1234567.

We all have PostgreSQL 9.3 installed, so why only on one machine is the better query plan executed, whereas on the other two machines an inefficient plan is executed? And how can I force Postgres to use the faster query plan? Or did I setup the view the wrong way?

Edit: the following statements are done within a few milliseconds, therefore I think it has something to do with the is null statement (not null doesn't cause the problem):

select count(*) from foo.ao_ms where ao_ms_id is not null
select count(*) from foo.ao_ms where ao_ms_id like ''

Update:
The problem was first fixed by upgrading Postgres to version 9.4, but now appears again. Still the same: on one machine (productive system), the query leads to hanging processes, whereas on other machines it works like a charm.

When I SET enable_seqscan = OFF, then all the problematic queries on the view are working as fast as on my local machine. The query plan looks like this with seqscan "disabled" when I do the following query (which leads to the hanging process on the problematic machine):

EXPLAIN ANALYZE
SELECT count(*) FROM foo.ao_ms 
   WHERE ( CAST(ao_ms_id AS TEXT) ilike '' or ao_ms_id is null )

Query Plan:

Aggregate  (cost=24927.24..24927.25 rows=1 width=0) (actual time=220.543..220.543 rows=1 loops=1)
  ->  Merge Full Join  (cost=0.84..24801.33 rows=50366 width=0) (actual time=0.030..214.889 rows=100919 loops=1)
        Merge Cond: (ao.ao_id = ms.ms_ao_id)
        Filter: (((ao.ao_ms_id)::text ~~* ''::text) OR (ao.ao_ms_id IS NULL))
        Rows Removed by Filter: 117809
        ->  Index Scan using idx_ao_id on ao  (cost=0.42..9559.38 rows=166434 width=17) (actual time=0.016..36.979 rows=166434 loops=1)
        ->  Index Only Scan using fki_ao_id on ms  (cost=0.42..12951.07 rows=170114 width=8) (actual time=0.011..48.784 rows=170114 loops=1)
              Heap Fetches: 170114

I know, as described on the Postgres documentation, this should not be a permanent solution. So I am going to check other options too, like default_statistics_target and playing with planner cost values.

Has anyone any hints, what values could be set higher/lower/on/off to enforce better planer behaviour in this case?

Feedback from comments:

  • There is no, or just a really small, difference between the expected and actual row counts
  • The full join is needed, or at least it makes it a lot easier for client programming, where I show two grids that should be "lockable" on records that match the join.
  • I created btree indices on every column to get better performance, but it did not help. I also altered the columns from character varying to text, with the same result.
  • I don't understand why the seq scan on ao shows width=17 on my friend's analyze output, but width=40 for me.
  • One table is 28MB on disk, the other is 29MB. This value doesn't change after a vacuum.

Best Answer

This is more a hunch as I don't know the optimizer details and possible rewritings but FULL JOIN is quite restricting.

So, you could rewrite the view by splitting the FULL JOIN to a UNION ALL of 2 joins:

SELECT ... 
FROM ao LEFT JOIN ms 
    ON ao.ao_id = ms.ms_ao_id 

UNION ALL 

SELECT ... 
FROM ao RIGHT JOIN ms 
    ON ao.ao_id = ms.ms_ao_id 
WHERE ao.ao_id IS NULL ;

This might help the optimizer to identify that for the condition (where ao_ms_id is null) the nulls can be produced in 2 different ways, one from the table itself and the other as a by-product of the outer join. So the condition can be pushed down for the first part of the union and removed completely for the second part.

If that doesn't work, you could "simplify" the count(*) query yourself:

SELECT 
  ( SELECT COUNT(*)
    FROM ao LEFT JOIN ms 
        ON ao.ao_id = ms.ms_ao_id 
    WHERE ao.ao_ms_id IS NULL
  )
    +
  ( SELECT COUNT(*)
    FROM ao RIGHT JOIN ms 
        ON ao.ao_id = ms.ms_ao_id 
    WHERE ao.ao_id IS NULL 
  --  AND ao.ao_ms_id IS NULL           -- removed
  ) ;