PostgreSQL – Postgres 9.4.4 query takes forever

Tags: performance, postgresql, postgresql-9.4, query-performance

We're running Postgres 9.4.4 on CentOS 6.5 and have a SELECT query that worked for years but stopped completing after we upgraded from 9.2; it now hangs. (It took a while to notice, so I don't know whether the problem started immediately after the upgrade.)

SELECT id || ':' || group_number AS uniq_id
FROM   table_one
WHERE  id || ':' || group_number NOT IN (
   SELECT id || ':' || group_number
   FROM table_two
   )
AND    id NOT IN (
   SELECT id
   FROM table_three
   WHERE timestamp > NOW() - INTERVAL '30 days' 
   AND client_id > 0
   );

In all tables, id is logically an integer but is stored as character varying(15) (legacy system). group_number is stored as smallint.
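
For reference, a minimal sketch of the schema as described (column names come from the question; everything else is assumed for illustration):

CREATE TABLE table_one (
   id           varchar(15),  -- integer values stored as text (legacy)
   group_number smallint
);
CREATE TABLE table_two (
   id           varchar(15),
   group_number smallint
);
CREATE TABLE table_three (
   id          varchar(15),
   client_id   integer,
   "timestamp" timestamp     -- quoted here because timestamp is also a type name
);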

The sub-query for table_two returns about 2.5 million records. The sub-query for table_three returns about 2,500 records. Each returns in about 1 second when run on its own. But adding either one (or both) as a sub-query causes the main query to hang indefinitely (for days, if we let it run).

I've seen others online report the same problem (a query that never returns when using NOT IN). NOT IN seems like such a straightforward sub-query.

We have plenty of hardware (384 GB RAM, 64-core Xeon, 16-disk 15k RPM RAID 10).

  1. Why is this happening? (i.e. is this a major ongoing bug in Postgres?)
  2. How can I fix / debug it in the meantime?

Here are the results of EXPLAIN:

QUERY PLAN
Index Only Scan using table_one_id_pk on table_one  (cost=19690.90..64045129699.10 rows=370064 width=9)
  Filter: ((NOT (hashed SubPlan 2)) AND (NOT (SubPlan 1)))
  SubPlan 2
    ->  Bitmap Heap Scan on table_three  (cost=2446.92..19686.74 rows=8159 width=7)
          Recheck Cond: (("timestamp" > (now() - '30 days'::interval)) AND (client_id > 0))
          ->  BitmapAnd  (cost=2446.92..2446.92 rows=8159 width=0)
                ->  Bitmap Index Scan on table_one_timestamp_idx  (cost=0.00..1040.00 rows=79941 width=0)
                      Index Cond: ("timestamp" > (now() - '30 days'::interval))
                ->  Bitmap Index Scan on fki_table_three_client_id  (cost=0.00..1406.05 rows=107978 width=0)
                      Index Cond: (client_id > 0)
  SubPlan 1
    ->  Materialize  (cost=0.00..84813.75 rows=3436959 width=9)
          ->  Seq Scan on table_two  (cost=0.00..64593.79 rows=3436959 width=9)

My settings from postgresql.conf

max_connections = 200
shared_buffers = 24GB
temp_buffers = 8MB
work_mem = 96MB
maintenance_work_mem = 1GB
cpu_tuple_cost = 0.0030
cpu_index_tuple_cost = 0.0010
cpu_operator_cost = 0.0005
effective_cache_size = 128GB
from_collapse_limit = 4
join_collapse_limit = 4

UPDATE

I used the following method to adjust the work_mem just for this query:

BEGIN;
SET work_mem = '256MB';
-- query --
SET work_mem = default;
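-- a plain SET persists after COMMIT, hence the explicit reset here;
-- SET LOCAL would revert automatically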
COMMIT;

Using NOT IN returned in 5 – 8 seconds (vs never with work_mem = 96MB).

Using LEFT JOIN returned in 13 – 14 seconds (vs 24 seconds with work_mem = 96MB).

So it looks like the problem was with work_mem, and using LEFT JOIN was just a workaround. However, the real problem is Postgres going for days with work_mem = 96MB.

With 16 x 15k SAS drives in RAID 10 we have very fast I/O, so even if the query spilled to disk it should have returned, just a bit slower.

UPDATE 2

Here are the results for EXPLAIN ANALYZE on the LEFT JOIN approach:

    QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop Anti Join  (cost=27318.56..351160.97 rows=728325 width=9) (actual time=9553.378..21247.202 rows=7 loops=1)
   ->  Hash Anti Join  (cost=27318.47..176945.69 rows=1501249 width=9) (actual time=511.578..5479.549 rows=1478438 loops=1)
         Hash Cond: ((t1.id)::text = (t3.id)::text)
         ->  Seq Scan on table_one t1  (cost=0.00..143842.21 rows=1593403 width=9) (actual time=0.026..4369.804 rows=1485291 loops=1)
         ->  Hash  (cost=27289.76..27289.76 rows=8203 width=7) (actual time=511.518..511.518 rows=1286 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 51kB
               ->  Bitmap Heap Scan on table_three t3  (cost=1518.79..27289.76 rows=8203 width=7) (actual time=125.379..510.998 rows=1286 loops=1)
                     Recheck Cond: (client_id > 0)
                     Filter: ("timestamp" > (now() - '30 days'::interval))
                     Rows Removed by Filter: 104626
                     Heap Blocks: exact=16093
                     ->  Bitmap Index Scan on fki_table_three_client_id  (cost=0.00..1518.38 rows=108195 width=0) (actual time=121.633..121.633 rows=122976 loops=1)
                           Index Cond: (client_id > 0)
   ->  Index Only Scan using t_table_two_id2_idx on table_two t2  (cost=0.09..0.14 rows=1 width=9) (actual time=0.010..0.010 rows=1 loops=1478438)
         Index Cond: ((id = (t1.id)::text) AND (group_number = t1.group_number))
         Heap Fetches: 143348
 Planning time: 30.527 ms
 Execution time: 21247.541 ms
(18 rows)

Time: 23697.256 ms

And here they are for the NOT EXISTS approach:

    QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop Anti Join  (cost=27318.56..351160.97 rows=728325 width=9) (actual time=5117.110..14061.838 rows=7 loops=1)
   ->  Hash Anti Join  (cost=27318.47..176945.69 rows=1501249 width=9) (actual time=146.779..1254.400 rows=1478439 loops=1)
         Hash Cond: ((t1.id)::text = (t3.id)::text)
         ->  Seq Scan on table_one t1  (cost=0.00..143842.21 rows=1593403 width=9) (actual time=0.007..591.383 rows=1485291 loops=1)
         ->  Hash  (cost=27289.76..27289.76 rows=8203 width=7) (actual time=146.758..146.758 rows=1285 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 51kB
               ->  Bitmap Heap Scan on table_three t3  (cost=1518.79..27289.76 rows=8203 width=7) (actual time=17.586..146.330 rows=1285 loops=1)
                     Recheck Cond: (client_id > 0)
                     Filter: ("timestamp" > (now() - '30 days'::interval))
                     Rows Removed by Filter: 104627
                     Heap Blocks: exact=16093
                      ->  Bitmap Index Scan on fki_table_three_client_id  (cost=0.00..1518.38 rows=108195 width=0) (actual time=14.415..14.415 rows=122976 loops=1)
                           Index Cond: (client_id > 0)
   ->  Index Only Scan using t_table_two_id2_idx on table_two t2  (cost=0.09..0.14 rows=1 width=9) (actual time=0.008..0.008 rows=1 loops=1478439)
         Index Cond: ((id = (t1.id)::text) AND (group_number = t1.group_number))
         Heap Fetches: 143348
 Planning time: 2.155 ms
 Execution time: 14062.014 ms
(18 rows)

Time: 14065.573 ms

Best Answer

Assuming you have checked off the usual suspects in the wiki page, as @a_horse commented.

Also see the spin-off question addressing the discussion of bitmap index scans and the size of work_mem.

Query

This rewritten query should be substantially faster:

SELECT id || ':' || group_number AS uniq_id
    -- id::text || ':' || group_number AS uniq_id  -- after converting id to integer
FROM   table_one t1
WHERE  NOT EXISTS ( 
   SELECT 1
   FROM   table_two t2
   WHERE  t2.id = t1.id
   AND    t2.group_number = t1.group_number
   ) 
AND NOT EXISTS (
   SELECT 1
   FROM   table_three t3
   WHERE  t3.timestamp > NOW() - interval '30 days' 
   AND    t3.client_id > 0
   AND    t3.id = t1.id
   );
  • The most important problem is comparing a concatenated string between table_one and table_two, which is generally more expensive than necessary and specifically not sargable.

  • Storing integer numbers as strings is expensive nonsense. You seem to be aware of that. Convert to integer if at all possible. If you only have valid integer numbers in the varchar column id, all you need to do is:

    ALTER TABLE table_one ALTER COLUMN id TYPE integer USING id::int;
    

    And possibly the same for table_two. (A quick check that id really holds only valid integers is sketched below this list.)

  • NOT IN carries a trap for NULL values on either side. That's why NOT EXISTS is almost always better (and it typically performs better on top of that). A minimal demonstration of the NULL trap follows directly below.
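
To see the trap in isolation, here is a self-contained sketch (it does not use the original tables): a single NULL on the right-hand side keeps NOT IN from ever being true, while NOT EXISTS behaves as intended.

SELECT x
FROM  (VALUES (1), (2)) AS t(x)
WHERE  x NOT IN (SELECT y FROM (VALUES (2), (NULL::int)) AS n(y));
-- 0 rows: "x NOT IN (2, NULL)" evaluates to false or NULL, never true

SELECT x
FROM  (VALUES (1), (2)) AS t(x)
WHERE  NOT EXISTS (
   SELECT 1
   FROM  (VALUES (2), (NULL::int)) AS n(y)
   WHERE  n.y = t.x
   );
-- 1 row (x = 1), as intended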
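
And before running the ALTER TABLE above, a quick sanity check may be worthwhile (a sketch; the regex assumes plain unsigned integers with no sign or whitespace):

SELECT id
FROM   table_one
WHERE  id !~ '^[0-9]+$';  -- any row returned here would fail the cast to integer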

Indexes

Either way, matching indexes are key to performance.

Be sure to have a multicolumn index on each of table_one and table_two:

CREATE INDEX t1_foo_idx ON table_one (id, group_number);
CREATE INDEX t2_foo_idx ON table_two (id, group_number);

This possibly allows index-only scans.
With integer instead of varchar, these indexes would be smaller and more efficient still.

And I suggest a partial multicolumn index on table_three:

CREATE INDEX t3_foo_idx ON table_three (timestamp, id)
WHERE  client_id > 0
AND    timestamp > '2015-06-07 0:0';

Usefulness deteriorates over time. Recreate the index with an increased lower bound at opportune moments. That takes an exclusive lock on the table, so consider CREATE INDEX CONCURRENTLY; a sketch of the procedure follows.
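
A possible recreate procedure (the new index name and the raised cutoff date are illustrative; note that CREATE INDEX CONCURRENTLY cannot run inside a transaction block):

CREATE INDEX CONCURRENTLY t3_foo_idx_new ON table_three (timestamp, id)
WHERE  client_id > 0
AND    timestamp > '2015-07-01 0:0';  -- raised lower bound

DROP INDEX CONCURRENTLY t3_foo_idx;   -- waits instead of blocking concurrent queries
ALTER INDEX t3_foo_idx_new RENAME TO t3_foo_idx;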

You need to match the (updated) index condition in your queries. Add the condition even if that seems redundant. Like:

...
AND NOT EXISTS (
   SELECT 1
   FROM   table_three t3
   WHERE  t3.timestamp > NOW() - interval '30 days' 
   AND    t3.timestamp > '2015-06-07 0:0'  -- match index condition
   AND    t3.client_id > 0
   AND    t3.id = t1.id
   );

You can use a function as a pseudo-constant in both the partial index and the query, and automate the process.
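
A sketch of that technique (the function name f_min_ts() is made up for illustration). The function must be declared IMMUTABLE to be allowed in an index predicate; when you later redefine it with a newer cutoff, recreate the index as well, since its entries were collected under the old value:

CREATE OR REPLACE FUNCTION f_min_ts()
  RETURNS timestamp
  LANGUAGE sql IMMUTABLE AS
$$SELECT timestamp '2015-06-07 0:0'$$;

CREATE INDEX t3_foo_idx ON table_three (timestamp, id)
WHERE  client_id > 0
AND    timestamp > f_min_ts();

-- queries repeat the same expression so the planner can match the index:
--   AND t3.timestamp > f_min_ts()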

SET LOCAL

As you found out yourself, increasing work_mem locally for one query helps if the query needs that much RAM. Consider SET LOCAL, which reverts automatically at the end of the transaction:
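
A minimal pattern (the same transaction shape as in your update, minus the manual reset):

BEGIN;
SET LOCAL work_mem = '256MB';  -- reverts automatically at COMMIT or ROLLBACK
-- query --
COMMIT;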

With all suggested improvements in place you may not need to increase work_mem any more.