PostgreSQL Performance – Use Nested Loop with Indices Over Hash Join

execution-plan, performance, postgresql, query-performance

I'm having a problem with some data, modeled on the Stack Overflow schema, loaded into PostgreSQL 9.3.4. One query is taking about 10x longer than it should because the planner chooses a hash join instead of a nested loop with indices. For example, if I select 500 users in the query, a hash join is used rather than the id and type indices on the post_tokenized table:

explain 
select creation_epoch, user_screen_name, chunk from post_tokenized as tokenized_tbl
                    join posts as posts_tbl
                    on posts_tbl.id = tokenized_tbl.id
                    where type = 'tag'
                    and user_screen_name is not null
                    and owner_user_id in (select id from users where reputation > 100000 order by reputation asc limit 500)
                    and tokenized_tbl.id in (select id from posts where owner_user_id in (select id from users where reputation > 100000 order by reputation asc limit 500))

 Hash Join  (cost=29570.13..751852.55 rows=119954 width=21)
   Hash Cond: (tokenized_tbl.id = posts_tbl.id)
   ->  Index Scan using type_index_post_tokenized on post_tokenized tokenized_tbl  (cost=0.44..646219.29 rows=20281711 width=8)
         Index Cond: (type = 'tag'::text)
   ->  Hash  (cost=29561.73..29561.73 rows=637 width=25)
         ->  Hash Join  (cost=15576.75..29561.73 rows=637 width=25)
               Hash Cond: (posts_tbl.id = posts.id)
               ->  Nested Loop  (cost=48.20..12824.71 rows=106853 width=21)
                     ->  HashAggregate  (cost=47.76..52.76 rows=500 width=4)
                           ->  Limit  (cost=0.43..41.51 rows=500 width=8)
                                 ->  Index Scan using reputation_index_users on users  (cost=0.43..211.57 rows=2570 width=8)
                                       Index Cond: (reputation > 100000)
                     ->  Index Scan using owner_user_id_index_posts on posts posts_tbl  (cost=0.44..23.40 rows=214 width=25)
                           Index Cond: (owner_user_id = users.id)
                           Filter: (user_screen_name IS NOT NULL)
               ->  Hash  (cost=14181.63..14181.63 rows=107754 width=4)
                     ->  HashAggregate  (cost=13104.09..14181.63 rows=107754 width=4)
                           ->  Nested Loop  (cost=48.20..12834.71 rows=107754 width=4)
                                 ->  HashAggregate  (cost=47.76..52.76 rows=500 width=4)
                                       ->  Limit  (cost=0.43..41.51 rows=500 width=8)
                                             ->  Index Scan using reputation_index_users on users users_1  (cost=0.43..211.57 rows=2570 width=8)
                                                   Index Cond: (reputation > 100000)
                                 ->  Index Scan using owner_user_id_index_posts on posts  (cost=0.44..23.40 rows=216 width=8)
                                       Index Cond: (owner_user_id = users_1.id)

But if I decrease the number of users to 200, then a nested loop with indices is used (much faster):

explain 
select creation_epoch, user_screen_name, chunk from post_tokenized as tokenized_tbl
                    join posts as posts_tbl
                    on posts_tbl.id = tokenized_tbl.id
                    where type = 'tag'
                    and user_screen_name is not null
                    and owner_user_id in (select id from users where reputation > 100000 order by reputation asc limit 200)
                    and tokenized_tbl.id in (select id from posts where owner_user_id in (select id from users where reputation > 100000 order by reputation asc limit 200))

 Nested Loop  (cost=6633.63..466114.15 rows=47982 width=21)
   ->  Hash Join  (cost=6291.07..11836.00 rows=102 width=25)
         Hash Cond: (posts_tbl.id = posts.id)
         ->  Nested Loop  (cost=19.80..5189.72 rows=42741 width=21)
               ->  HashAggregate  (cost=19.36..21.36 rows=200 width=4)
                     ->  Limit  (cost=0.43..16.86 rows=200 width=8)
                           ->  Index Scan using reputation_index_users on users  (cost=0.43..211.57 rows=2570 width=8)
                                 Index Cond: (reputation > 100000)
               ->  Index Scan using owner_user_id_index_posts on posts posts_tbl  (cost=0.44..23.70 rows=214 width=25)
                     Index Cond: (owner_user_id = users.id)
                     Filter: (user_screen_name IS NOT NULL)
         ->  Hash  (cost=5732.50..5732.50 rows=43102 width=4)
               ->  HashAggregate  (cost=5301.48..5732.50 rows=43102 width=4)
                     ->  Nested Loop  (cost=19.80..5193.72 rows=43102 width=4)
                           ->  HashAggregate  (cost=19.36..21.36 rows=200 width=4)
                                 ->  Limit  (cost=0.43..16.86 rows=200 width=8)
                                       ->  Index Scan using reputation_index_users on users users_1  (cost=0.43..211.57 rows=2570 width=8)
                                             Index Cond: (reputation > 100000)
                           ->  Index Scan using owner_user_id_index_posts on posts  (cost=0.44..23.70 rows=216 width=8)
                                 Index Cond: (owner_user_id = users_1.id)
   ->  Bitmap Heap Scan on post_tokenized tokenized_tbl  (cost=342.56..4448.69 rows=502 width=8)
         Recheck Cond: (id = posts_tbl.id)
         Filter: (type = 'tag'::text)
         ->  Bitmap Index Scan on id_index_post_tokenized  (cost=0.00..342.44 rows=43656 width=0)
               Index Cond: (id = posts_tbl.id)

How can I get the same plan (nested loop with indices) to be used when 500 users are selected? I've tried adjusting cpu_tuple_cost, seq_page_cost, random_page_cost, and effective_cache_size, but I can't get the plan to change. The plan switches as the number of requested users increases, but from testing in my environment, the query would be much faster if Postgres kept the nested-loop plan even at 500 users.
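For reference, I was adjusting those settings per session while testing, roughly like this (the values shown are illustrative only, not what I settled on):

```sql
SET random_page_cost = 1.1;        -- default is 4.0; lower values favor index scans
SET cpu_tuple_cost = 0.01;         -- the default; raised and lowered while testing
SET effective_cache_size = '4GB';  -- planner hint about cache size; allocates no memory
-- then re-run EXPLAIN (or EXPLAIN ANALYZE) on the query above to see if the plan changed
```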

Best Answer

This closely related answer on SO should provide answers to your primary question:
Setting enable_seqscan = off in a single SELECT query

You could use it in a similar fashion to disable hash joins for the current transaction:

SET LOCAL enable_hashjoin=off;
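SET LOCAL only lasts until the end of the current transaction, so the pattern would be roughly this (with your query abbreviated):

```sql
BEGIN;
SET LOCAL enable_hashjoin = off;  -- discourages hash joins by inflating their cost estimate
SELECT creation_epoch, user_screen_name, chunk
FROM   ...;                       -- the query from the question
COMMIT;                           -- the setting is reverted when the transaction ends
```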

But that's not my advice. Read the answer over there.
And this one about statistics and cost settings, too.

More importantly, untangle your query first:

SELECT creation_epoch, user_screen_name, chunk
FROM  (
   SELECT id AS owner_user_id
   FROM   users
   WHERE  reputation > 100000
   ORDER  BY reputation 
   LIMIT  500
   ) u
JOIN   posts p USING (owner_user_id)
JOIN   post_tokenized t USING (id)
WHERE  type = 'tag'
AND    user_screen_name IS NOT NULL;

This should be considerably faster and also make it easier for the query planner to choose the best plan (given sane cost settings and table statistics).
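If row misestimates persist even for the untangled query, one option (untested against your data) is to raise the statistics target on the columns involved and re-analyze, so the planner gets better estimates:

```sql
ALTER TABLE post_tokenized ALTER COLUMN id   SET STATISTICS 1000;  -- default target is 100
ALTER TABLE post_tokenized ALTER COLUMN type SET STATISTICS 1000;
ANALYZE post_tokenized;  -- recollect statistics so the new target takes effect
```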