Postgresql – Optimize query with multiple IN statements

index-tuningoptimizationpostgresqlpostgresql-9.5

So I have a large query which I have been trying to optimize as over time it has gotten fairly slow (~116k records taking a few seconds to get a result). The context is a social network feed where the user is "authorized" to see the posts. Each post has an associated tag (which can be a whole location, a classroom or a student. The slow part in question looks like this:

WHERE (
  "post"."deletedAt" IS NULL AND (
    EXISTS (
      SELECT 1
        FROM "tags"
       WHERE "tags"."postId" = "post"."id" AND (
         "tags"."classroomId" IN (classroom ids) OR
         "tags"."locationId" IN (location ids) OR
         "tags"."studentId" IN (student ids)
       )
    )
  )
)

When I run EXPLAIN ANALYZE on the query I get this…

->  HashAggregate  (cost=4636.49..4637.35 rows=86 width=4) (actual time=23.245..23.269 rows=47 loops=1)"                                                                                                            
    Group Key: tags_1."postId"                                                                                                              
    ->  Seq Scan on tags tags_1  (cost=0.00..4636.22 rows=107 width=4) (actual time=1.147..23.214 rows=47 loops=1)                                                                                                                    
        Filter: (("classroomId" = 11) OR ("locationId" = 27) OR ("studentId" = 29))                                                                                                                   
        Rows Removed by Filter: 136955

It would be ideal to be able to somehow index these fields so that it would be faster to get all posts for specific users and places. Does the community here have any recommendations for how I would be able to speed this section of query up?

I managed, so far, to make the query about 50% faster by indexing on the comments and likes but this is still a large bottleneck for me. I have tried a few indices to help this but it hasn't really helped. Maybe indices aren't the problem, is there a more efficient way to do multiple IN queries that I'm not aware of?

I appreciate any feedback anyone can give! This is Postgres 9.5.2 by the way.

Best Answer

Possible optimisations

There are a few tricks you can employ:

  1. Use a partial index to try to skip out all the deleted_at rows.
  2. Use a few strategic composite indexes on tags to look for specific classrooms, students or locations.
  3. Try to have those indexes to be covering indexes
  4. Make sure that PostgreSQL has got all the information about row visibility (i.e. VACUUM ANALYZE) so thtat PostgreSQL tries to use index only scans.
  5. Have the query avoid ORs in conditions (they tend to be very difficult to optimize). Use UNION instead.

Practical approach

This is how I have tried to mimick your scenario on DBFiddle (I have slightly changed your column naming and followed a more "PostgreSQL-standardish" approach):

First, I create a simplified version of your tables, and fill it with an amount of data similar to the one you state

 -- Simplified version of posts table...
 CREATE TABLE posts
 (
     post_id integer primary key,
     deleted_at timestamp without time zone default NULL
 ) ;

 -- Simplified version of tags table...
 CREATE TABLE tags 
 (
     tag_id integer primary key,
     post_id integer NOT NULL REFERENCES posts(post_id),

     classroom_id integer,
     location_id integer,
     student_id integer
 ) ;

 -- Insert 116K posts
 INSERT INTO posts (post_id, deleted_at) 
 SELECT
    generate_series(1, 116001),
    case when random() < 0.25 then date '2017-01-01' else NULL end AS deleted_at ;

 -- Insert 136K tags
 INSERT INTO tags(tag_id, post_id, classroom_id, location_id, student_id)
 SELECT
      generate_series(1, 136000) AS tag_id, 
      random()*116000 +1 AS post_id, 
      random()*100 +1 AS clasroom_id, 
      random()*100 +1 AS location_id, 
      random()*100 +1 AS student_id ;

At this point, I execute a query taht tries to mimick the provided WHERE condition, and have PostgreSQL explain it:

 -- Explain how long does it take to use the "reference query"
 EXPLAIN ANALYZE 
 SELECT
     post_id
 FROM
     posts
 WHERE
     posts.deleted_at IS NULL AND 
     (
     EXISTS 
       (
       SELECT 1
         FROM tags
        WHERE tags.post_id = posts.post_id  AND 
          (
          tags.classroom_id IN (1, 20, 35) OR
          tags.location_id IN (32, 40, 23) OR
          tags.student_id IN (20, 30, 40)
          )
       )
     ) ;

 -- DBFiddle (at this moment, and under current load conditions) takes about 70 to 120 ms to perform this query

 | QUERY PLAN                                                                                                                                                             |
 | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | Nested Loop  (cost=4015.66..5369.37 rows=276 width=4) (actual time=41.336..72.243 rows=8094 loops=1)                                                                   |
 |   ->  HashAggregate  (cost=4015.37..4017.37 rows=200 width=4) (actual time=41.320..44.876 rows=10787 loops=1)                                                          |
 |         Group Key: tags.post_id                                                                                                                                        |
 |         ->  Seq Scan on tags  (cost=0.00..3999.04 rows=6534 width=4) (actual time=0.035..37.036 rows=11284 loops=1)                                                    |
 |               Filter: ((classroom_id = ANY ('{1,20,35}'::integer[])) OR (location_id = ANY ('{32,40,23}'::integer[])) OR (student_id = ANY ('{20,30,40}'::integer[]))) |
 |               Rows Removed by Filter: 124716                                                                                                                           |
 |   ->  Index Scan using posts_pkey on posts  (cost=0.29..6.75 rows=1 width=4) (actual time=0.002..0.002 rows=1 loops=10787)                                             |
 |         Index Cond: (post_id = tags.post_id)                                                                                                                           |
 |         Filter: (deleted_at IS NULL)                                                                                                                                   |
 |         Rows Removed by Filter: 0                                                                                                                                      |
 | Planning time: 0.353 ms                                                                                                                                                |
 | Execution time: 73.526 ms                                                                                                                                              |
 

To use the tricks explained before:

 -- We create a few indexes
 -- One *conditional* index to clear out "deleted_at" rows in posts
 CREATE INDEX idx_posts_alive ON posts(post_id) WHERE deleted_at IS NULL ;

 -- Compound indexes for classroom, location and student + post_id in the tags table
 CREATE INDEX idx_tags_classroom_id ON tags(classroom_id, post_id) ;
 CREATE INDEX idx_tags_location_id ON tags(location_id, post_id) ;
 CREATE INDEX idx_tags_student_id ON tags(student_id, post_id) ;

and make statistics up-to-date for PostgreSQL:

 -- Update all table information to allow for index-only-scans
 VACUUM ANALYZE tags;

 -- (id. just in case, for table posts)
 VACUUM ANALYZE posts;

And now, we

-- An alternate version of the query, that focus first on tags (using index-only scans) -- It avoids ORing conditions (which tend to be difficult to optimize) -- and uses, instead, UNIONs

 EXPLAIN ANALYZE 
 SELECT
     posts.post_id
 FROM
     tags 
     JOIN posts ON posts.deleted_at IS NULL and posts.post_id = tags.post_id
 WHERE
     tags.classroom_id IN (1, 20, 35)
 UNION
 SELECT
     posts.post_id
 FROM
     tags 
     JOIN posts ON posts.deleted_at IS NULL and posts.post_id = tags.post_id
 WHERE
     tags.location_id IN (32, 40, 23) 
 UNION
 SELECT
     posts.post_id
 FROM
     tags 
     JOIN posts ON posts.deleted_at IS NULL and posts.post_id = tags.post_id
 WHERE
     tags.student_id IN (20, 30, 40) ;

 -- In DBFiddle, as of now, this provides about 31 ms exeuction time (about 3x better than the original query)

 | QUERY PLAN                                                                                                                                                      |
 | :-------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | HashAggregate  (cost=7097.27..7185.50 rows=8823 width=4) (actual time=34.253..36.499 rows=8094 loops=1)                                                         |
 |   Group Key: posts.post_id                                                                                                                                      |
 |   ->  Append  (cost=0.71..7075.22 rows=8823 width=4) (actual time=0.044..30.810 rows=8688 loops=1)                                                              |
 |         ->  Nested Loop  (cost=0.71..2335.97 rows=2956 width=4) (actual time=0.044..8.648 rows=2562 loops=1)                                                    |
 |               ->  Index Only Scan using idx_tags_classroom_id on tags  (cost=0.42..118.05 rows=3931 width=4) (actual time=0.030..0.876 rows=3391 loops=1)       |
 |                     Index Cond: (classroom_id = ANY ('{1,20,35}'::integer[]))                                                                                   |
 |                     Heap Fetches: 0                                                                                                                             |
 |               ->  Index Only Scan using idx_posts_alive on posts  (cost=0.29..0.55 rows=1 width=4) (actual time=0.001..0.002 rows=1 loops=3391)                 |
 |                     Index Cond: (post_id = tags.post_id)                                                                                                        |
 |                     Heap Fetches: 0                                                                                                                             |
 |         ->  Nested Loop  (cost=0.71..2354.87 rows=2999 width=4) (actual time=0.023..9.642 rows=3037 loops=1)                                                    |
 |               ->  Index Only Scan using idx_tags_location_id on tags tags_1  (cost=0.42..119.03 rows=3987 width=4) (actual time=0.018..1.090 rows=4076 loops=1) |
 |                     Index Cond: (location_id = ANY ('{32,40,23}'::integer[]))                                                                                   |
 |                     Heap Fetches: 0                                                                                                                             |
 |               ->  Index Only Scan using idx_posts_alive on posts posts_1  (cost=0.29..0.55 rows=1 width=4) (actual time=0.001..0.001 rows=1 loops=4076)         |
 |                     Index Cond: (post_id = tags_1.post_id)                                                                                                      |
 |                     Heap Fetches: 0                                                                                                                             |
 |         ->  Nested Loop  (cost=0.71..2296.15 rows=2868 width=4) (actual time=0.023..9.715 rows=3089 loops=1)                                                    |
 |               ->  Index Only Scan using idx_tags_student_id on tags tags_2  (cost=0.42..115.99 rows=3813 width=4) (actual time=0.018..1.159 rows=4117 loops=1)  |
 |                     Index Cond: (student_id = ANY ('{20,30,40}'::integer[]))                                                                                    |
 |                     Heap Fetches: 0                                                                                                                             |
 |               ->  Index Only Scan using idx_posts_alive on posts posts_2  (cost=0.29..0.56 rows=1 width=4) (actual time=0.001..0.001 rows=1 loops=4117)         |
 |                     Index Cond: (post_id = tags_2.post_id)                                                                                                      |
 |                     Heap Fetches: 0                                                                                                                             |
 | Planning time: 0.667 ms                                                                                                                                         |
 | Execution time: 37.769 ms                                                                                                                                       |
 

The end result is a 3x speed increase when doing SELECT. Obviously, this will need updating depending a lot on your specific case (I've made a very simplified version to show the tricks). Obviously, this comes at the cost of maintaining a certain number of indexes, and increasing INSERT and UPDATE times. You hae to analyze, for your specific case, if this is an overall improvement or an overall worsening..

dbfiddle here