PostgreSQL – Why does a list of 10,000 IDs perform better than using the equivalent SQL to select them?

performance, postgresql, postgresql-performance, query-performance

I have a Rails application with a legacy query that I'd like to refurbish. The current implementation performs two SQL queries: one to get a large number of IDs and a second query that uses those IDs and applies some additional joins and filters to get the desired result.

I am trying to replace this with a single query that avoids the round trip, but doing so has incurred a large performance degradation in my local testing environment (which is a copy of the full production dataset). It appears that an index is not being used in the new query, leading to a full table scan. I had hoped the single query would keep the same performance as the original code, ideally improving on it due to not needing to send all the IDs around.

This is a fairly minimized version of my actual problem. A slightly larger version is discussed at Why does a list of 10,000 IDs perform better in a complicated query with multiple CTEs compared to the equivalent SQL to select them?.

Current Query

There's a query that takes ~6.5 seconds to calculate a list of 10,000+ IDs; it appears as the visible_projects CTE in the "Proposed Query" section below. Those IDs are then fed into this query:

EXPLAIN (ANALYZE, BUFFERS)
WITH visible_projects AS NOT MATERIALIZED (
    SELECT
        id
    FROM
        "projects"
    WHERE
        "projects"."id" IN (
            -- 10000+ IDs removed
)),
visible_tasks AS MATERIALIZED (
    SELECT
        tasks.id
    FROM
        tasks
    WHERE
        tasks.project_id IN (
            SELECT
                id
            FROM
                visible_projects))
SELECT
    COUNT(1)
FROM
    visible_tasks;

Query plan (depesz):

Aggregate  (cost=1309912.31..1309912.32 rows=1 width=8) (actual time=148.661..153.739 rows=1 loops=1)
   Buffers: shared hit=73107 read=22301
   CTE visible_tasks
     ->  Gather  (cost=43024.54..1308639.80 rows=56556 width=4) (actual time=46.337..137.260 rows=48557 loops=1)
           Workers Planned: 2
           Workers Launched: 2
           Buffers: shared hit=73107 read=22301
           ->  Nested Loop  (cost=42024.54..1301984.20 rows=23565 width=4) (actual time=28.871..120.682 rows=16186 loops=3)
                 Buffers: shared hit=73107 read=22301
                 ->  Parallel Bitmap Heap Scan on projects  (cost=42023.97..138877.16 rows=4378 width=4) (actual time=28.621..52.627 rows=3502 loops=3)
                       Recheck Cond: (id = ANY ('{ REMOVED_IDS }'::integer[]))
                       Heap Blocks: exact=3536
                       Buffers: shared hit=30410 read=9833
                       ->  Bitmap Index Scan on projects_pkey  (cost=0.00..42021.35 rows=10507 width=0) (actual time=35.642..35.642 rows=10507 loops=1)
                             Index Cond: (id = ANY ('{ REMOVED_IDS }'::integer[]))
                             Buffers: shared hit=30410 read=1111
                 ->  Index Scan using test_tasks_on_project on tasks  (cost=0.57..263.85 rows=182 width=8) (actual time=0.012..0.018 rows=5 loops=10507)
                       Index Cond: (project_id = projects.id)
                       Buffers: shared hit=42697 read=12468
   ->  CTE Scan on visible_tasks  (cost=0.00..1131.12 rows=56556 width=0) (actual time=46.339..144.641 rows=48557 loops=1)
         Buffers: shared hit=73107 read=22301
 Planning:
   Buffers: shared hit=10 read=10
 Planning Time: 8.857 ms
 Execution Time: 156.102 ms

Proposed Query

This is the same query structure, but instead of inserting the 10,000+ IDs directly into the visible_projects CTE, I've embedded the SQL that finds those IDs.

EXPLAIN (ANALYZE, BUFFERS)
WITH visible_projects AS NOT MATERIALIZED (
    SELECT
        id
    FROM
        "projects"
    WHERE
        "projects"."company_id" = 11171
        AND "projects"."state" < 6
        AND "projects"."is_template" = FALSE),
visible_tasks AS MATERIALIZED (
    SELECT
        tasks.id
    FROM
        tasks
    WHERE
        tasks.project_id IN (
            SELECT
                id
            FROM
                visible_projects))
SELECT
    COUNT(1)
FROM
    visible_tasks;

Query plan (depesz):

 Aggregate  (cost=2212223.53..2212223.54 rows=1 width=8) (actual time=40675.984..40686.708 rows=1 loops=1)
   Buffers: shared hit=118145 read=1567727
   CTE visible_tasks
     ->  Gather  (cost=279353.08..2208430.12 rows=168596 width=4) (actual time=7050.894..40666.025 rows=48557 loops=1)
           Workers Planned: 2
           Workers Launched: 2
           Buffers: shared hit=118145 read=1567727
           ->  Hash Join  (cost=278353.08..2190570.52 rows=70248 width=4) (actual time=7038.932..40650.430 rows=16186 loops=3)
                 Hash Cond: (tasks.project_id = projects.id)
                 Buffers: shared hit=118145 read=1567727
                 ->  Parallel Seq Scan on tasks  (cost=0.00..1828314.43 rows=31963043 width=8) (actual time=0.397..29372.029 rows=25572144 loops=3)
                       Buffers: shared read=1508684
                 ->  Hash  (cost=277961.56..277961.56 rows=31322 width=4) (actual time=6977.480..6977.481 rows=10507 loops=3)
                       Buckets: 32768  Batches: 1  Memory Usage: 626kB
                       Buffers: shared hit=118061 read=59031
                       ->  Index Scan using index_projects_on_company_id on projects  (cost=0.43..277961.56 rows=31322 width=4) (actual time=0.591..6970.696 rows=10507 loops=3)
                             Index Cond: (company_id = 11171)
                             Filter: ((NOT is_template) AND (state < 6))
                             Rows Removed by Filter: 63512
                             Buffers: shared hit=118061 read=59031
   ->  CTE Scan on visible_tasks  (cost=0.00..3371.92 rows=168596 width=0) (actual time=7050.896..40671.054 rows=48557 loops=1)
         Buffers: shared hit=118145 read=1567727
 Planning:
   Buffers: shared hit=2 read=18
 Planning Time: 9.528 ms
 Execution Time: 40687.524 ms

Even accounting for the combined time of the two queries in the current implementation, this single query takes roughly 6x as long.

I can see that the planner has chosen a Parallel Seq Scan on tasks, which is the main contributor to the run time. What I don't understand is why it was chosen, and what I should do to get back to using the index.

Through research, I've learned that Postgres doesn't offer query hints to force the use of an index, so I assume that a good solution will involve demonstrating to the query planner that using the index would be beneficial.
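
One diagnostic I know of (not a fix) is to discourage sequential scans for the current session and re-run the EXPLAIN, to check whether the index-based plan would actually be cheaper in practice:

SET enable_seqscan = off;   -- make sequential scans look prohibitively expensive to the planner
-- re-run the EXPLAIN (ANALYZE, BUFFERS) of the proposed query here
RESET enable_seqscan;       -- restore the default afterwards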

Meta

I'm using COUNT(1) combined with the AS MATERIALIZED / AS NOT MATERIALIZED controls in this question to produce a smaller example.

The larger query in the application does not use these, but it also performs some filtering on the tasks table before producing a number of other CTEs and some aggregate metrics as the final result.

Schema

                                                 Table "public.projects"
           Column           |             Type              | Collation | Nullable |               Default
----------------------------+-------------------------------+-----------+----------+--------------------------------------
 id                         | integer                       |           | not null | nextval('projects_id_seq'::regclass)
 name                       | character varying(255)        |           |          |
 description                | text                          |           |          |
 due                        | timestamp without time zone   |           |          |
 created_at                 | timestamp without time zone   |           | not null |
 updated_at                 | timestamp without time zone   |           | not null |
 client_id                  | integer                       |           |          |
 company_id                 | integer                       |           |          |
 repeat                     | boolean                       |           | not null | true
 end_date                   | timestamp without time zone   |           |          |
 prev_id                    | integer                       |           |          |
 next_id                    | integer                       |           |          |
 completed_tasks_count      | integer                       |           | not null | 0
 tasks_count                | integer                       |           | not null | 0
 done_at                    | timestamp without time zone   |           |          |
 state                      | integer                       |           |          |
 schedule                   | text                          |           |          |
 start_date                 | timestamp without time zone   |           |          |
 manager_id                 | integer                       |           |          |
 partner_id                 | integer                       |           |          |
 exschedule                 | text                          |           |          |
 extdue                     | timestamp without time zone   |           |          |
 is_template                | boolean                       |           | not null | false
 predicted_duration         | integer                       |           |          | 0
 budget                     | integer                       |           |          | 0
 cached_effective_due_date  | timestamp without time zone   |           |          |
 cached_manager_fullname    | character varying(255)        |           |          | ''::character varying
 cached_partner_fullname    | character varying(255)        |           |          | ''::character varying
 cached_staffs_fullnames    | text                          |           |          | ''::text
 cached_staffs_ids          | text                          |           |          | ''::text
 cached_label_ids           | character varying(255)        |           |          | ''::character varying
 date_in                    | timestamp without time zone   |           |          |
 cached_label_sum           | integer                       |           |          | 0
 date_out                   | timestamp without time zone   |           |          |
 turn_around_time           | integer                       |           |          | 0
 dues_calculated_at         | timestamp without time zone   |           |          |
 dues                       | timestamp without time zone[] |           |          |
 dues_rewind                | integer[]                     |           |          |
 quickbooks_item_id         | integer                       |           |          |
 perform_final_review       | boolean                       |           | not null | false
 quickbooks_desktop_item_id | integer                       |           |          |
 billing_model_type         | character varying             |           | not null | 'staff'::character varying
 series_id                  | integer                       |           |          |
 shared                     | boolean                       |           |          | false
Indexes:
    "projects_pkey" PRIMARY KEY, btree (id)
    "index_projects_on_cached_effective_due_date" btree (cached_effective_due_date)
    "index_projects_on_client_id" btree (client_id)
    "index_projects_on_company_id" btree (company_id)
    "index_projects_on_manager_id" btree (manager_id)
    "index_projects_on_next_id" btree (next_id)
    "index_projects_on_partner_id" btree (partner_id)
    "index_projects_on_series_id" btree (series_id)
    "index_projects_on_shared_and_is_template" btree (shared, is_template) WHERE shared = true AND is_template = true
Foreign-key constraints:
    "fk_rails_243d23cb48" FOREIGN KEY (quickbooks_desktop_item_id) REFERENCES quickbooks_desktop_items(id)
    "fk_rails_33ba8711de" FOREIGN KEY (quickbooks_item_id) REFERENCES quickbooks_items(id)
    "fk_rails_fcf0ca7614" FOREIGN KEY (series_id) REFERENCES series(id) NOT VALID
Referenced by:
    TABLE "tasks" CONSTRAINT "tasks_project_id_fkey" FOREIGN KEY (project_id) REFERENCES projects(id)

The projects table has 14,273,833 rows.

  • 124,005 are is_template = true

                                               Table "public.tasks"
         Column          |            Type             | Collation | Nullable |              Default
-------------------------+-----------------------------+-----------+----------+-----------------------------------
 id                      | integer                     |           | not null | nextval('tasks_id_seq'::regclass)
 name                    | character varying(255)      |           |          |
 description             | text                        |           |          |
 duedate                 | timestamp without time zone |           |          |
 created_at              | timestamp without time zone |           | not null |
 updated_at              | timestamp without time zone |           | not null |
 project_id              | integer                     |           | not null |
 done                    | boolean                     |           | not null | false
 position                | integer                     |           |          |
 done_at                 | timestamp without time zone |           |          |
 dueafter                | integer                     |           |          |
 done_by_user_id         | integer                     |           |          |
 predicted_duration      | integer                     |           |          |
 auto_predicted_duration | integer                     |           |          | 0
 assignable_id           | integer                     |           |          |
 assignable_type         | character varying           |           |          |
 will_assign_to_client   | boolean                     |           | not null | false
Indexes:
    "tasks_pkey" PRIMARY KEY, btree (id)
    "index_tasks_on_assignable_type_and_assignable_id" btree (assignable_type, assignable_id)
    "index_tasks_on_done_by_user_id" btree (done_by_user_id)
    "index_tasks_on_duedate" btree (duedate)
    "test_tasks_on_project" btree (project_id)
Foreign-key constraints:
    "tasks_project_id_fkey" FOREIGN KEY (project_id) REFERENCES projects(id)

The tasks table has 76,716,433 rows.

System specifications

  • PostgreSQL 13.1
  • 2.9 GHz 6-Core Intel Core i9
  • 32 GB RAM
  • macOS 10.15.7

Best Answer

The main reason for the different query plan is probably the increased number of rows that Postgres estimates to get back from projects:

(cost=0.00..42021.35 rows=10507 width=0) (actual time=35.642..35.642 rows=10507 loops=1)

vs.

(cost=0.43..277961.56 rows=31322 width=4) (actual time=0.591..6970.696 rows=10507 loops=3)

Overestimated by a factor of 3, which is not dramatic, but obviously enough to favor a different (inferior) query plan.
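
To see where that estimate comes from, you can look at the stored column statistics for company_id (just a diagnostic, using the standard pg_stats view):

SELECT null_frac, n_distinct, most_common_vals, most_common_freqs
FROM   pg_stats
WHERE  schemaname = 'public'
AND    tablename  = 'projects'
AND    attname    = 'company_id';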

Assuming projects.is_template is mostly false, I suggest these multicolumn indices:

CREATE INDEX ON projects(company_id, state);

Equality columns first, range columns later.
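
If the relevant queries always exclude templates, you could also test a partial variant; that's an extra option to benchmark, not something verified against this data:

CREATE INDEX ON projects (company_id, state) WHERE NOT is_template;  -- only indexes non-template rows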

You might also try increasing the statistics target on company_id and state, then ANALYZE the table to get better estimates.

And:

CREATE INDEX ON tasks (project_id, id);

Plus, increase the statistics target on tasks.project_id and ANALYZE.
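
Both of those statistics changes could look something like this (1000 is just a placeholder value; the default is 100):

ALTER TABLE projects ALTER COLUMN company_id SET STATISTICS 1000;
ALTER TABLE projects ALTER COLUMN state      SET STATISTICS 1000;
ALTER TABLE tasks    ALTER COLUMN project_id SET STATISTICS 1000;

ANALYZE projects;
ANALYZE tasks;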

In both cases, the multicolumn index can replace the existing one on just projects.company_id / tasks.project_id. Since all columns are integer, the size of the index would be about the same - except for the effect of index deduplication (added with Postgres 13), which shows strongly in your case for the highly duplicated tasks.project_id.
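
To see the deduplication effect and compare index sizes before dropping the old indexes, a quick look at the catalog views works (nothing here is specific to this schema beyond the table names):

SELECT relname, indexrelname,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM   pg_stat_user_indexes
WHERE  relname IN ('projects', 'tasks')
ORDER  BY relname, indexrelname;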

Finally, try this query with a direct join instead of the CTEs:

SELECT t.id
FROM   projects p
JOIN   tasks t ON t.project_id = p.id
WHERE  p.company_id = 11171
AND    p.state < 6
AND    p.is_template = FALSE;

The direct join should be faster.