PostgreSQL – Indexed varchar column very slow (40M+ rows)

index | postgresql | postgresql-performance

I've got an events table with over 40 million rows. It has a btree index on type: "index_events_on_type" btree (type).

Any equality lookup takes ages! I tried raising work_mem to get rid of the lossy heap blocks, but it had little to no effect.
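For reference, the work_mem bump I tried looked roughly like this (the value is only an example):

-- Session-level setting; '256MB' is an illustrative value, not a recommendation.
SET work_mem = '256MB';
-- ...re-run the query, then put it back:
RESET work_mem;

Here is the plan: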

dev=# EXPLAIN (ANALYZE, BUFFERS) SELECT "events".* FROM "events" WHERE "events"."type" = 'contact.card.added';
                                                                   QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on events  (cost=56176.20..1633783.13 rows=1614405 width=296) (actual time=252.846..13832.331 rows=1550851 loops=1)
   Recheck Cond: ((type)::text = 'contact.card.added'::text)
   Rows Removed by Index Recheck: 10414231
   Heap Blocks: exact=39335 lossy=758338
   Buffers: shared read=807020
   ->  Bitmap Index Scan on index_events_on_type  (cost=0.00..55772.60 rows=1614405 width=0) (actual time=243.096..243.096 rows=1550851 loops=1)
         Index Cond: ((type)::text = 'contact.card.added'::text)
         Buffers: shared read=9347
 Planning time: 0.100 ms
 Execution time: 13924.612 ms
(10 rows)

…and the table:

dev=# \d events
                                            Table "public.events"
      Column       |            Type             | Collation | Nullable |              Default
-------------------+-----------------------------+-----------+----------+------------------------------------
 id                | bigint                      |           | not null | nextval('events_id_seq'::regclass)
 type              | character varying           |           | not null |
 aggregate_root_id | uuid                        |           | not null |
 entity_id         | uuid                        |           |          |
 parent_id         | bigint                      |           |          |
 created_at        | timestamp without time zone |           | not null |
 body              | jsonb                       |           |          | '{}'::jsonb
Indexes:
    "events_pkey" PRIMARY KEY, btree (id)
    "index_events_on_aggregate_root_id" btree (aggregate_root_id)
    "index_events_on_created_at" btree (created_at)
    "index_events_on_entity_id" btree (entity_id)
    "index_events_on_parent_id" btree (parent_id)
    "index_events_on_type" btree (type)
Foreign-key constraints:
    "fk_rails_68f023eb25" FOREIGN KEY (parent_id) REFERENCES events(id)
Referenced by:
    TABLE "events" CONSTRAINT "fk_rails_68f023eb25" FOREIGN KEY (parent_id) REFERENCES events(id)

Any guidance would be appreciated.

Update

Looks like most of the time is spent on I/O, as the query returns 1.5M rows:

dev=# EXPLAIN (ANALYZE, BUFFERS) SELECT "events"."type" FROM "events" WHERE "events"."type" = 'contact.card.added';
                                                                  QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on events  (cost=6324.92..520827.82 rows=182239 width=32) (actual time=283.119..15719.744 rows=1550851 loops=1)
   Recheck Cond: ((type)::text = 'contact.card.added'::text)
   Rows Removed by Index Recheck: 10414212
   Heap Blocks: exact=39336 lossy=758337
   Buffers: shared hit=75 read=806945
   I/O Timings: read=13783.176
   ->  Bitmap Index Scan on index_events_on_type  (cost=0.00..6279.36 rows=182239 width=0) (actual time=272.302..272.302 rows=1550851 loops=1)
         Index Cond: ((type)::text = 'contact.card.added'::text)
         Buffers: shared hit=75 read=9272
         I/O Timings: read=67.302
 Planning time: 0.069 ms
 Execution time: 15807.701 ms
(12 rows)

Rows returned:

dev=# SELECT COUNT(*) FROM "events" WHERE "events"."type" = 'contact.card.added';
  count
---------
 1550851
(1 row)

Total rows:

dev=# SELECT COUNT(*) FROM "events";
  count
----------
 36447779
(1 row)

Best Answer

Do you really need to select every column? If not, then don't. It might be generating extra TOAST table lookups, or might be inhibiting index-only scans.
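For illustration, a sketch assuming you only need a couple of columns (the column list and the index name are hypothetical):

-- Fetch only what you need instead of SELECT *:
SELECT id, created_at
FROM events
WHERE type = 'contact.card.added';

-- On PostgreSQL 11+, a covering index can make an index-only scan possible
-- (it still depends on the visibility map being reasonably up to date):
CREATE INDEX index_events_on_type_covering
    ON events (type) INCLUDE (id, created_at);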

You are selecting 1.5 million rows. Indexes are awesome, but they are not magical. 1.5 million is a lot, especially when the rows are scattered randomly throughout a giant table (how big is it?).

You can almost certainly speed this up by clustering the table on the "index_events_on_type" index. But this is a resource-intensive operation which will probably require a maintenance window, and the table won't stay clustered on future inserts/updates. Alternatively, you could partition the table by "type". This has more or less the same effect as clustering (in this context!), but it stays partitioned in the face of future activity.
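Rough sketches of both options (object names and partition values are illustrative; test on a copy first):

-- Option 1: physically reorder the table by type. CLUSTER takes an
-- ACCESS EXCLUSIVE lock for the duration, hence the maintenance window.
CLUSTER events USING index_events_on_type;
ANALYZE events;

-- Option 2: declarative list partitioning (PostgreSQL 10+); you would create
-- a new partitioned table and migrate the data into it.
CREATE TABLE events_partitioned (
    id                bigint NOT NULL,
    type              varchar NOT NULL,
    aggregate_root_id uuid NOT NULL,
    entity_id         uuid,
    parent_id         bigint,
    created_at        timestamp NOT NULL,
    body              jsonb DEFAULT '{}'::jsonb
) PARTITION BY LIST (type);

CREATE TABLE events_contact_card_added
    PARTITION OF events_partitioned
    FOR VALUES IN ('contact.card.added');

-- A DEFAULT partition (PostgreSQL 11+) catches all other types:
CREATE TABLE events_other
    PARTITION OF events_partitioned DEFAULT;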

What version of PostgreSQL are you using? I would have expected you to get a parallel query here if it is a recent version (not that it would necessarily be effective, but I still would have expected one).
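You can check the version and the relevant settings like so (parallel query exists from 9.6 on):

SELECT version();
SHOW max_parallel_workers_per_gather;  -- 0 disables parallel plans
SHOW min_parallel_table_scan_size;     -- PostgreSQL 10+; the table must exceed this size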

Given that you are retrieving ~4% of the table, a seq scan might be more efficient than an index scan. You can SET enable_bitmapscan = off and see how that does.
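Roughly, for the current session only:

SET enable_bitmapscan = off;
-- The planner may fall back to a plain index scan; disable that too if you
-- want to see the seq scan:
SET enable_indexscan = off;
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM events WHERE type = 'contact.card.added';
RESET enable_bitmapscan;
RESET enable_indexscan;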