PostgreSQL Performance – Simple Query on 300M Row Table is Very Slow

amazon-rds, postgresql, postgresql-performance

Hardware/PostgreSQL version:

AWS RDS db.r4.xlarge (4vCPU, 30.5GB RAM, Provisioned IOPS (SSD) storage, 2500 IOPS)

PostgreSQL version 11.

Background:

I have an ever-growing table that I feel isn't anywhere near as performant as it should be.

With around 300M rows and growing by ~4M a month, the table looks as follows:

CREATE TABLE "my-slow-table" (
    proprietary_id text,
    date date,
    instance_id text,
    title text,
    type text,
    earnings numeric(19,6),
    date_paid date,
    report_type text,
    state text,
    user_type text,
    platform_type text
);

The text fields have no real standard format coming from the data source, so text is a frustrating but necessary generic type.

I have indexes on proprietary_id (the main identifier for a given row) and date_paid.
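Both were created as plain single-column indexes, along these lines (the date_paid index name matches the one in the plans below; the proprietary_id name is illustrative):

CREATE INDEX "my-slow-table-proprietary-id-idx" ON "my-slow-table" (proprietary_id);
CREATE INDEX "my-slow-table-date-paid-idx" ON "my-slow-table" (date_paid);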

A simple query like:

select sum(earnings) from "my-slow-table" where date_paid = '2020-04-01'

Takes over 7 minutes to run. Just about every query on this table, regardless of complexity (within reason), seems to take about this long. I am by no means a DB expert, but I have just enough experience to get myself into trouble with you all here by thinking "I've done what I should to make this thing faster". VACUUM ANALYZE has been run on this bad boy after any large insert/update/delete, as well as 15 minutes before posting this.
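For what it's worth, that's the plain form run directly against this table:

VACUUM ANALYZE "my-slow-table";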

Question:

What else could I try to speed things up? I know I could start partitioning, but from researching query times of much larger tables online this performance feels abysmal, and we don't come close to maxing out the resources on the RDS instance itself, so perhaps there's some Postgres configuration that needs to change to improve things?
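(For context, by partitioning I mean declarative range partitioning on date_paid. A minimal sketch with illustrative table names is below; on PostgreSQL 11 the existing rows would have to be copied into the new structure rather than converted in place.)

CREATE TABLE "my-slow-table-partitioned" (LIKE "my-slow-table") PARTITION BY RANGE (date_paid);

CREATE TABLE "my-slow-table-2020-08" PARTITION OF "my-slow-table-partitioned"
    FOR VALUES FROM ('2020-08-01') TO ('2020-09-01');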

Please forgive me if this is a silly question or has a simple answer – I've just exhausted my knowledge to this point. Happy to learn and looking for resources to expand my knowledge!

PS

Here's the long query plan output of EXPLAIN ANALYZE (which I need to get much better at fully understanding):

EDIT 1

Replaced the original JSON output with FORMAT TEXT:

"Finalize Aggregate  (cost=6872714.58..6872714.59 rows=1 width=32) (actual time=415295.495..415295.496 rows=1 loops=1)"
"  ->  Gather  (cost=6872714.36..6872714.57 rows=2 width=32) (actual time=415291.983..415296.893 rows=3 loops=1)"
"        Workers Planned: 2"
"        Workers Launched: 2"
"        ->  Partial Aggregate  (cost=6871714.36..6871714.37 rows=1 width=32) (actual time=415291.643..415291.644 rows=1 loops=3)"
"              ->  Parallel Seq Scan on "my-slow-table"  (cost=0.00..6860703.50 rows=4404341 width=4) (actual time=194857.517..413608.182 rows=3663864 loops=3)"
"                    Filter: (date_paid = '2020-08-01'::date)"
"                    Rows Removed by Filter: 68302664"
"Planning Time: 0.114 ms"
"Execution Time: 415296.963 ms"

EDIT 2

And here's the same plan after confirming both indexes on proprietary_id and date_paid exist, and after disabling sequential scans for the session:
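SET enable_seqscan = off;

which produces: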

"Finalize Aggregate  (cost=7170994.77..7170994.78 rows=1 width=32) (actual time=19354.251..19354.252 rows=1 loops=1)"
"  ->  Gather  (cost=7170994.55..7170994.76 rows=2 width=32) (actual time=19353.345..19357.306 rows=3 loops=1)"
"        Workers Planned: 2"
"        Workers Launched: 2"
"        ->  Partial Aggregate  (cost=7169994.55..7169994.56 rows=1 width=32) (actual time=19350.550..19350.551 rows=1 loops=3)"
"              ->  Parallel Bitmap Heap Scan on "my-slow-table"  (cost=197953.32..7158983.69 rows=4404341 width=4) (actual time=541.486..17691.885 rows=3663864 loops=3)"
"                    Recheck Cond: (date_paid = '2020-08-01'::date)"
"                    Rows Removed by Index Recheck: 579438"
"                    Heap Blocks: exact=19364 lossy=86080"
"                    ->  Bitmap Index Scan on "my-slow-table-date-paid-idx"  (cost=0.00..195310.71 rows=10570419 width=0) (actual time=529.688..529.689 rows=10991594 loops=1)"
"                          Index Cond: (date_paid = '2020-08-01'::date)"
"Planning Time: 0.121 ms"
"Execution Time: 19357.390 ms"

Best Answer

You have a sequential scan where 19 out of 20 rows are discarded by the filter. Clearly you are missing a simple index:

CREATE INDEX ON "my-slow-table" (date_paid);

Oh, you already have that index, and using it clearly improves execution time once you disable sequential scans!

Then the reason why PostgreSQL chooses to avoid the index scan must be that work_mem is set so low that an effective, non-degenerate bitmap containing one bit per table row won't fit. So it has to resort to a "lossy" bitmap, where a single bit stands for a whole page, which leads to extra work (every row on such a page must be rechecked against the condition) and makes the plan unattractive.
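A quick way to test that theory is to raise work_mem for your session only and re-run the query (the value below is just an example, and the query text is the one implied by your plan):

SET work_mem = '256MB';

EXPLAIN (ANALYZE, BUFFERS)
SELECT sum(earnings) FROM "my-slow-table" WHERE date_paid = '2020-08-01';

If the lossy heap blocks disappear and the runtime drops, make the change permanent; on RDS that means setting work_mem in your DB parameter group rather than in postgresql.conf.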

Increase work_mem, and the optimizer will be more willing to choose the index scan, which will be even faster. While you are at it, review your settings for random_page_cost and effective_cache_size to see whether they reflect the reality of your hardware, so that the optimizer's cost estimates reflect the actual costs better.
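You can try those per session as well before touching the parameter group; the values below are only plausible starting points for SSD storage and a 30.5 GB instance, not recommendations:

SHOW random_page_cost;
SHOW effective_cache_size;

SET random_page_cost = 1.1;         -- SSD/provisioned IOPS: random reads cost little more than sequential ones
SET effective_cache_size = '22GB';  -- roughly three quarters of the instance's RAM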