PostgreSQL extremely slow counts

amazon-rds | postgresql | postgresql-11 | query-performance

I'm experiencing extremely slow count(*) speeds. In a table with 5,845,897 rows, counting the number of rows takes around 2 minutes (113832.950 ms).

EXPLAIN ANALYZE SELECT COUNT(*) FROM campaigns;
                                  QUERY PLAN
--------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=1295067.02..1295067.03 rows=1 width=8) (actual time=113830.691..113830.691 rows=1 loops=1)
   ->  Gather  (cost=1295066.80..1295067.01 rows=2 width=8) (actual time=113830.603..113832.899 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate  (cost=1294066.80..1294066.81 rows=1 width=8) (actual time=113828.327..113828.328 rows=1 loops=3)
               ->  Parallel Seq Scan on campaigns  (cost=0.00..1287889.84 rows=2470784 width=0) (actual time=2.560..113139.782 rows=1948632 loops=3)
 Planning Time: 0.130 ms
 Execution Time: 113832.950 ms
(8 rows)

After vacuuming the table, I got the following results, but the query time remains unchanged.

SELECT relname AS TableName,
       n_live_tup AS LiveTuples,
       n_dead_tup AS DeadTuples,
       last_autovacuum AS Autovacuum,
       last_autoanalyze AS Autoanalyze
FROM pg_stat_user_tables;
          tablename          | livetuples | deadtuples |          autovacuum           |          autoanalyze
-----------------------------+------------+------------+-------------------------------+-------------------------------
 campaigns                   |    5848489 |      84122 | 2020-11-21 15:27:54.309192+00 | 2020-11-21 15:29:38.547147+00
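
For reference, the manual vacuum was a plain run along these lines (the exact options used are my assumption):

VACUUM (ANALYZE, VERBOSE) campaigns;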

I would expect this count to be fairly quick, even accounting for the size of the table.

The database is AWS RDS PostgreSQL 11.8, with 16 GB of RAM.
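
As an aside, when an exact figure isn't required, the planner's statistics already hold a near-instant estimate; a minimal sketch, assuming autovacuum/autoanalyze keeps them reasonably fresh:

-- estimated row count from table statistics (no table scan, approximate)
SELECT reltuples::bigint AS estimated_rows
FROM pg_class
WHERE relname = 'campaigns';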

Update 1

Machine class: db.m4.xlarge – vCPU: 4, ECU: 13, RAM: 16GB, Storage: General Purpose (SSD) with 440GB
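
The per-node "I/O Timings" lines below need track_io_timing turned on; a minimal sketch of enabling it (on RDS this is usually done through the DB parameter group, since changing it in a session requires elevated privileges):

-- session-level toggle; on RDS prefer the DB parameter group
SET track_io_timing = on;
SHOW track_io_timing;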

Repeating the query, this time with EXPLAIN (ANALYZE, BUFFERS) and I/O timings enabled:

EXPLAIN (ANALYZE, BUFFERS) SELECT COUNT(*) FROM campaigns;
                   QUERY PLAN
------------------------------------------------------------------------
 Finalize Aggregate  (cost=1311747.11..1311747.12 rows=1 width=8) (actual time=124550.322..124550.323 rows=1 loops=1)
   Buffers: shared hit=14 read=1279450 dirtied=449 written=167
   I/O Timings: read=364834.696 write=2.056
   ->  Gather  (cost=1311746.90..1311747.11 rows=2 width=8) (actual time=124550.218..124552.012 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         Buffers: shared hit=14 read=1279450 dirtied=449 written=167
         I/O Timings: read=364834.696 write=2.056
         ->  Partial Aggregate  (cost=1310746.90..1310746.91 rows=1 width=8) (actual time=124546.286..124546.286 rows=1 loops=3)
               Buffers: shared hit=14 read=1279450 dirtied=449 written=167
               I/O Timings: read=364834.696 write=2.056
               ->  Parallel Seq Scan on campaigns  (cost=0.00..1304490.32 rows=2502632 width=0) (actual time=0.298..123798.537 rows=1948646 loops=3)
                     Buffers: shared hit=14 read=1279450 dirtied=449 written=167
                     I/O Timings: read=364834.696 write=2.056
 Planning Time: 8.986 ms
 Execution Time: 124552.079 ms
(16 rows)

Update 2

After a couple of hours digging into this issue, I noticed that the system had depleted its IOPS burst credits. Running the query again with sufficient IOPS available gives better results: the query time came down from 124552.079 ms to 26939.749 ms, a ~78% improvement. Even so, 26 seconds for this query is quite a lot.

EXPLAIN (ANALYZE, BUFFERS) SELECT COUNT(*) FROM campaigns;
                      QUERY PLAN
------------------------------------------------------------------------
 Finalize Aggregate  (cost=1311793.21..1311793.22 rows=1 width=8) (actual time=26938.554..26939.649 rows=1 loops=1)
   Buffers: shared hit=20 read=1279489 dirtied=11
   I/O Timings: read=77031.806
   ->  Gather  (cost=1311793.00..1311793.21 rows=2 width=8) (actual time=26938.465..26939.642 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         Buffers: shared hit=20 read=1279489 dirtied=11
         I/O Timings: read=77031.806
         ->  Partial Aggregate  (cost=1310793.00..1310793.01 rows=1 width=8) (actual time=26936.559..26936.560 rows=1 loops=3)
               Buffers: shared hit=20 read=1279489 dirtied=11
               I/O Timings: read=77031.806
               ->  Parallel Seq Scan on campaigns  (cost=0.00..1304536.20 rows=2502720 width=0) (actual time=0.310..26764.411 rows=1948667 loops=3)
                     Buffers: shared hit=20 read=1279489 dirtied=11
                     I/O Timings: read=77031.806
 Planning Time: 3.261 ms
 Execution Time: 26939.749 ms
(16 rows)

Would increasing work_mem make any difference here? I'm thinking that the index or table metadata doesn't fit in one of the buffers, forcing the system to read from disk. However, I don't see any of the usual indicators that work_mem is the problem, such as an "external merge Disk" message, and from what I understand work_mem is used as working space for sorts and joins, neither of which this query needs.
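
As a sanity check of that reasoning (my own experiment, not something suggested by the plans above), work_mem can be raised for a single session and the query re-run; if the timing doesn't move, work_mem is not the limiting factor:

SHOW work_mem;
SET work_mem = '256MB';  -- affects only this session
EXPLAIN (ANALYZE, BUFFERS) SELECT COUNT(*) FROM campaigns;
RESET work_mem;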

Update 3

After cloning the production database with exactly the same machine type and settings, I found that it performs as one would expect, with significantly lower query times.

The following is a cold query; the server was rebooted and no queries from other clients were run before this one:

EXPLAIN (ANALYZE, BUFFERS) SELECT COUNT(*) FROM campaigns;

                     QUERY PLAN
---------------------------------------------------------
 Finalize Aggregate  (cost=1311857.75..1311857.76 rows=1 width=8) (actual time=1952.143..1953.241 rows=1 loops=1)
   Buffers: shared read=1279572
   I/O Timings: read=2462.697
   ->  Gather  (cost=1311857.54..1311857.75 rows=2 width=8) (actual time=1952.061..1953.232 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         Buffers: shared read=1279572
         I/O Timings: read=2462.697
         ->  Partial Aggregate  (cost=1310857.54..1310857.55 rows=1 width=8) (actual time=1949.575..1949.576 rows=1 loops=3)
               Buffers: shared read=1279572
               I/O Timings: read=2462.697
               ->  Parallel Seq Scan on campaigns  (cost=0.00..1304600.43 rows=2502843 width=0) (actual time=0.011..1776.895 rows=1948706 loops=3)
                     Buffers: shared read=1279572
                     I/O Timings: read=2462.697
 Planning Time: 0.646 ms
 Execution Time: 1953.348 ms
(16 rows)

Re-running the query a second time gives a minor speed improvement; the planning time drops dramatically, but otherwise the plan is the same.

EXPLAIN (ANALYZE, BUFFERS) SELECT COUNT(*) FROM campaigns;

                QUERY PLAN
-------------------------------------------------------------------------
 Finalize Aggregate  (cost=1311857.75..1311857.76 rows=1 width=8) (actual time=1941.837..1942.902 rows=1 loops=1)
   Buffers: shared hit=96 read=1279476
   I/O Timings: read=2449.609
   ->  Gather  (cost=1311857.54..1311857.75 rows=2 width=8) (actual time=1941.757..1942.894 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         Buffers: shared hit=96 read=1279476
         I/O Timings: read=2449.609
         ->  Partial Aggregate  (cost=1310857.54..1310857.55 rows=1 width=8) (actual time=1939.694..1939.695 rows=1 loops=3)
               Buffers: shared hit=96 read=1279476
               I/O Timings: read=2449.609
               ->  Parallel Seq Scan on campaigns  (cost=0.00..1304600.43 rows=2502843 width=0) (actual time=0.016..1768.031 rows=1948706 loops=3)
                     Buffers: shared hit=96 read=1279476
                     I/O Timings: read=2449.609
 Planning Time: 0.082 ms
 Execution Time: 1942.933 ms
(16 rows)

Next, I rebooted the database and disabled the parallel workers. We can see an impact on performance, meaning that parallelism was improving the query speed, but even without it this machine is orders of magnitude faster than the production machine.

SET max_parallel_workers_per_gather = 0;
EXPLAIN (ANALYZE, BUFFERS) SELECT COUNT(*) FROM campaigns;

                    QUERY PLAN
----------------------------------------------------------------------
 Aggregate  (cost=1354657.29..1354657.30 rows=1 width=8) (actual time=3798.482..3798.483 rows=1 loops=1)
   Buffers: shared read=1279572
   I/O Timings: read=1662.194
   ->  Seq Scan on campaigns  (cost=0.00..1339640.23 rows=6006823 width=0) (actual time=0.008..3422.269 rows=5846117 loops=1)
         Buffers: shared read=1279572
         I/O Timings: read=1662.194
 Planning Time: 0.603 ms
 Execution Time: 3798.566 ms
(8 rows)

Regarding the table "campaigns", it's quite large:

                                       Table "public.campaigns"
         Column          |            Type             | Collation | Nullable |        Default
-------------------------+-----------------------------+-----------+----------+-----------------------
 id                      | uuid                        |           | not null | uuid_generate_v4()
 template_id             | uuid                        |           |          |
 organization_id         | uuid                        |           |          |
 certificate_id          | uuid                        |           |          |
 pass_type_id            | uuid                        |           |          |
 source                  | character varying(255)      |           |          |
 observations            | character varying(255)      |           |          |
 slug                    | character varying(255)      |           | not null |
 name                    | character varying(255)      |           | not null |
 pass_data               | json                        |           |          |
 deleted_at              | timestamp without time zone |           |          |
 source_version          | integer                     |           |          | 1
 code_type               | integer                     |           | not null |
 code_fixed_value        | character varying(255)      |           |          |
 certificate_type        | integer                     |           |          |
 allow_multiple_redeems  | boolean                     |           |          | false
 facebook_manager_id     | uuid                        |           |          |
 current_distribution_id | uuid                        |           |          |
 created_at              | timestamp without time zone |           |          |
 updated_at              | timestamp without time zone |           |          |
 poster                  | character varying(255)      |           |          |
 og                      | boolean                     |           |          | true
 og_description          | character varying           |           |          | ''::character varying
 javascript              | text                        |           |          | ''::text
 stylesheet              | text                        |           |          | ''::text

Indexes:
    "campaigns_pkey" PRIMARY KEY, btree (id)
    "index_campaigns_on_slug" UNIQUE, btree (slug)
    "index_campaigns_on_certificate_id" btree (certificate_id)
    "index_campaigns_on_code_type" btree (code_type)
    "index_campaigns_on_created_at" btree (created_at)
    "index_campaigns_on_deleted_at" btree (deleted_at)
    "index_campaigns_on_name" btree (name)
    "index_campaigns_on_organization_id" btree (organization_id)
    "index_campaigns_on_template_id" btree (template_id)

I imagine that the shared buffers get evicted often, because production runs lots of queries against other tables of similar size.

Update 5

I'm feeling that the culprit here is the size of the tuples and the amount of memory that I have.

Right now the machine has 16 GB of RAM, with ~4 GB reserved for shared_buffers (shared_buffers = 25% of system RAM).

Looking at the amount of memory we have reserved, we can see that it is in fact ~4 GB:

=> show shared_buffers;
 shared_buffers
----------------
 3979968kB

Now let's look at how the shared_buffers are being used (this query relies on the pg_buffercache extension):

SELECT
  c.relname,
  pg_size_pretty(count(*) * 8192) as buffered,
  round(100.0 * count(*) /
    (SELECT setting FROM pg_settings
      WHERE name='shared_buffers')::integer,1)
    AS buffers_percent,
  round(100.0 * count(*) * 8192 /
    pg_table_size(c.oid),1)
    AS percent_of_relation
FROM pg_class c
  INNER JOIN pg_buffercache b
    ON b.relfilenode = c.relfilenode
  INNER JOIN pg_database d
    ON (b.reldatabase = d.oid AND d.datname = current_database())
GROUP BY c.oid,c.relname
ORDER BY 3 DESC
LIMIT 10;
            relname             |  buffered  | buffers_percent | percent_of_relation
--------------------------------+------------+-----------------+---------------------
 cars                           | 3683 MB    |            94.8 |                30.4
 registrations                  | 164 MB     |             4.2 |                52.6
 index_registrations_on_pass_id | 33 MB      |             0.9 |                89.7
 pg_amop                        | 64 kB      |             0.0 |                66.7
 pg_cast                        | 8192 bytes |             0.0 |                16.7
 pg_index                       | 32 kB      |             0.0 |                30.8
 pg_attrdef                     | 16 kB      |             0.0 |                16.7
 pg_opclass                     | 24 kB      |             0.0 |                42.9
 pg_operator                    | 88 kB      |             0.0 |                57.9
 pg_namespace                   | 8192 bytes |             0.0 |                20.0
(10 rows)

We can see that a table called "cars" fills 94.8% of our shared_buffers space, yet only 30.4% of its data is loaded into memory. There is no "campaigns" data among the top ten cached relations.

Now, running the problematic query again:

=> EXPLAIN (ANALYZE, BUFFERS) SELECT COUNT(*) FROM campaigns;
              QUERY PLAN
-------------------------------------------------------------
 Finalize Aggregate  (cost=1312445.79..1312445.80 rows=1 width=8) (actual time=30856.927..30858.016 rows=1 loops=1)
   Buffers: shared hit=14 read=1280132 dirtied=1 written=1
   I/O Timings: read=88505.317 write=0.026
   ->  Gather  (cost=1312445.57..1312445.78 rows=2 width=8) (actual time=30856.849..30858.008 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         Buffers: shared hit=14 read=1280132 dirtied=1 written=1
         I/O Timings: read=88505.317 write=0.026
         ->  Partial Aggregate  (cost=1311445.57..1311445.58 rows=1 width=8) (actual time=30853.057..30853.058 rows=1 loops=3)
               Buffers: shared hit=14 read=1280132 dirtied=1 written=1
               I/O Timings: read=88505.317 write=0.026
               ->  Parallel Seq Scan on campaigns  (cost=0.00..1305185.66 rows=2503966 width=0) (actual time=1.649..30669.124 rows=1949005 loops=3)
                     Buffers: shared hit=14 read=1280132 dirtied=1 written=1
                     I/O Timings: read=88505.317 write=0.026
 Planning Time: 54.656 ms
 Execution Time: 30858.072 ms
(16 rows)

We can see that Postgres read 1,280,132 blocks from disk (read=1280132), spending ~88.5 seconds in read I/O across the workers (I/O Timings: read=88505.317), and hit only 14 blocks (hit=14) in shared_buffers. This means the query read roughly 10 GB (1,280,132 × 8 kB) from disk.
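
A quick way to turn the block counts from the Buffers lines into sizes (the figure below is just the arithmetic from the plan above, not a new measurement):

-- 1,280,132 blocks read from disk at 8 kB each
SELECT pg_size_pretty(1280132::bigint * 8192);  -- roughly 10 GB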

Looking at how the shared_buffers cache looks after the query:

            relname             |  buffered  | buffers_percent | percent_of_relation
--------------------------------+------------+-----------------+---------------------
 cars                           | 1952 MB    |            50.2 |                16.1
 campaigns                      | 1718 MB    |            44.2 |                17.2
 registrations                  | 176 MB     |             4.5 |                56.6
 index_registrations_on_pass_id | 34 MB      |             0.9 |                91.2
 index_distributions_on_slug    | 3344 kB    |             0.1 |                 1.1
 pg_opclass                     | 16 kB      |             0.0 |                28.6
 pg_index                       | 32 kB      |             0.0 |                30.8
 pg_amop                        | 8192 bytes |             0.0 |                 8.3
 pg_amproc_fam_proc_index       | 24 kB      |             0.0 |                60.0
 pg_attrdef                     | 16 kB      |             0.0 |                16.7
(10 rows)

We can see that Postgres cached 1718 MB of campaigns data from the last query we ran, but to do so it evicted 1731 MB of cached cars data.

Now re-running the query:

=> EXPLAIN (ANALYZE, BUFFERS) SELECT COUNT(*) FROM campaigns;
                        QUERY PLAN
-----------------------------------------------------------------------------
 Finalize Aggregate  (cost=1312446.81..1312446.82 rows=1 width=8) (actual time=62265.219..62266.596 rows=1 loops=1)
   Buffers: shared hit=440307 read=839840 dirtied=2
   I/O Timings: read=183038.938
   ->  Gather  (cost=1312446.59..1312446.80 rows=2 width=8) (actual time=62265.116..62266.586 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         Buffers: shared hit=440307 read=839840 dirtied=2
         I/O Timings: read=183038.938
         ->  Partial Aggregate  (cost=1311446.59..1311446.60 rows=1 width=8) (actual time=62262.215..62262.217 rows=1 loops=3)
               Buffers: shared hit=440307 read=839840 dirtied=2
               I/O Timings: read=183038.938
               ->  Parallel Seq Scan on campaigns  (cost=0.00..1305186.68 rows=2503968 width=0) (actual time=0.399..62075.601 rows=1949005 loops=3)
                     Buffers: shared hit=440307 read=839840 dirtied=2
                     I/O Timings: read=183038.938
 Planning Time: 0.082 ms
 Execution Time: 62266.644 ms
(16 rows)

We can see that Postgres served 440,307 blocks (hit=440307, ~3.4 GB) from the shared_buffers cache and read 839,840 blocks (read=839840, ~6.4 GB) from disk, spending ~183 seconds in read I/O across the workers (I/O Timings: read=183038.938).

We can see that after the new run, the campaigns cache grew at the expense of the cars cache.

                    relname                    |  buffered  | buffers_percent | percent_of_relation
-----------------------------------------------+------------+-----------------+---------------------
 campaigns                                     | 2294 MB    |            59.0 |                22.9
 cars                                          | 1205 MB    |            31.0 |                 9.9
 registrations                                 | 311 MB     |             8.0 |               100.0
 index_registrations_on_pass_id                | 33 MB      |             0.9 |                89.3
 index_cars_on_campaign_id_and_serial_number   | 29 MB      |             0.7 |                 3.1
 index_distributions_on_slug                   | 3440 kB    |             0.1 |                 1.2
 pg_aggregate_fnoid_index                      | 16 kB      |             0.0 |               100.0
 pg_opclass                                    | 8192 bytes |             0.0 |                14.3
 pg_amproc                                     | 8192 bytes |             0.0 |                11.1
 pg_constraint_conrelid_contypid_conname_index | 8192 bytes |             0.0 |                50.0
(10 rows)

Running the campaigns count query once again:

=> EXPLAIN (ANALYZE, BUFFERS) SELECT COUNT(*) FROM campaigns;
                           QUERY PLAN
------------------------------------------------------------------------
 Finalize Aggregate  (cost=1312448.86..1312448.87 rows=1 width=8) (actual time=38326.226..38327.412 rows=1 loops=1)
   Buffers: shared hit=63516 read=1216633 dirtied=2 written=1
   I/O Timings: read=110828.341 write=0.020
   ->  Gather  (cost=1312448.65..1312448.86 rows=2 width=8) (actual time=38326.118..38327.403 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         Buffers: shared hit=63516 read=1216633 dirtied=2 written=1
         I/O Timings: read=110828.341 write=0.020
         ->  Partial Aggregate  (cost=1311448.65..1311448.66 rows=1 width=8) (actual time=38323.817..38323.819 rows=1 loops=3)
               Buffers: shared hit=63516 read=1216633 dirtied=2 written=1
               I/O Timings: read=110828.341 write=0.020
               ->  Parallel Seq Scan on campaigns  (cost=0.00..1305188.72 rows=2503972 width=0) (actual time=0.290..38135.617 rows=1949007 loops=3)
                     Buffers: shared hit=63516 read=1216633 dirtied=2 written=1
                     I/O Timings: read=110828.341 write=0.020
 Planning Time: 0.084 ms
 Execution Time: 38327.450 ms
(16 rows)

This shows us two things: we are limited by the amount of RAM currently available for shared_buffers, and by the size of the tuples stored in the table.

I made the decision to increase the amount of RAM from 16 GB to 32 GB, so that shared_buffers could grow from 4 GB to 8 GB. This also increases the amount of RAM available to the OS, which helps Postgres since it relies heavily on the OS page cache to be performant.

Running the query once again, we see a huge speed increase even with a high shared_buffers miss rate, which suggests the reads are probably being served from the OS page cache. Right now I don't know how to check for that directly, but it's a topic for the next iteration.

=> EXPLAIN (ANALYZE, BUFFERS) SELECT COUNT(*) FROM campaigns;
                  QUERY PLAN
-------------------------------------------
 Finalize Aggregate  (cost=1312450.91..1312450.92 rows=1 width=8) (actual time=2050.114..2051.202 rows=1 loops=1)
   Buffers: shared hit=217 read=1279934
   I/O Timings: read=2475.790
   ->  Gather  (cost=1312450.69..1312450.90 rows=2 width=8) (actual time=2050.032..2051.193 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         Buffers: shared hit=217 read=1279934
         I/O Timings: read=2475.790
         ->  Partial Aggregate  (cost=1311450.69..1311450.70 rows=1 width=8) (actual time=2047.558..2047.559 rows=1 loops=3)
               Buffers: shared hit=217 read=1279934
               I/O Timings: read=2475.790
               ->  Parallel Seq Scan on campaigns  (cost=0.00..1305190.75 rows=2503975 width=0) (actual time=0.018..1875.699 rows=1949009 loops=3)
                     Buffers: shared hit=217 read=1279934
                     I/O Timings: read=2475.790
 Planning Time: 0.083 ms
 Execution Time: 2051.239 ms
(16 rows)

The shared_buffers contents after this run:

             relname             |  buffered  | buffers_percent | percent_of_relation
---------------------------------+------------+-----------------+---------------------
cars                            | 5679 MB    |            72.1 |                46.8
cars_pkey                       | 428 MB     |             5.4 |                99.9
registrations                   | 311 MB     |             3.9 |               100.0
index_cars_on_distribution_id   | 138 MB     |             1.8 |                12.9
pg_amproc                       | 40 kB      |             0.0 |                55.6
pg_attrdef                      | 64 kB      |             0.0 |                66.7
pg_aggregate                    | 8192 bytes |             0.0 |                16.7
pg_index                        | 72 kB      |             0.0 |                69.2
pg_amop                         | 64 kB      |             0.0 |                66.7
pg_constraint                   | 16 kB      |             0.0 |                28.6
(10 rows)
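
One rough way to judge whether those reads came from the OS page cache rather than EBS (my own heuristic, not a definitive check) is to divide the reported I/O read time by the number of blocks read:

-- ~2475.790 ms of read time over 1,279,934 block reads
SELECT round(2475.790 / 1279934 * 1000, 2) AS microseconds_per_block;
-- ≈ 1.9 µs per block: far too fast for EBS, consistent with the OS page cache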

Increasing the machine memory from 16 GB to 32 GB helped a lot, but meanwhile this whole ordeal exposed a bigger issue: the tuples are too big and need to be optimized, meaning the campaigns schema needs to be rethought.

Leaving some pages that helped me:

Best Answer

Essentially all the time is spent waiting for data to be read from disk. Indeed, 3 times as much time is spent waiting as the total time, because you have 3 processes to do the waiting simultaneously. You might want to turn off parallelization, as it is unlikely to actually be helpful here. It might actually be harmful, and it at least makes the plans harder to interpret.

Since you are reading about 10GB of data but have 16GB of RAM, you might hope that it would stay in cache and so be faster in the future. Have you tried running this repeatedly back to back to see if it does get faster? Is this the only large table you have or is it just one of several? Maybe you could try pg_prewarm on it.
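
A minimal sketch of the pg_prewarm suggestion (assuming the extension is available on the instance):

CREATE EXTENSION IF NOT EXISTS pg_prewarm;
SELECT pg_prewarm('campaigns');  -- loads the heap into shared_buffers, returns blocks read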

> now running the query with the sufficient IOPS we get better results, query time came down from 124552.079 ms to 26939.749 ms, a ~78% performance increase, but even so, 26 seconds for the query to run is quite a lot.

I don't know what you mean by 'sufficient IOPS'. If you just mean you have enough credits to get the burst rate of 3000 IOPS (rather than the base rate of 3*440 = 1320) you would expect 1279489/3000 = 426 seconds, which is way slower than you are actually seeing. So maybe most of your data is cached in RAM, but the part that is not cached is still enough to dominate your run time. But anyway, essentially all your time is still spent waiting for IO, so the IOPS is empirically not sufficient for something else to have become the bottleneck.

Your tuples seem to average 1792 bytes each (1279489*8k/5845897). This is very large, and doesn't even include the size which has been pushed off to TOAST. Is this size sensible given your data? Maybe you have a lot of bloat and your tuples aren't really this big. Given that large size, if the table is well vacuumed then doing an index-only scan should be able to count the tuples with far less IO. Do you have a qualifying index? If you vacuum the table and set enable_seqscan to off, what execution plan do you get? You could encourage index-only-scan by lowering random_page_cost to be close to seq_page_cost, which is probably a reasonable thing to do anyway if you are on SSD.
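
A sketch of that experiment (session-local settings only; the exact random_page_cost value is a guess suited to SSD storage):

VACUUM (VERBOSE) campaigns;                 -- keep the visibility map up to date
SET enable_seqscan = off;                   -- push the planner away from the seq scan
EXPLAIN (ANALYZE, BUFFERS) SELECT COUNT(*) FROM campaigns;
RESET enable_seqscan;

-- a gentler, more general nudge on SSD-backed storage:
SET random_page_cost = 1.1;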

But this doesn't seem like the type of query you need to run all that often. Why do you care much about its performance? Are you taking it as a bellwether for performance in general? If so, that is probably a mistake.