Vacuum on large table sometimes takes very long

Tags: postgresql, postgresql-performance, vacuum

On a PostgreSQL 11 database server I have a large table (932 million rows, 2150 GB, not partitioned) into which I insert about 2-3 million rows every night with a batch job. The table was created with autovacuum_enabled=FALSE. After the insert I perform a VACUUM (ANALYZE) on the table, which usually takes about 30-40 seconds. However, once every few weeks this vacuum takes very long, such as today, when it took 2 hours and 11 minutes.
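For reference, here is a stripped-down sketch of the setup and the nightly step. The real table is analytics.a_box_events, as seen in the log output further down; the columns shown here are made up purely for illustration.

-- Hypothetical, heavily simplified version of the real table:
CREATE TABLE analytics.a_box_events (
    event_id   bigint      NOT NULL,
    node_id    bigint      NOT NULL,
    created_at timestamptz NOT NULL,
    payload    jsonb
) WITH (autovacuum_enabled = false);

-- Nightly batch job: bulk insert of 2-3 million rows, then:
VACUUM (ANALYZE) analytics.a_box_events;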

I am now considering running a vacuum ahead of my nightly job so as to avoid these long run times during the job itself. I had a look at the options described in the PostgreSQL 11 vacuum documentation page. I know that a VACUUM (FULL) is very expensive in terms of runtime and is therefore probably out of the question. Would a VACUUM (FREEZE) be useful in my case? The other option, VACUUM (DISABLE_PAGE_SKIPPING), does not seem relevant to my needs.
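What I have in mind is something along these lines, run in the evening before the nightly job (just a sketch, not in production yet):

-- Candidate for a pre-emptive run ahead of the nightly batch job:
VACUUM (FREEZE) analytics.a_box_events;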


Edit (2021-09-27): Looking back at the past 270 days (2021-01-01 to 2021-09-27), the vacuum took 16-60 seconds on most days. On 9 days, however, it exceeded 1000 seconds, and on 4 of those it even exceeded 3500 seconds (see graph below). What is interesting is the gap between these outliers, which lies between 25 and 35 days. It looks as if something builds up over those 25-35 days and then necessitates a longer vacuum. Quite possibly this is the transaction ID wraparound mentioned in the comments below.

VACUUM performance graph
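One way to test this wraparound theory would be to watch the age of the table's relfrozenxid: a plain VACUUM switches to an aggressive, whole-table run once that age exceeds vacuum_freeze_table_age. A catalog query along these lines (nothing specific to our setup) shows how close the table is to the thresholds:

SELECT c.oid::regclass                                        AS table_name,
       age(c.relfrozenxid)                                    AS xid_age,
       current_setting('vacuum_freeze_table_age')::bigint     AS freeze_table_age,
       current_setting('autovacuum_freeze_max_age')::bigint   AS freeze_max_age
FROM   pg_class c
WHERE  c.oid = 'analytics.a_box_events'::regclass;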


Edit (2022-02-08): In September 2021 we modified our script by changing:

VACUUM (ANALYZE)

to:

ANALYZE (VERBOSE)
VACUUM (FREEZE, VERBOSE)

Unfortunately, the roughly once-monthly long runs of this job step still occur. However, by splitting the work into separate analyze and vacuum steps, and thanks to the verbose output, we are now much better able to tell what is going on.

On a fast day the output of the analyze and vacuum steps looks like this:

INFO:  analyzing "analytics.a_box_events"
INFO:  "a_box_events": scanned 30000 of 152881380 pages, containing 264294 live rows and 0 dead rows; 30000 rows in sample, 1346854382 estimated total rows
ANALYZE
INFO:  aggressively vacuuming "analytics.a_box_events"
INFO:  "a_box_events": found 0 removable, 3407161 nonremovable row versions in 406302 out of 152881380 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 108425108
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins, 152475078 frozen pages.
0 pages are entirely empty.
CPU: user: 4.53 s, system: 3.76 s, elapsed: 25.77 s.
INFO:  aggressively vacuuming "pg_toast.pg_toast_1361926"
INFO:  "pg_toast_1361926": found 0 removable, 3700 nonremovable row versions in 680 out of 465143 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 108425110
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins, 464463 frozen pages.
0 pages are entirely empty.
CPU: user: 0.01 s, system: 0.00 s, elapsed: 0.05 s.
VACUUM

The analyze has no timing output, and the vacuum on the main table takes around 20-40 seconds. Analyze and vacuum together usually finish in under a minute.

Once every 35-45 days, the vacuum takes much longer, as several minutes are spent vacuuming each of the table's 39 indexes. There are two variations to this:

  1. No index row versions to be removed

Example output:

INFO:  index "idx_a_box_events_node_id" now contains 1347181817 row versions in 4038010 pages
DETAIL:  0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU: user: 5.44 s, system: 13.45 s, elapsed: 230.59 s.

The entire vacuum's final output then looks like this:

INFO:  "a_box_events": found 0 removable, 2837554 nonremovable row versions in 340887 out of 153111143 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 108429069
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins, 152770256 frozen pages.
0 pages are entirely empty.
CPU: user: 669.94 s, system: 870.59 s, elapsed: 10501.17 s.
  2. Index row versions to be removed

Example output:

INFO:  scanned index "idx_a_box_events_node_id" to remove 2524 row versions
DETAIL:  CPU: user: 49.34 s, system: 11.42 s, elapsed: 198.63 s

and:

INFO:  index "idx_a_box_events_node_id" now contains 1228052362 row versions in 3478524 pages
DETAIL:  2524 index row versions were removed.
6 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.

The entire vacuum's final output then looks like this:

INFO:  "a_box_events": found 56 removable, 3851482 nonremovable row versions in 461834 out of 139126225 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 97583006
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins, 138664391 frozen pages.
0 pages are entirely empty.
CPU: user: 2367.93 s, system: 761.34 s, elapsed: 8138.76 s.
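With 39 indexes on a table this size, most of the elapsed time in these slow runs goes into the per-index scans. Their number and sizes can be listed with a standard catalog query, e.g.:

SELECT i.indexrelid::regclass                         AS index_name,
       pg_size_pretty(pg_relation_size(i.indexrelid)) AS index_size
FROM   pg_index i
WHERE  i.indrelid = 'analytics.a_box_events'::regclass
ORDER  BY pg_relation_size(i.indexrelid) DESC;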

Such long run times are currently more of a nuisance than a show-stopper. An upgrade to the latest Postgres server version is planned for late 2023. At that point we will try the INDEX_CLEANUP option of VACUUM, doing the full index cleanup only once every weekend in order to avoid the costly cleanup of the indexes in the daily VACUUM. Until then we apparently have no choice but to be patient.
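The rough idea for after the upgrade, sketched here for completeness (untested so far; INDEX_CLEANUP is only available from PostgreSQL 12 on):

-- Nightly, after the load: skip the per-index scans entirely.
VACUUM (FREEZE, VERBOSE, INDEX_CLEANUP OFF) analytics.a_box_events;

-- Once every weekend: allow the full index cleanup.
VACUUM (FREEZE, VERBOSE, INDEX_CLEANUP ON) analytics.a_box_events;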

Edit (2022-08-24): We have, in the meantime, partitioned our large table into monthly slices, currently giving us 56 partitions with about 75-90 GB of table data and 45-55 GB of indexes per partition. Once a night new data is added to the current month's partition, followed by an ANALYZE (VERBOSE); we no longer do a daily VACUUM. On the first Sunday of each month we additionally do a VACUUM (FULL, ANALYZE) of the previous month's partition.
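For the record, the partitioned layout looks roughly like this (column names, the partition key and the partition names are illustrative only):

CREATE TABLE analytics.a_box_events (
    event_id   bigint      NOT NULL,
    node_id    bigint      NOT NULL,
    created_at timestamptz NOT NULL,
    payload    jsonb
) PARTITION BY RANGE (created_at);

CREATE TABLE analytics.a_box_events_2022_08
    PARTITION OF analytics.a_box_events
    FOR VALUES FROM ('2022-08-01') TO ('2022-09-01');

-- Nightly: INSERT into the current month's partition, then:
ANALYZE (VERBOSE) analytics.a_box_events_2022_08;

-- First Sunday of each month, on the previous month's partition:
VACUUM (FULL, ANALYZE) analytics.a_box_events_2022_07;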

As expected, the daily ANALYZE (VERBOSE) on a single partition is quite fast, averaging around 16-18 seconds per run. It starts at nearly 0 seconds on the first day of a month and grows to a maximum toward the end of the month, after which the nightly INSERT and ANALYZE move on to a new partition and become fast again:
ANALYZE performance graph

Best Answer

My theory on what is happening here is that for an INSERT-only table, vacuum only needs to visit the parts of the table dirtied by the inserts, and can skip the indexes altogether.

But if it finds even one dead tuple (in v11), then it needs to scan the entirety of all the indexes, and that can take a long time. I'm surprised it takes over 2 hours, but that is a mighty big table, so the indexes will be mighty big as well. Successful INSERTs and COPYs don't generate any dead tuples, but even one UPDATE or DELETE does, and so does any INSERT or COPY that rolled back. Does this theory seem compatible with what you are seeing? Do you ever DELETE or UPDATE even one tuple, or ever have an INSERT that fails?
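You can check whether anything has left dead tuples behind before the vacuum runs, for example via pg_stat_user_tables (the counts are estimates from the statistics collector, but good enough to spot the pattern):

SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM   pg_stat_user_tables
WHERE  schemaname = 'analytics'
  AND  relname = 'a_box_events';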

You could run a VACUUM at 9pm; then, if it does have to scan all the indexes, the one you run after the bulk load won't have to do it again--unless the bulk load itself is the thing which generated the dead tuples.

But what if you instead just run ANALYZE, then VACUUM, separately and in that order, after the bulk load? Is the problem that a long VACUUM uses too much IO when people start work and slows things down, or that a long VACUUM means the ANALYZE has not yet finished, so you get bad plans at the start of the work day?

Another problem is that if you are not doing anything to freeze parts of the table, then once an anti-wraparound VACUUM does kick in, you could be in for a very bad time. The first VACUUM FREEZE you do will generate an enormous amount of IO and take a long time, but after that it will only have to freeze small parts of the table at a time, which should be manageable. The nice thing about VACUUM FREEZE, unlike the index-scanning part of a VACUUM, is that if it is interrupted it doesn't lose all the work done so far, because pages recorded as frozen in the visibility map can be skipped next time. So you can start a VACUUM FREEZE and just cancel it if it seems to be causing problems, or if you are leaving your desk and won't be around to monitor it for a while. Then, once you are caught up, you can change the morning post-bulk-load VACUUM to a VACUUM FREEZE.
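If you want to see how much of the table is already frozen (and therefore skippable by the next freeze run), the pg_visibility contrib extension can summarize the visibility map, e.g.:

-- Requires: CREATE EXTENSION pg_visibility;
SELECT all_visible, all_frozen
FROM   pg_visibility_map_summary('analytics.a_box_events');
-- Compare all_frozen against relpages in pg_class to get the frozen fraction.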