Postgresql – Long running VACUUM VERBSE in Postgres 9.5 – how much time remaining

postgresqlvacuum

I am vacuuming a very large (718GB) table containing a lot of dead tuples and 4 indexes.

VACUUM ANALYSE VERBOSE bigtable;

This vacuum has been running for over 11 hours (!) now and I am trying to estimate the remaining time for it to complete, or, at least understand in which "phase" the vacuum process is: scanning heap? cleaning indexes? vacuuming heap? I am referring to the phases described in Vacuum Progress Reporting, introduced in 9.6.

Since I am running Postgres 9.5, I do not have access to this nifty feature. I am trying to make sense of the "verbose" vacuum messages, which look like this:

INFO:  scanned index "bigtable_timestamp" to remove 11184548 row versions
DETAIL:  CPU 77.05s/336.97u sec elapsed 745.75 sec
INFO:  scanned index "bigtable_trip_id" to remove 11184548 row versions
DETAIL:  CPU 87.86s/371.96u sec elapsed 846.21 sec
INFO:  "bigtable": removed 11184548 row versions in 157984 pages
DETAIL:  CPU 3.50s/1.93u sec elapsed 22.42 sec
INFO:  scanned index "bigtable_pkey" to remove 11184526 row versions
DETAIL:  CPU 72.27s/330.03u sec elapsed 704.25 sec
INFO:  scanned index "bigtable_step_id" to remove 11184526 row versions
DETAIL:  CPU 90.21s/370.86u sec elapsed 850.30 sec

The above sequence of messages has repeated itself for the past 11 hours, with no end in sight.

Is there any way to tell, based on those messages, in which phase the vaccum is now?

I've also tried to look at the bytes-read of the vacuum process on Linux, as suggested in this answer, and compare it to the total_bytes of the table and indexes, but this didn't work for me. The process has already read 5.1TB (5101054656512) from the disk although the table including all indexes is estimated at "only" 718GB (pg_class.total_bytes).

Best Answer

You can find the pid of the vacuum process, then do a strace -y -p <pid> and see what part of the table it is currently reading, then compare that to the overall size of the table. You don't want to leave strace running all the time, as it slows the process down, so see what you need then ctrl-c out of the strace. You will have to catch it when it is reading the table, not an index, as seeing where it is at in an index is not informative.

Alternatively, if you have the entire history of that verbose output, you can see where it is by summing all of the rows and pages from INFO: "bigtable": removed 11184548 row versions in 157984 pages and comparing that to what the size of the table and number of dead rows you believe it to have.

What you've shown here starts in the middle of one cycle and ends in the middle of the next one. It needs to scan all the indexes each time maintenance_work_mem fills up, and since each dead tuple take about 6 bytes and the default value of maintenance_work_mem is 64MB, that closely matches what you are seeing (64MB/6 = 11184810).

I think there is one step that VACUUM VERBOSE doesn't log which it probably should. First it scans part of the table, accumulating dead tuples until it hits the end of the table or until maintenance_work_mem is full. It I don't think it logs the end of this step. Then it scans each index, logging the info for each one. Then it rescans that same part of the table, marking the dead tuples as being vacuumed away, and logs that info.

I'd interrupt the vacuum and increase maintenance_work_mem and restart. Some amount of work may be lost, but it should be small compared to the speed up you will see going forward. Unless you conclude it is already like 99% done or something.

I'd also change maintenance_work_mem to 1GB in postgresql.conf, as well as lowering autovacuum_vacuum_cost_delay to 2ms, to try to prevent future occurrences of similar problems.

But more importantly, I'd upgrade off of 9.5 ASAP. It is still "supported" as far as security fixes are concerned, but people with stringent performance requirements shouldn't be using a version that old. Yes, it is common management conundrum: "Doing X, Y, and Z are all a higher priority than upgrading". Well, upgrading will probably make X, Y, and Z easier to do.