PostgreSQL 9.5 database showing signs of corruption in logs, though clients work normally

Tags: corruption, postgresql, postgresql-9.5

We have a database server running PostgreSQL 9.5 on an Ubuntu 16.04 LTS box, with a total data size of just under 30 GB.

We've set archive_mode to on, and archive_command rsyncs WAL files to another server running Barman. To prepare for an eventual PostgreSQL upgrade using pg_upgrade, we've set up a disposable server instance, also running PostgreSQL 9.5, as the target for barman recover. The idea is to get a snapshot of the DB there on which we can test pg_upgrade to a PostgreSQL 12 instance running on the same server.
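For context, the archiving and recovery setup looks roughly like this; the Barman server name, host names and paths below are placeholders, not our actual values:

# postgresql.conf on the production server
archive_mode = on
archive_command = 'rsync -a %p barman@backup.example.com:/var/lib/barman/pg/incoming/%f'

# restore of the latest backup to the disposable server, run on the Barman host
barman recover --remote-ssh-command "ssh postgres@disposable.example.com" \
    pg latest /var/lib/postgresql/9.5/main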

It was during this process, running barman recover to the disposable server, that I discovered an issue. The target instance was stopped, and after Barman had finished rsyncing the contents of the data directory over, I started it again. It came online just fine and accepted queries as usual, but this error started appearing in the log file:

2020-12-08 12:05:20 EET ERROR:  could not access status of transaction 79509466
2020-12-08 12:05:20 EET DETAIL:  Could not open file "pg_clog/004B": No such file or directory.
2020-12-08 12:05:20 EET CONTEXT:  automatic vacuum of table "template0.pg_catalog.pg_statistic"

It got printed four times on server startup, and then again every 60 seconds.

I first assumed that this was an issue with Barman or with the data on its server, but since no operational problems could be found there, I turned my attention to the production server itself. Lo and behold, that error has been appearing on the production server ever since Oct 21. So this was never about something being wrong with the backup or restoration process; the actual production data itself is what's wrong!

This has gone unnoticed. Nothing that we can discern happened on the server (a VPS on Upcloud) on Oct 21: all of our web applications relying on the server, for which we have robust error alerts, have kept working normally throughout. Nobody can remember any manual operations having been run there that day. It's possible that this was an issue with the VPS provider, who on Oct 28 reported having detected a storage backend problem, which was resolved shortly afterwards. That was a week after our problem first appeared in the logs, but I suppose the symptoms may have started earlier.

Alphabetically, the first file inside pg_clog is 004C. Based on some advice found online, I tried creating a 256 kB file of all zeroes named 004B (the command is sketched below, after the log excerpt). After I did, the error changed to:

2020-12-08 13:35:25 EET CONTEXT:  automatic vacuum of table "template0.pg_catalog.pg_statistic"
2020-12-08 13:35:40 EET ERROR:  found xmax 79509466 from before relfrozenxid 80163082

(This was on the disposable server that was the barman recover target; I haven't dared touch production.) This gets printed at 15-second intervals.
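For reference, the zero-filled file was created along these lines (the data directory path is the Ubuntu default and may differ from yours; the file needs to be owned by the postgres user):

# create a 256 kB file of zero bytes as a stand-in pg_clog segment
sudo dd if=/dev/zero of=/var/lib/postgresql/9.5/main/pg_clog/004B bs=256k count=1
sudo chown postgres:postgres /var/lib/postgresql/9.5/main/pg_clog/004B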

I have no idea how to analyze this any further, let alone fix it. Everything is working normally as far as our database clients are concerned, but this needs to be corrected. Any help is appreciated; note that my knowledge of PostgreSQL data storage internals is close to zero.

Best Answer

If only pg_statistic is affected, you might get out of this cheaply:

Set allow_system_table_mods = on in postgresql.conf and restart PostgreSQL. Then run

TRUNCATE pg_catalog.pg_statistic;
ANALYZE;

Don't forget to reset allow_system_table_mods when you are done.
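A sketch of the whole sequence on an Ubuntu-packaged cluster; the configuration path, the cluster name "main" and the database name are assumptions, so adjust them to your installation:

# enable system table modifications; on 9.5 this only takes effect after a restart
echo "allow_system_table_mods = on" | sudo tee -a /etc/postgresql/9.5/main/postgresql.conf
sudo pg_ctlcluster 9.5 main restart

# throw away the corrupted statistics and rebuild them
sudo -u postgres psql -d your_database -c "TRUNCATE pg_catalog.pg_statistic;"
sudo -u postgres psql -d your_database -c "ANALYZE;"

# remove the setting again and restart once more
sudo sed -i '/allow_system_table_mods/d' /etc/postgresql/9.5/main/postgresql.conf
sudo pg_ctlcluster 9.5 main restart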

But you should never trust a PostgreSQL database cluster that has suffered data corruption. Create a new cluster, export the old one with pg_dumpall and import the dump into the new cluster. Then discard the cluster that had the corruption.

You might want to create the new cluster with --data-checksums so that you get an error message if data blocks change while sitting on disk.
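A minimal sketch of that rebuild, assuming the Ubuntu packaging tools and that the new cluster is the PostgreSQL 12 instance mentioned in the question; the cluster name and port numbers are placeholders:

# create a fresh PostgreSQL 12 cluster with data checksums enabled
sudo pg_createcluster 12 clean --port=5433 -- --data-checksums
sudo pg_ctlcluster 12 clean start

# export everything from the old cluster and load it into the new one
sudo -u postgres pg_dumpall --port=5432 | sudo -u postgres psql --port=5433 -d postgres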

Now you should investigate the possible causes. Upgrade to the latest minor release and test your hardware for problems.
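For the minor-release part, something along these lines should do, assuming the stock Ubuntu 16.04 packages:

# check which minor release is currently running
sudo -u postgres psql -c "SELECT version();"

# upgrade to the newest packaged 9.5.x
sudo apt-get update
sudo apt-get install --only-upgrade postgresql-9.5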