PostgreSQL 7.4 data corruption

postgresql

We're dealing with the following situation (PG version is 7.4.30): somehow, a DB has gone missing from pg_database (psql \l listing doesn't show it anymore). More, while you can still \c to it, a \d table listing only shows a fraction of the tables. A \d on the existing tables shows missing columns as well.. However, all the tables (still visible or not) may be queried by selects via psql and the actual data and columns in the result sets seem just fine.

We backed up the data folder and for the last 24h tried to recover the situation to a point where pg_dump works. Unfortunately we haven't been fully successful yet although we tried tons of suggestions in previous similar posts. Reindex didn't help, but a "VACUUM FULL FREEZE [ANALYZE]" did free 70% (14GB to 4) of the used disk space. Of course, the database wasn't vacuumed regularly as it should have been so:

WARNING: some databases have not been vacuumed in over 2 billion transactions
DETAIL: You may have already suffered transaction-wraparound data loss.

A full vacuum without the freeze option does make all databases, tables and users visible, but duplicated, i.e. \l then shows the database twice and a \d on the database lists all tables twice. The postgres user appears twice in pg_user, etc. pg_dump won't work if the database is not listed and won't either if the postgres user is duplicated..

Does this sound familiar to anyone? It's a pity that all the data seems to be there, but we're not able to recover and restore it to a clean DB. We'd appreciate any suggestions.

Thank you!

PS: I've waiting for feedback from the pgsql-admin mailing list as well.

Best Answer

First, if you haven't done the first two that vyegorov suggests do it now. If you have already vacuumed, it is not clear what sort of damage you may have already done in this case. This error should not be possible on new versions of PostgreSQL due to a number of checks, and had you upgraded some time ago, this wouldn't have happened. I highly recommend trying to stay on supported branches in the future.

I want to take a moment and describe what likely caused the problem and what it means. The prognosis IMO is not good and recovering the data, if it is even possible, is likely to be expensive and time consuming. I sincerely hope you have a good backup from before the transaction wraparound. If not, ouch....

What Went Wrong

PostgreSQL uses something called MVCC, which means that old versions of rows are kept around until they are clearly no longer used. MVCC as practiced by PostgreSQL stamps each row with a minimum transaction and a maximum transaction, and uses these for visibility management. When a transaction rolls back those rows entered are no longer visible and those rows deleted remain visible. Transaction ID's are 32-bit integers.

Periodically you are supposed to vacuum PostgreSQL instances. This, among other things, manages MVCC so that fewer transaction id's need to be checked, manages free space in tables, and can reset the transaction id sequence.

When the transaction ID wraps around, bad things happen but they boil down to the fact that PostgreSQL can no longer be sure what rows are visible and what rows are not. Note thais applies not only to rows in your own tables but in the system catalogs as well. This is a very, very bad thing. The best approach, if you can do it, is to restore from a backup before the wraparound occurred. This is why databases and tables are not showing up and why vacuuming full duplicates them.

Why Upgrading Would have Fixed This

Modern versions of PostgreSQL come with autovacuum automatically enabled which runs vacuum processes in the background to help prevent this sort of problem. More recent versions also will refuse to start new transactions for non-superusers once wraparound is approaching. This gives you a chance to detect and correct the problem before you suffer possibly catastrophic data loss.

PostgreSQL 7.4 hasn't been supported in nearly three years. I don't know when the last time anything was vacuumed but it must have been billions of transactions ago. This is not good.

Related Question