PostgreSQL DELETE FROM Error – Attempted to Delete Invisible Tuple

corruption, disaster-recovery, postgresql, postgresql-9.5, recovery

The error

Trying to delete tuples containing invalid timestamps with

DELETE FROM comments WHERE date > '1 Jan 9999' OR date < '1 Jan 2000' OR date_found > '1 Jan 9999' OR date_found < '1 Jan 2000';

ends in

ERROR:  attempted to delete invisible tuple

There is a mailing list thread from 2009 discussing the exact same error message, in which the OP eventually got it fixed, but I can find no explanation of how he did it or what might have led to this error.

I am at a loss, given the lack of hits on Google and my limited knowledge of PostgreSQL.

What led to corruption

I was running a PostgreSQL 9.5.5 server (~4 TB of data, all default settings except for raised memory limits) on Debian 8 when the OS kernel panicked, probably while rebuilding /dev/md1, where swap was located. Prior to that, PostgreSQL had eaten up almost all of the disk space with a 400 GB log file. The OS never booted again, but disk checks were OK, so I booted from a LiveCD and backed up each block device to an image, just in case. I successfully rebuilt the / directory from /dev/md2, fsck showed a clean filesystem, and I backed up the PGDATA folder onto an external HDD.

What I did to attempt recovery

After formatting the md devices and reinstalling the OS along with a fresh postgresql-9.5, I stopped the PostgreSQL server, moved the PGDATA folder into place, chowned it to the postgres user, and started the server. Everything seemed fine; there were no errors.

As soon as I started pg_dumpall, it died with

Error message from server: ERROR:  timestamp out of range

Naturally, I tried to delete the offending tuples, only to end up with the same invisible-tuple error over and over again.

Things I tried

First off, the DELETE queries failed because of damaged pages, so I set the following settings:

zero_damaged_pages = on
ignore_system_indexes = on
enable_indexscan = off
enable_bitmapscan = off
enable_indexonlyscan = off
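
These settings can also be applied to just the recovery session instead of cluster-wide. A sketch, assuming a superuser connection to the vw database (ignore_system_indexes only takes effect at backend start, so it has to stay in postgresql.conf or be passed via PGOPTIONS):

psql -d vw <<'SQL'
SET zero_damaged_pages = on;
SET enable_indexscan = off;
SET enable_bitmapscan = off;
SET enable_indexonlyscan = off;
DELETE FROM comments
WHERE date > '1 Jan 9999' OR date < '1 Jan 2000'
   OR date_found > '1 Jan 9999' OR date_found < '1 Jan 2000';
SQL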

Now I've noticed that when I run the same queries again, the server zeros out the same pages over and over again, and I am not sure what that means:

invalid page in block 92800 of relation base/16385/16443; zeroing out page

I've tried the following, in no particular order:

  • pg_resetxlog -D $PGDATA did its job without any errors or messages
  • Deleted all indexes including pkey constraints
  • CREATE TABLE aaa AS (SELECT * FROM comments); leads to a segmentation fault in

    heap_deform_tuple (tuple=tuple@entry=0x7f0d1be29b08, tupleDesc=tupleDesc@entry=0x7f0d1a35abe0, values=values@entry=0x7ffd57a5beb0, isnull=isnull@entry=0x7ffd57a65af0 "\001\001")
    It is reproducible and leaves a ~9GB core dump.

  • SELECT COUNT(*) FROM comments; allowed VACUUM comments; to complete; the same trick doesn't work on the other tables.
  • SELECT COUNT(*) FROM photos; and VACUUM photos; now die with ERROR: MultiXactId 302740528 has not been created yet -- apparent wraparound. This error now haunts every table on which the other errors no longer show up.

Thoughts

  • The DB was getting hammered by lots of (possibly duplicate) writes with an ON CONFLICT clause
  • The DB was doing a VACUUM when the kernel panic occurred; I believe it is what's left of that VACUUM that is causing the trouble with the nonexistent MultiXactIds and the invisible tuple
  • The data was collected by a crawler over a span of 2+ years, and I am totally okay with losing some of it
  • Now I do backups
  • There were no relational constraints between tables nor any triggers

Here is the pg_controldata output as of now:

pg_control version number:            942
Catalog version number:               201510051
Database system identifier:           6330224129664261958
Database cluster state:               in production
pg_control last modified:             Thu 08 Dec 2016 01:06:22 AM EET
Latest checkpoint location:           1562/8F9F8A8
Prior checkpoint location:            1562/8F7F460
Latest checkpoint's REDO location:    1562/8F9F8A8
Latest checkpoint's REDO WAL file:    000000010000156200000008
Latest checkpoint's TimeLineID:       1
Latest checkpoint's PrevTimeLineID:   1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID:          0/40781255
Latest checkpoint's NextOID:          67798231
Latest checkpoint's NextMultiXactId:  1
Latest checkpoint's NextMultiOffset:  0
Latest checkpoint's oldestXID:        615
Latest checkpoint's oldestXID's DB:   1
Latest checkpoint's oldestActiveXID:  0
Latest checkpoint's oldestMultiXid:   1
Latest checkpoint's oldestMulti's DB: 1
Latest checkpoint's oldestCommitTsXid:0
Latest checkpoint's newestCommitTsXid:0
Time of latest checkpoint:            Thu 08 Dec 2016 01:06:22 AM EET
Fake LSN counter for unlogged rels:   0/1
Minimum recovery ending location:     0/0
Min recovery ending loc's timeline:   0
Backup start location:                0/0
Backup end location:                  0/0
End-of-backup record required:        no
wal_level setting:                    minimal
wal_log_hints setting:                off
max_connections setting:              100
max_worker_processes setting:         8
max_prepared_xacts setting:           0
max_locks_per_xact setting:           64
track_commit_timestamp setting:       off
Maximum data alignment:               8
Database block size:                  8192
Blocks per segment of large relation: 131072
WAL block size:                       8192
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Maximum size of a TOAST chunk:        1996
Size of a large-object chunk:         2048
Date/time type storage:               64-bit integers
Float4 argument passing:              by value
Float8 argument passing:              by value
Data page checksum version:           0

Updates

  • (9 December 2016) While reading about nonexistent MultiXactIds, I remembered that my database was not under operational load at the moment of the crash, but it was processing a manual VACUUM request. I had taken the webservers and crawlers offline after realizing that there was only 3% of disk space left. I should have checked /var/log for large files, but I mistakenly blamed PostgreSQL and tried VACUUM FULL, only to find it aborted due to the little space left on the device. So I started an ordinary VACUUM and left it at that.
  • (14 December 2016) I downloaded the 9.5 branch of the PostgreSQL sources from GitHub, commented out some blocks in heapam.c and multixact.c, and compiled it in the hope that it would stop throwing these errors. But the server wouldn't start, because it had to be configured with the same flags as the build I had from APT. There were around 47 flags, each requiring a dependency with a non-obvious name, so I gave up on that idea.
  • (16 December 2016) I have found a way to get rid of tuples with invalid timestamps by zeroing out the relevant pages. I first set the following options in psql:

    \set FETCH_COUNT 1
    \pset pager off
    

    I then run SELECT ctid, * FROM comments;, which spits out the ctid of a bad tuple before the query dies. I then fill that page with zeros:

    dd if=/dev/zero of=/var/lib/postgresql/9.5/main/base/16385/16443 bs=8K seek=92803 count=1 conv=notrunc

    But each page zeroed out this way breaks the previous page, so that page 16442 now has a tuple with an invalid timestamp. I am not sure what I am doing wrong here.
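
    For the page arithmetic: the first field of the ctid is the block number, and with 8 kB pages and 1 GB relation segments (131072 blocks per segment, per the pg_controldata output above) that block number has to be split into a segment file and a seek offset within it. A sketch of a helper (the block number and path are the ones from above; for blocks below 131072 it reduces to the dd call I already use):

    # Zero out one heap page given its block number. Blocks past 131071
    # live in the .1, .2, ... segment files of the same relfilenode.
    BLOCK=92803
    REL=/var/lib/postgresql/9.5/main/base/16385/16443
    SEG=$((BLOCK / 131072))
    SEEK=$((BLOCK % 131072))
    [ "$SEG" -gt 0 ] && REL="$REL.$SEG"
    dd if=/dev/zero of="$REL" bs=8K seek="$SEEK" count=1 conv=notrunc

    It is probably safer to stop the server before running dd, so that a copy of the page cached in shared_buffers cannot be written back over the zeroed block.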

  • (16 December 2016) Attempting pg_dump -Fc --table photos vw > photos.bak results in a segmentation fault after 1.3 GB (out of probably 800 GB) written. Here is the server log:

    2016-12-16 18:48:05 EET [19337-2] LOG:  server process (PID 29088) was terminated by signal 11: Segmentation fault
    2016-12-16 18:48:05 EET [19337-3] DETAIL:  Failed process was running: COPY public.photos (id, owner_id, width, height, text, date, link, thumb, album_id, time_found, user_id, lat, long) TO stdout;
    2016-12-16 18:48:05 EET [19337-4] LOG:  terminating any other active server processes
    2016-12-16 18:48:05 EET [19342-2] WARNING:  terminating connection because of crash of another server process
    2016-12-16 18:48:05 EET [19342-3] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
    2016-12-16 18:48:05 EET [19342-4] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
    2016-12-16 18:48:05 EET [19337-5] LOG:  all server processes terminated; reinitializing
    2016-12-16 18:48:06 EET [29135-1] LOG:  database system was interrupted; last known up at 2016-12-14 22:58:59 EET
    2016-12-16 18:48:07 EET [29135-2] LOG:  database system was not properly shut down; automatic recovery in progress
    2016-12-16 18:48:07 EET [29135-3] LOG:  invalid record length at 1562/A302F878
    2016-12-16 18:48:07 EET [29135-4] LOG:  redo is not required
    2016-12-16 18:48:07 EET [29135-5] LOG:  MultiXact member wraparound protections are now enabled
    2016-12-16 18:48:07 EET [19337-6] LOG:  database system is ready to accept connections
    2016-12-16 18:48:07 EET [29139-1] LOG:  autovacuum launcher started
    

    Here is a short stack trace:

    #0  pglz_decompress (source=source@entry=0x7fbfb6b99b13 "32;00/0ag4d/Jnz\027QI\003Jh3A.jpg", slen=<optimized out>,
        dest=dest@entry=0x7fbf74a0b044 "", rawsize=926905132)
    #1  0x00007fc1bf120c12 in toast_decompress_datum (attr=0x7fbfb6b99b0b)
    #2  0x00007fc1bf423c83 in text_to_cstring (t=0x7fbfb6b99b0b)
    

    I have no idea how to work around that.

  • (29 December 2016) I have written a utility that does SELECT * FROM tablename LIMIT 10000 OFFSET 0, incrementing the offset and narrowing down around dead tuples, and it has successfully duplicated the data on my local machine, except for the tuples (I hope the only ones) that I had manually corrupted. It is also supposed to wait if the server restarts. However, I did not have enough space left on my RAID, so I created a tablespace slowdisk on an 8 TB HDD. When I attempt CREATE DATABASE vwslow WITH TABLESPACE slowdisk, it fails with these errors:

    2016-12-29 02:34:13 EET [29983-1] LOG:  request to flush past end of generated WAL; request 950412DE/114D59, currpos 1562/A3030C70
    2016-12-29 02:34:13 EET [29983-2] CONTEXT:  writing block 58368001 of relation base/16385/16473
    2016-12-29 02:34:13 EET [29983-3] ERROR:  xlog flush request 950412DE/114D59 is not satisfied --- flushed only to 1562/A3030C70
    2016-12-29 02:34:13 EET [29983-4] CONTEXT:  writing block 58368001 of relation base/16385/16473
    2016-12-29 02:34:13 EET [30005-44212] postgres@vw ERROR:  checkpoint request failed
    2016-12-29 02:34:13 EET [30005-44213] postgres@vw HINT:  Consult recent messages in the server log for details.
    2016-12-29 02:34:13 EET [30005-44214] postgres@vw STATEMENT:  CREATE DATABASE vwslow WITH TABLESPACE slowdisk;
    

    Manual CHECKPOINT resulted in the same errors.

    A server restart made the checkpoint error go away and let me run my tool. I will answer my question and publish the code if it works.

Best Answer

Well, I have managed to automate the recovery process of SELECT and INSERT INTO, skipping ranges and waiting for the server to restart whenever it crashes. I first coded it in Node; it ripped the undamaged data from comments and is still going.

Yesterday I decided to try Golang, and here is a repo with the Go code: https://github.com/kaivi/pg_ripper

I will update it soon so that it really works around bad tuples instead of just giving up on the whole range containing one.
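
The approach boils down to a loop like the following: a rough shell sketch of the idea, not the actual Go code (database and table names are the ones from this question):

# Page through the table with LIMIT/OFFSET, append chunks that come back
# cleanly, and skip chunks whose backend dies on a bad tuple.
DB=vw
TABLE=comments
STEP=10000
OFFSET=0
OUT="${TABLE}.tsv"

while :; do
    # COPY (...) TO STDOUT streams one chunk as tab-separated text.
    if psql -d "$DB" -c "COPY (SELECT * FROM ${TABLE} LIMIT ${STEP} OFFSET ${OFFSET}) TO STDOUT" > chunk.tsv; then
        # An empty chunk means we have walked past the end of the table.
        [ -s chunk.tsv ] || break
        cat chunk.tsv >> "$OUT"
    else
        # The backend probably died on a bad tuple: note the range, give
        # the postmaster time to finish crash recovery, and move on.
        echo "chunk at OFFSET ${OFFSET} failed, skipping" >&2
        sleep 30
    fi
    OFFSET=$((OFFSET + STEP))
done

Like the current Go code, this gives up on the whole chunk that contains a bad tuple; narrowing down inside the chunk is the part that still needs work.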