PostgreSQL Backup Failure – pg_dump Bad File Descriptor

pg-dumppostgresqlpostgresql-11

I am running Postgres 11 on Windows 10 and this should be my last backup before moving to Linux:

"C:\PostgreSQL\pg11\bin\pg_dump.exe" -U postgres -h localhost -p 5432 --verbose --jobs=4 --format=d --blobs --clean --create --if-exists --compress=0 --dbname="DATABASE_NAME" --file="P:\PG11_dump_20190217"

This is the end of the log:

...
pg_dump: dumping contents of table "schema.table"
pg_dump: finished item 255955 TABLE DATA table
pg_dump: finished item 255956 TABLE DATA table
pg_dump: pg_dump: dumping contents of table "schema.table"
dumping contents of table "schema.table"
pg_dump: finished item 255953 TABLE DATA table
pg_dump: finished item 255952 TABLE DATA table
pg_dump: could not fsync file "P:\PG11_dump_20190217/291267.dat": Bad file descriptor
pg_dump: could not fsync file "P:\PG11_dump_20190217/293909.dat": Bad file descriptor
pg_dump: could not fsync file "P:\PG11_dump_20190217/294060.dat": Bad file descriptor
pg_dump: could not fsync file "P:\PG11_dump_20190217/295715.dat": Bad file descriptor
pg_dump: could not fsync file "P:\PG11_dump_20190217/296142.dat": Bad file descriptor

I searched for the pg_dump: could not fsync file: Bad file descriptor error but could not find any help.

Maybe of importance: The database is about 3.6 TB and I am using the PostGIS extension.

Question: How to prevent this error or work around it?

Additional question: Could I do a file level backup and copy the data folder from windows to linux, given an identical Postgres version and both servers stopped?

Best Answer

Question: How to prevent this error or work around it?

fsync ensures that the data is physically written to disk, so that in case of a crash or loss or power a few seconds after the backup, no data is lost in unflushed memory buffers. Unfortunately it doesn't seem to work in your case. You may add --no-sync to avoid this step, although the errors shown are probably harmless, in the sense that surely they don't affect the contents of the backup. You may also use sync as an external program if you're wary of skipping the sync.

Additional question: Could I do a file level backup and copy the data folder from windows to linux, given an identical Postgres version and both servers stopped?

No. The data files are not portable in general, and especially not across totally different operating systems.

First action

When facing corruption, before you do ANYTHING ELSE, take a complete file-system-level copy of the damaged database. See: http://wiki.postgresql.org/wiki/Corruption . Failure to do so destroys evidence about what caused the corruption, and means that if your repair efforts go badly and make things worse you can't undo them. Do not attempt any repair first.

You appear to have made the right choice an done that, assuming you really copied the whole datadir. However, you appear to have then messed with the copy. Before you do anything else, make a copy of that damaged datadir somewhere safe and do not touch it again. This is your hope of recovery. Never work on this copy - duplicate it, and test recovery attempts on the duplicate.

Don't trust that server!

BTW, I strongly recommend that you stop using this server:

We had some disk corruption on our server

is not ok. Until you know why that happened you should not be using that server. Retire it or put it aside and get some trustworthy hardware.

If you cannot do that, make absolutely certain you're doing at least daily logical backups and streaming replication with WAL archiving to a secondary server. Treat the faulty server as if it might vanish or eat your data again at any time.

If the disk corruption corresponded with a power failure, it's probably due to unsafe write-back caching or a system that's ignoring disk flush requests. This is why I do plug-pull testing on server deployments, and don't buy cheap SSDs.

Backups?

This is the point where I tell you that you need to restore from those backups you've been making and testing regularly, preferably the point-in-time recovery or streaming replication setup.

If that was an option you wouldn't be posting here, though.

Ask for help in the right place

Once you have a safe copy of the datadir set aside, post for help on the pgsql-general mailing list.

If the data is important and hard to recover, be prepared to pay for data recovery / repair expertise. See http://www.postgresql.org/support/professional_support/ . (I work for one of the listed companies, just by way of fair disclosure).

Corruption cases tend to be somewhat unique and require lots of back-and-forth, so they're not usually a good fit for Stack Overflow.

Internal structure

As for the internals of the files in base/ ... you really need the system catalogs to interpret them usefully. The table structure is documented in PostgreSQL internals.

The structure of individual relation extents is basically a header, followed by a bunch of columns that're interpreted based on the system catalogs. If you've lost the system catalogs you've got no reliable way to tell what each column's type and name is, etc.

The other problem you have is that you've lost the transaction commit logs (pg_clog) that keep a record of open, commited, and rolled back transactions. With that data lost, you will need to do a dirty read of the tables in order to recover any data, because you no longer know which tuples were added by transactions that later rolled back, which are deleted, which are old versions of updated tuples, etc.

Recovery?

... will be very hard.

In theory you might be able to read the tuples out of the heap table extents. I am not aware of any tools to do this. You would need to be able to construct new system catalogs that matched the on-disk structure of the tables, probably with a stand-alone PostgreSQL backend (postgres --single).

I'd like it if PostgreSQL had better recovery options, but frankly, we prefer to have good backups and use streaming replication etc to avoid the need in the first place. Repairing DB corruption is always iffy and results in untrustworthy, mangled data. So in general - don't do that.

It might help if you still have a copy of your data directory from before you ran pg_resetxlog.

Prevention

I wrote a bit on corruption prevention a while ago. See this post on my old blog.

PostgreSQL – pg_restore Error: Schema ‘public’ Already Exists

The error is harmless but to get rid of it, I think you need to break this restore into two commands, as in:

dropdb -U postgres mydb && \
 pg_restore --create --dbname=postgres --username=postgres pg_backup.dump

The --clean option in pg_restore doesn't look like much but actually raises non-trivial problems.

For versions up to 9.1

The combination of --create and --clean in pg_restore options used to be an error in older PG versions (up to 9.1). There is indeed some contradiction between (quoting the 9.1 manpage):

--clean Clean (drop) database objects before recreating them

and

--create Create the database before restoring into it.

Because what's the point of cleaning inside a brand-new database?

Starting from version 9.2

The combination is now accepted and the doc says this (quoting the 9.3 manpage):

--clean Clean (drop) database objects before recreating them. (This might generate some harmless error messages, if any objects were not present in the destination database.)

--create Create the database before restoring into it. If --clean is also specified, drop and recreate the target database before connecting to it.

Now having both together leads to this kind of sequence during your restore:

DROP DATABASE mydb;
...
CREATE DATABASE mydb WITH TEMPLATE = template0... [other options]
...
CREATE SCHEMA public;
...
CREATE TABLE...

There is no DROP for each individual object, only a DROP DATABASE at the beginning. If not using --create this would be the opposite.

Anyway this sequence raises the error of public schema already existing because creating mydb from template0 has imported it already (which is normal, it's the point of a template database).

I'm not sure why this case is not handled automatically by pg_restore. Maybe this would cause undesirable side-effects when an admin decides to customize template0 and/or change the purpose of public, even if we're not supposed to do that.