PostgreSQL – SSD used for a tablespace died. How do I recover the data?

Tags: corruption, disaster recovery, hardware, postgresql

I have a PostgreSQL cluster with two databases:

  • Database A uses the default tablespace. It holds important information but has very few writes per day (< 20) and only has a few tables with a few thousand rows of data.

  • Database B is on its own tablespace on a separate SSD. It has hundreds of GBs of data and adds millions of rows per day. The data is for analytics and is not important.

Recently the SSD holding Database B's tablespace died. Postgres will no longer start up. My priority is dumping the data from Database A.

I was thinking that because Database A sees so few writes and deletes per day, it would be fairly safe to run pg_resetwal and then dump Database A. After that I would re-install PostgreSQL and restore Database A from the dump.
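Roughly, the commands I have in mind look like this (the data directory path and database name are placeholders for my actual setup):

# Stop any half-started server, then discard the WAL.
pg_ctl stop -D /path/to/data -m immediate
# -f will probably be needed since the server was not shut down cleanly.
pg_resetwal -f -D /path/to/data

# Start the cluster and dump only Database A.
pg_ctl start -D /path/to/data
pg_dump -Fc -d database_a -f database_a.dump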

Is there an alternative method to recover my data? Are there any obvious problems with my plan?

(I know that pg_barman should be used to prevent these problems, but my client refused when I suggested it after a similar failure in the past. And yes, RAID would obviously be better than a single SSD, but I don't get to make the hardware decisions.)

Error log when attempting to start postgres:

2020-03-13 16:20:16.678 PDT [55834] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
2020-03-13 16:20:16.686 PDT [55834] LOG:  could not open directory "pg_tblspc/16394/PG_11_201809051": No such file or directory
2020-03-13 16:20:16.688 PDT [57190] LOG:  database system was interrupted while in recovery at 2020-03-13 16:19:29 PDT
2020-03-13 16:20:16.688 PDT [57190] HINT:  This probably means that some data is corrupted and you will have to use the last backup for recovery.
2020-03-13 16:20:17.016 PDT [57190] LOG:  could not stat file "pg_tblspc/16394": No such file or directory
2020-03-13 16:20:17.134 PDT [57190] LOG:  could not open directory "pg_tblspc/16394/PG_11_201809051": No such file or directory
2020-03-13 16:20:17.135 PDT [57190] LOG:  database system was not properly shut down; automatic recovery in progress
2020-03-13 16:20:17.135 PDT [57190] LOG:  could not open directory "pg_tblspc/16394/PG_11_201809051": No such file or directory
2020-03-13 16:20:17.136 PDT [57190] LOG:  redo starts at 7C1/93EB0EA0
2020-03-13 16:20:17.136 PDT [57190] FATAL:  could not create directory "pg_tblspc/16394/PG_11_201809051": No such file or directory
2020-03-13 16:20:17.136 PDT [57190] CONTEXT:  WAL redo at 7C1/93EB0EA0 for Sequence/LOG: rel 16394/26819/26877
2020-03-13 16:20:17.136 PDT [55834] LOG:  startup process (PID 57190) exited with exit code 1
2020-03-13 16:20:17.136 PDT [55834] LOG:  aborting startup due to startup process failure
2020-03-13 16:20:17.139 PDT [55834] LOG:  database system is shut down

PostgreSQL version is 11.2.

Best Answer

In my hands, just creating an empty tablespace directory allows recovery to proceed and the database to open. Either change your mounts so that the existing path of the tablespace maps to a valid directory again, or delete and re-create the symlink "$PGDATA/pg_tblspc/16394" so that it points to a valid, empty directory on some other mountpoint.
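For the second option, something along these lines should work (the replacement directory location and the postgres OS user are assumptions about your layout):

# Create an empty directory on a surviving disk, owned by the postgres user.
mkdir -p /mnt/scratch/pg_tblspc_16394
chown postgres:postgres /mnt/scratch/pg_tblspc_16394
chmod 700 /mnt/scratch/pg_tblspc_16394

# Point the dangling tablespace symlink at it, then try starting the cluster again.
ln -sfn /mnt/scratch/pg_tblspc_16394 "$PGDATA/pg_tblspc/16394"
pg_ctl start -D "$PGDATA"

Once the server is up, you can take your pg_dump of Database A as planned.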

I think this should be much safer than monkeying around with pg_resetwal.