PostgreSQL – SSD used for a tablespace died. How do I recover the data?

Tags: corruption, disaster recovery, hardware, postgresql

I have a PostgreSQL cluster with two databases:

  • Database A uses the default tablespace. It holds important information but has very few writes per day (< 20) and only has a few tables with a few thousand rows of data.

  • Database B is on its own tablespace on a separate SSD. It has hundreds of GBs of data and adds millions of rows per day. The data is for analytics and is not important.

Recently the SSD holding Database B's tablespace died. Postgres will no longer start up. My priority is dumping the data from Database A.

I was thinking that because Database A sees so few writes and deletes per day, it would be fairly safe to run pg_resetwal and then dump Database A. After that I would re-install PostgreSQL and restore Database A from the dump.
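Roughly, the commands I have in mind look like this (the data directory path and database name are placeholders for my actual setup):

# Stop any half-started server, then discard the WAL.
pg_ctl stop -D /path/to/data -m immediate
# -f will probably be needed since the server was not shut down cleanly.
pg_resetwal -f -D /path/to/data

# Start the cluster and dump only Database A.
pg_ctl start -D /path/to/data
pg_dump -Fc -d database_a -f database_a.dump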

Is there an alternative method to recover my data? Are there any obvious problems with my plan?

(I know that pg_barman should be used to prevent these problems, but my client refused when I suggested it after a similar failure in the past. And yes, RAID would obviously be better than a single SSD, but I don't get to make the hardware decisions.)

Error log when attempting to start postgres:

2020-03-13 16:20:16.678 PDT [55834] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
2020-03-13 16:20:16.686 PDT [55834] LOG:  could not open directory "pg_tblspc/16394/PG_11_201809051": No such file or directory
2020-03-13 16:20:16.688 PDT [57190] LOG:  database system was interrupted while in recovery at 2020-03-13 16:19:29 PDT
2020-03-13 16:20:16.688 PDT [57190] HINT:  This probably means that some data is corrupted and you will have to use the last backup for recovery.
2020-03-13 16:20:17.016 PDT [57190] LOG:  could not stat file "pg_tblspc/16394": No such file or directory
2020-03-13 16:20:17.134 PDT [57190] LOG:  could not open directory "pg_tblspc/16394/PG_11_201809051": No such file or directory
2020-03-13 16:20:17.135 PDT [57190] LOG:  database system was not properly shut down; automatic recovery in progress
2020-03-13 16:20:17.135 PDT [57190] LOG:  could not open directory "pg_tblspc/16394/PG_11_201809051": No such file or directory
2020-03-13 16:20:17.136 PDT [57190] LOG:  redo starts at 7C1/93EB0EA0
2020-03-13 16:20:17.136 PDT [57190] FATAL:  could not create directory "pg_tblspc/16394/PG_11_201809051": No such file or directory
2020-03-13 16:20:17.136 PDT [57190] CONTEXT:  WAL redo at 7C1/93EB0EA0 for Sequence/LOG: rel 16394/26819/26877
2020-03-13 16:20:17.136 PDT [55834] LOG:  startup process (PID 57190) exited with exit code 1
2020-03-13 16:20:17.136 PDT [55834] LOG:  aborting startup due to startup process failure
2020-03-13 16:20:17.139 PDT [55834] LOG:  database system is shut down

PostgreSQL version is 11.2.

Best Answer

In my hands, just creating an empty tablespace directory allows recovery to proceed and the database to open. Either change your mounts so that the existing path of the tablespace maps to a valid directory again, or delete and re-create the symlink "$PGDATA/pg_tblspc/16394" so that it points to a valid, empty directory on some other mountpoint.
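For the second option, something along these lines should work (the replacement directory location and the postgres OS user are assumptions about your layout):

# Create an empty directory on a surviving disk, owned by the postgres user.
mkdir -p /mnt/scratch/pg_tblspc_16394
chown postgres:postgres /mnt/scratch/pg_tblspc_16394
chmod 700 /mnt/scratch/pg_tblspc_16394

# Point the dangling tablespace symlink at it, then try starting the cluster again.
ln -sfn /mnt/scratch/pg_tblspc_16394 "$PGDATA/pg_tblspc/16394"
pg_ctl start -D "$PGDATA"

Once the server is up, you can take your pg_dump of Database A as planned.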

I think this should be much safer than monkeying around with pg_resetwal.