Postgresql – Recover PostgreSQL database from WAL errors on startup

corruptionpostgresqlUbuntu

I'm trying to set up an OpenStreetMap server on an Ubuntu 12.04 machine using the Ubuntu packages listed at switch2osm.org. I initially installed and set up everything using a Northeast US-only map extract, but now I want to install the entire planet of maps. I downloaded planet-latest.osm.bz2 and ran osm2pgsql --slim -C 60000 planet-latest.osm.bz2 as a user with write permission to the database; this was the same command that worked to install us-northeast.osm.pbf earlier. I came back the next day to find this command appeared to finish successfully, but for some reason the rendering daemon wasn't generating new tiles from the new data. I tried restarting renderd, and when that had no effect I tried restarting the PostgreSQL server with sudo /etc/init.d/postgresql restart. However, server startup failed with the following errors in the log:

2012-07-13 18:54:59 UTC WARNING:  page 1525147 of relation base/16385/477861 was uninitialized
2012-07-13 18:54:59 UTC WARNING:  page 2247965 of relation base/16385/477861 was uninitialized
...500 more lines like this...
2012-07-13 18:54:59 UTC WARNING:  page 2262926 of relation base/16385/477861 was uninitialized
2012-07-13 18:54:59 UTC PANIC:  WAL contains references to invalid pages
2012-07-13 18:55:00 UTC LOG:  startup process (PID 22826) was terminated by signal 6: Aborted

(Pastebin of entire log here).

There isn't much information on these kinds of errors on the Internet, but from what I can find it seems to mean that either my indexes are corrupted or my Write-Ahead-Log is. The only way to fix corrupted indexes, though, is to start the database in single-user mode and rebuild them, and I can't even do that because I get the same fatal errors even when I start in single-user mode with indexing disabled.

Is there any way for me to delete the Write-Ahead Log and force the server to start up "from scratch", or a fix for this kind of corruption that doesn't require first starting the database successfully?

Alternatively, is there a way for me to delete the database and just re-import all the planet data, given that I can't start the server to execute the DROP DATABASE command?

UPDATE:

Following Craig Ringer's suggestion, I went and looked through the database logs from before the WAL errors started occurring to see if I could find any suspicious behavior. In the log from immediately before the first instance of WAL errors, I found these suspicious-looking lines:

2012-07-13 00:20:51 UTC LOG:  received fast shutdown request
2012-07-13 00:20:51 UTC LOG:  aborting any active transactions
2012-07-13 00:20:51 UTC FATAL:  terminating connection due to administrator command
2012-07-13 00:20:51 UTC FATAL:  terminating connection due to administrator command
2012-07-13 00:20:51 UTC FATAL:  terminating connection due to administrator command
2012-07-13 00:20:51 UTC FATAL:  terminating connection due to administrator command
2012-07-13 00:20:54 UTC FATAL:  terminating connection due to administrator command
2012-07-13 00:20:54 UTC STATEMENT:  CREATE TABLE planet_osm_polygon_tmp AS SELECT * 
FROM planet_osm_polygon ORDER BY way;

2012-07-13 00:20:55 UTC FATAL:  terminating connection due to administrator command
2012-07-13 00:20:55 UTC STATEMENT:  CREATE INDEX planet_osm_ways_nodes ON planet_osm_ways 
USING gin (nodes)  WITH (FASTUPDATE=OFF);

2012-07-13 00:20:57 UTC FATAL:  terminating connection due to administrator command
2012-07-13 00:20:57 UTC STATEMENT:  CREATE TABLE planet_osm_line_tmp AS SELECT * 
FROM planet_osm_line ORDER BY way;

2012-07-13 00:21:51 UTC LOG:  received immediate shutdown request
2012-07-13 00:21:52 UTC WARNING:  terminating connection because of crash of another
server process
2012-07-13 00:21:52 UTC DETAIL:  The postmaster has commanded this server process 
to roll back the current transaction and exit, because another server process 
exited abnormally and possibly corrupted shared memory.
2012-07-13 00:21:52 UTC HINT:  In a moment you should be able to reconnect to the 
database and repeat your command.
2012-07-13 00:21:52 UTC LOG:  could not send data to client: Broken pipe
2012-07-13 00:21:58 UTC WARNING:  terminating connection because of crash of another 
server process
2012-07-13 00:21:58 UTC DETAIL:  The postmaster has commanded this server process 
to roll back the current transaction and exit, because another server process
exited abnormally and possibly corrupted shared memory.
2012-07-13 00:21:58 UTC HINT:  In a moment you should be able to reconnect to the
 database and repeat your command.
2012-07-13 00:21:58 UTC LOG:  could not send data to client: Broken pipe

(Pastebin of the entire log is here)

When it says "terminating connection due to administrator command," I assume that was my command to restart the database server. But it looks like the shutdown somehow failed horribly, resulting in corruption of shared memory. This doesn't make sense, because I restarted it "cleanly," using the /etc/init.d/postgres restart script, not an abrupt kill or manually logging in as postgres. Am I interpreting this log incorrectly? Or is there actually a problem with using /etc/init.d/postgres restart to restart a PostgreSQL server?

(Please note, since my question was moved to Database Admin, where I'm a "new user," I no longer have the ability to upvote your answers. This doesn't mean I don't appreciate the help).

Best Answer

UPDATE: it looks like this is a bug in the Debian/Ubuntu packaging of PostgreSQL, where the init scripts - extremely unsafely - kill -9 the postmaster and remove postmaster.pid. See this post on pgsql-general.

See:

Personally, I've gone and edited my init scripts to get rid of this rather hairy and dangerous code.

The original answer

Please go back in the logs to before the restart and see if you can find any errors. WAL corruption absolutely should not happen, so if it has it's important to look into why. If you can upload a copy of the whole log to a pastebin or something that'd be really handy.

The only time where WAL corruption is an accepted possibility with PostgreSQL is if you are running with fsync=off set in PostgreSQL.conf and your system crashes or unexpectedly loses power. If that's not the cause, it'd be really good to look into what happened.

Please do not use pg_resetxlog without some idea why your xlogs are damaged. If the transaction logs become damaged something is badly wrong and you need to find out what. If you band-aid it now, you might be bitten by it later when you care about the data.

The transaction logs exist for a reason and just removing them can leave your tables and indexes in an inconsistent, damaged state. After a pg_resetxlog it's a very good idea to pg_dumpall, drop your cluster, re-initdb, and reload the DB. As I said, though, this should not happen and you should look back in the logs for clues about what could've happened.

Now read the comments