I am using Fedora 15 with PostgreSQL 9.1.4. Fedora crashed recently, after which:
An attempt to start the PostgreSQL server:
service postgresql-9.1 start
gives
Starting postgresql-9.1 (via systemctl): Job failed. See system logs and 'systemctl status' for details.
[FAILED]
However, the server does start normally the first time I start it after a system reboot.
But an attempt to use psql gives this error:
psql: could not connect to server: No such file or directory
Is the server running locally and accepting
connections on Unix domain socket "/tmp/.s.PGSQL.5432"?
The .s.PGSQL.5432 file is not present anywhere on the system; running locate .s.PGSQL.5432 outputs nothing.
The system log has this:
Aug 14 17:31:58 localhost systemd[1]: postgresql-9.1.service: control process exited, code=exited status=1
Aug 14 17:31:58 localhost systemd[1]: Unit postgresql-9.1.service entered failed state.
Running
systemctl status postgresql-9.1.service
gives:
postgresql-9.1.service - SYSV: PostgreSQL database server.
Loaded: loaded (/etc/rc.d/init.d/postgresql-9.1)
Active: failed since Tue, 14 Aug 2012 17:31:58 +0530; 58s ago
Process: 2811 ExecStop=/etc/rc.d/init.d/postgresql-9.1 stop (code=exited, status=1/FAILURE)
Process: 12423 ExecStart=/etc/rc.d/init.d/postgresql-9.1 start (code=exited, status=1/FAILURE)
Main PID: 2551 (code=exited, status=1/FAILURE)
CGroup: name=systemd:/system/postgresql-9.1.service
I had not changed the default setting of fsync, so I assume it was set to on. I am on an HDD, and the HDD crashed.
HDD crash
The HDD crash led to a manual fsck, run from a text prompt rather than a GUI, which repaired a very large number of inodes. After that I restarted the system with Ctrl+Alt+Delete.
PostgreSQL's log has this:
LOG: database system was interrupted; last known up at 2012-08-14 17:31:57 IST
LOG: database system was not properly shut down; automatic recovery in progress
LOG: record with zero length at 0/41A4E58
LOG: redo is not required
FATAL: could not access status of transaction 1
DETAIL: Could not open file "pg_multixact/offsets/0000": No such file or directory.
LOG: startup process (PID 13016) exited with exit code 1
LOG: aborting startup due to startup process failure
Update
Trying to start the server after taking a file system level copy of the /var/lib/pgsql directory and running ./pg_resetxlog -f /var/lib/pgsql/9.1/data/ still yields:
LOG: database system was interrupted; last known up at 2012-08-14 18:46:36 IST
LOG: database system was not properly shut down; automatic recovery in progress
LOG: record with zero length at 0/6000078
LOG: redo is not required
FATAL: could not access status of transaction 1
DETAIL: Could not open file "pg_multixact/offsets/0000": No such file or directory.
LOG: startup process (PID 13766) exited with exit code 1
LOG: aborting startup due to startup process failure
Best Answer
The real answer will be in the PostgreSQL logs, in /var/lib/pgsql/data/pg_log.
However, before you take any action: it is vital that you take a file system level copy of your database before attempting repair if any of your data is valuable to you. See http://wiki.postgresql.org/wiki/Corruption . You must copy the whole data directory. On Fedora that's /var/lib/pgsql/data by default, but verify that's correct for your install.
Based on the logs you've posted you certainly have some degree of database corruption. The storage that the database is on (the hard drive or file system) is most likely damaged. Take a copy NOW, and put it on a different hard drive or system.
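A minimal sketch of taking that file-system level copy, assuming the server is already stopped. The helper name backup_datadir and the destination path in the usage note are illustrative assumptions, not part of any PostgreSQL tooling.

```shell
#!/bin/sh
# Sketch only: copy a STOPPED PostgreSQL data directory somewhere safe
# before attempting any repair. backup_datadir is a hypothetical helper.
backup_datadir() {
    src=$1    # existing data directory
    dest=$2   # destination, ideally on a different physical disk

    # Every data directory contains a PG_VERSION file; refuse anything else.
    if [ ! -f "$src/PG_VERSION" ]; then
        echo "no PG_VERSION in $src; not a data directory?" >&2
        return 1
    fi

    mkdir -p "$dest" || return 1

    # -a keeps ownership, permissions and timestamps where possible;
    # "$src/." copies the directory contents, including dotfiles.
    cp -a "$src/." "$dest/"
}
```

Usage might look like `service postgresql-9.1 stop` followed by `backup_datadir /var/lib/pgsql/9.1/data /mnt/otherdisk/pgdata-copy`, where /mnt/otherdisk is a placeholder for wherever your second drive is mounted.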
Only once you have made a full file-system level copy of your data directory, try using pg_resetxlog to clear the damaged transaction logs and start your database. Even if it starts it is highly likely to be corrupt; you should pg_dump it, then re-initdb it and restore the dump to the fresh instance.
If you still can't start it after a pg_resetxlog then post an updated log of the startup attempt after resetxlog. It's possible you'll need to start Pg in stand-alone (single-user) mode, connecting to the "postgres" database first.
If that works, giving you a backend> prompt, try again with the name of the DB you want to connect to in place of "postgres". You should be able to SELECT, COPY data from tables, etc.
If that doesn't work, i.e. you can't start a standalone backend, then it's probably time to restore from backups, since you're sensible enough to have them. If anyone else reading this is in the same position, contact an experienced PostgreSQL consultant to see if they can recover data from your database. Be prepared to pay for their time and expertise.
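As a rough sketch of the commands involved, assuming the standard 9.1 invocations (pg_resetxlog -f on the data directory, and postgres --single with the database name as the last argument): the helper below only prints the commands for review and runs nothing. The name salvage_commands is a hypothetical helper, and the data directory and database name are whatever applies to your install.

```shell
#!/bin/sh
# Sketch only: PRINT the salvage commands for review; nothing is executed.
# salvage_commands is a hypothetical helper name.
salvage_commands() {
    datadir=$1   # damaged data directory (already copied elsewhere)
    dbname=$2    # database you want to pull data out of

    cat <<EOF
pg_resetxlog -f $datadir
postgres --single -D $datadir postgres
postgres --single -D $datadir $dbname
EOF
}

# Example: salvage_commands /var/lib/pgsql/9.1/data mydb
```

The printed commands would be run as the postgres operating-system user with the server stopped; the second line is the first stand-alone attempt against the "postgres" database, the third repeats it against your own database.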
Your file system is probably damaged
The severity of the damage to the PostgreSQL install suggests that your whole file system is probably damaged. You may wish to consider restoring the whole system from a backup or reinstalling it.
I would not trust this file system, fsck or no fsck.
SMART-test your drive
I also recommend that you run a SMART check on your hard drive with smartctl from smartmontools; assuming it's /dev/sda that'd be smartctl -d ata -a /dev/sda | less. Look for a failed health test, uncorrectable sectors, a high read error rate, a reallocated_sector_count of more than 2 or 3, or a non-zero current_pending_sector. Run smartctl -d ata -t long /dev/sda to execute a non-destructive self test on your HDD; it won't interrupt normal functioning of the system. When the estimated time has elapsed, run smartctl -d ata -a /dev/sda again and look at the self test log to see if it passed.
If anything looks less than perfect, replace the drive.
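The attribute checks above can be roughly automated. The sketch below reads a smartctl attribute table on standard input and flags the counters mentioned. It assumes the usual ten-column layout with the raw value in column 10 and the attribute names Reallocated_Sector_Ct, Current_Pending_Sector and Offline_Uncorrectable; vendors differ, so treat both as assumptions. check_smart_attrs is a hypothetical helper, and the threshold of 2 is one reading of "more than 2 or 3".

```shell
#!/bin/sh
# Sketch only: flag worrying SMART counters from a smartctl attribute table
# read on stdin. Assumes the raw value is the 10th whitespace-separated field.
check_smart_attrs() {
    awk '
        BEGIN { bad = 0 }
        # More than 2 reallocated sectors (assumed threshold).
        $2 == "Reallocated_Sector_Ct"  && $10 + 0 > 2 { print "WARN " $2 " raw=" $10; bad = 1 }
        # Any pending or offline-uncorrectable sector is a bad sign.
        $2 == "Current_Pending_Sector" && $10 + 0 > 0 { print "WARN " $2 " raw=" $10; bad = 1 }
        $2 == "Offline_Uncorrectable"  && $10 + 0 > 0 { print "WARN " $2 " raw=" $10; bad = 1 }
        # Exit status 1 if anything was flagged, 0 otherwise.
        END { exit bad }
    '
}

# Example: smartctl -d ata -A /dev/sda | check_smart_attrs
```

This does not replace reading the full smartctl -a output or the self test log; it only surfaces the specific counters named above.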
In future, consider automating this testing via smartd for early warning of drive failures.
(Content in this post was obsoleted by updates to the question. If you're troubleshooting a similar problem, look at this answer's edit history).