Postgresql – Unable to Restart PostgreSQL after attempting to move pg_xlog to different partition

centos-7postgresql

I am having problems getting PostgreSQL back up and running on my CentOS 7 server.

Some Context:
After noticing that I was quickly running out of storage space on the /root partition of the server, I did some quick digging around and found that the culprit was all the files in pg_xlog not being deleted. Since the /home has lots of free space, I elected to move the pg_xlog directory to a new directory on /home in order to buy enough time to solve the initial problem of the log files not being cleared. I then made the original pg_xlog a symbolic link to the new one in the /home partition on some advice I found here.

But that resulted in some weird errors with SELinux not allowing Postgres to read from the new directory even after chown-ing to postgres:postgres (both the symlink and the new directory).

Since I am not all well versed in SELinux I decided to just copy the log files back to their original directory and look into physically extending the partition.

But after re-copying back all the files, PostgreSQL won't start up. The error messages don't seem to be very helpful (at least not to me).
Journalctl -xe is reporting:

Apr 19 00:12:29 localhost.localdomain systemd[1]: Unit postgresql.service entered failed state.
Apr 19 00:12:29 localhost.localdomain systemd[1]: postgresql.service failed.
Apr 19 00:12:29 localhost.localdomain polkitd[1105]: Unregistered >Authentication Agent for unix-process:17236:1124444 (system bus name :1.588, object path /org/freedesk
Apr 19 00:12:31 localhost.localdomain abrt-server[17258]: Email address of sender was not specified. Would you like to do so now? If not, 'user@localhost' is to be used
Apr 19 00:12:31 localhost.localdomain abrt-server[17258]: Email address of receiver was not specified. Would you like to do so now? If not, 'root@localhost' is to be us
Apr 19 00:12:31 localhost.localdomain abrt-server[17258]: Sending an e-mail...
Apr 19 00:12:31 localhost.localdomain abrt-server[17258]: Sending a notification email to: root@localhost
Apr 19 00:12:31 localhost.localdomain abrt-server[17258]: Email was sent to: root@localhost
Apr 19 00:12:31 localhost.localdomain postfix/pickup[10305]: AA0676622C: uid=0 from=<user@localhost>
Apr 19 00:12:31 localhost.localdomain postfix/cleanup[17283]: AA0676622C: message-id=<5ad7b4bf.TazSMT+2CbgyTlql%user@localhost>
Apr 19 00:12:31 localhost.localdomain postfix/qmgr[1963]: AA0676622C: from=<user@localhost.localdomain>, size=45585, nrcpt=1 (queue active)
Apr 19 00:12:31 localhost.localdomain postfix/local[17285]: AA0676622C: to=<root@localhost.localdomain>, orig_to=<root@localhost>, relay=local, delay=0.05, delays=0.04/
Apr 19 00:12:31 localhost.localdomain postfix/qmgr[1963]: AA0676622C: removed

And systemctl status postgresql says:

● postgresql.service – PostgreSQL database server Loaded: loaded
(/usr/lib/systemd/system/postgresql.service; enabled; vendor preset:
disabled) Active: failed (Result: exit-code) since Thu 2018-04-19
00:12:29 EAT; 1min 34s ago Process: 17252 ExecStart=/usr/bin/pg_ctl
start -D ${PGDATA} -s -o -p ${PGPORT} -w -t 300 (code=exited,
status=1/FAILURE) Process: 17243
ExecStartPre=/usr/bin/postgresql-check-db-dir ${PGDATA} (code=exited,
status=0/SUCCESS)

Apr 19 00:12:28 localhost.localdomain systemd1: Starting PostgreSQL
database server… Apr 19 00:12:29 localhost.localdomain systemd1:
postgresql.service: control process exited, code=exited status=1 Apr
19 00:12:29 localhost.localdomain systemd1: Failed to start
PostgreSQL database server. Apr 19 00:12:29 localhost.localdomain
systemd1: Unit postgresql.service entered failed state. Apr 19
00:12:29 localhost.localdomain systemd1: postgresql.service failed.

I must confess I am at my wits' end. Any help, even if it's just a way to generate more helpful error messages would be appreciated.

EDIT:

It appears that during the copying back of the pg_xlog files the ownership of the files changed to root. I have since re-copied those same files again with rsync in order to preserve permissions.
And now I am including PostgreSQL logs as advised in comment section below. The error message in the PostgreSQL logs is:

WARNING:  transaction log file "00000001000000470000008D" could not be archived: too many failures
LOG:  archive command failed with exit code 1
DETAIL:  The failed archive command was: false
LOG:  archive command failed with exit code 1
DETAIL:  The failed archive command was: false
LOG:  archive command failed with exit code 1
DETAIL:  The failed archive command was: false
WARNING:  transaction log file "00000001000000470000008D" could not be archived: too many failures
LOG:  database system was interrupted; last known up at 2018-04-18 19:21:53 EAT
PANIC:  could not open file "pg_xlog/000000010000006900000017" (log file 105, segment 23): Permission denied
LOG:  startup process (PID 17256) was terminated by signal 6: Aborted
LOG:  aborting startup due to startup process failure
LOG:  database system was interrupted; last known up at 2018-04-18 19:21:53 EAT
PANIC:  could not open file "pg_xlog/000000010000006900000017" (log file 105, segment 23): Permission denied
LOG:  startup process (PID 20020) was terminated by signal 6: Aborted
LOG:  aborting startup due to startup process failure
LOG:  database system was interrupted; last known up at 2018-04-18 19:21:53 EAT
FATAL:  the database system is starting up
PANIC:  could not open file "pg_xlog/000000010000006900000017" (log file 105, segment 23): Permission denied
LOG:  startup process (PID 25607) was terminated by signal 6: Aborted
LOG:  aborting startup due to startup process failure
LOG:  database system was interrupted; last known up at 2018-04-18 19:21:53 EAT
FATAL:  the database system is starting up
FATAL:  the database system is starting up
LOG:  invalid magic number 0000 in log file 105, segment 23, offset 9617408
LOG:  invalid primary checkpoint record
LOG:  invalid magic number 0000 in log file 105, segment 23, offset 9601024
LOG:  invalid secondary checkpoint record
PANIC:  could not locate a valid checkpoint record
FATAL:  the database system is starting up
LOG:  startup process (PID 28108) was terminated by signal 6: Aborted
LOG:  aborting startup due to startup process failure
LOG:  database system was interrupted; last known up at 2018-04-18 19:21:53 EAT
LOG:  invalid magic number 0000 in log file 105, segment 23, offset 9617408
LOG:  invalid primary checkpoint record
LOG:  invalid magic number 0000 in log file 105, segment 23, offset 9601024
LOG:  invalid secondary checkpoint record
PANIC:  could not locate a valid checkpoint record
LOG:  startup process (PID 28529) was terminated by signal 6: Aborted
LOG:  aborting startup due to startup process failure

Best Answer

LOG: invalid magic number 0000 in log file 105, segment 23, offset 9617408

That looks pretty bad. It looks like the xlog files were somehow zeroed out when you copied them back and forth. You may need to engage a professional services firm in PostgreSQL data recovery. You should take a back up of all data you can still find and put it somewhere where it can't be altered or deleted. Include both the ones on the /root partition, and whatever is left on /home, in case the problem occurred when copying from /home back to /root.

I believe the xlog file in questions would be 000000xx0000006900000017 (where the xx could be any hex digits). It would be interesting to know if that file is all or mostly zeros.

Also, do you have the server log files from the first time you attempted to start the server? Or, the first PANIC to have occured in the log file?

Related Solutions

PostgreSQL 9.1 Hot Backup Error: the database system is starting up

The message "The database system is starting up." does not indicate an error. The reason it is at the FATAL level is so that it will always make it to the log, regardless of the setting of log_min_messages:

http://www.postgresql.org/docs/9.1/interactive/runtime-config-logging.html#RUNTIME-CONFIG-LOGGING-WHEN

After the rsync, did you really run what you show?:

pgsql -c "select pg_stop_backup();";

Since there is, so far as I know, no pgsql executable, that would leave the backup uncompleted, and the slave would never come out of recovery mode. On the other hand, maybe you really did run psql, because otherwise I don't see how the slave would have logged such success messages as:

Log: consistent recovery state reached at 0/BF0000B0

and:

Log: streaming replication successfully connected to primary

Did you try connecting to the slave at this point? What happened?

The "Success. You can now start..." message you mention is generated by initdb, which shouldn't be run as part of setting up a slave; so I think you may be confused about something there. I'm also concerned about these apparently conflicting statements:

The only ways I have restarted Postgres is through the service postgresql-9.1 restart or /etc/init.d/postgresql-9.1 restart commands. After I receive this error, I kill all processes and again try to restart the database...

Did you try to stop the service through the service script? What happened? It might help in understanding the logs if you prefixed lines with more information. We use:

log_line_prefix = '[%m] %p %q<%u %d %r> '

The recovery.conf script looks odd. Are you copying from the master's pg_xlog directory, the slave's active pg_xlog directory, or an archive directory?

Postgresql – Can’t start postgres server, SHMMAX too small

This question has been asked many times. Try a search. Or Google.

This chapter of the manual deals with it comprehensively.

Best Answer

Related Solutions

PostgreSQL 9.1 Hot Backup Error: the database system is starting up

Postgresql – Can’t start postgres server, SHMMAX too small

Related Question