PostgreSQL cluster won’t start, no log of the issue, message on WAL redo

I'm having trouble with a system that I didn't set up or configure myself, but pieces of the system seem to generally be in more or less default places. My current theory is that the PostgreSQL database on the (Ubuntu 16.04.3 LTS) system stopped working when the storage on the server completely filled up.

Storage has been increased since, some old files removed, but the PostgreSQL service/cluster refuses to start. See below:

ubuntu@ip-<my-ip-here>:~$ sudo pg_ctlcluster 11 main start
Job for [email protected] failed because a configured resource limit was exceeded. See "systemctl status [email protected]" and "journalctl -xe" for details.

ubuntu@ip-<my-ip-here>:~$ systemctl status [email protected]
● [email protected] - PostgreSQL Cluster 11-main
   Loaded: loaded (/lib/systemd/system/[email protected]; disabled; vendor preset: enabled)
   Active: failed (Result: resources) since Mon 2023-01-09 16:24:10 AEST; 7s ago
  Process: 2281 ExecStart=/usr/bin/pg_ctlcluster --skip-systemctl-redirect %i start (code=exited, status=1/FAILURE)

Jan 09 16:24:10 ip-172-31-23-196 postgresql@11-main[2281]: 2023-01-09 16:24:10.052 AEST [2289] CONTEXT:  WAL redo at 7F/200AA78 for XLOG/FPI:
Jan 09 16:24:10 ip-172-31-23-196 postgresql@11-main[2281]: 2023-01-09 16:24:10.053 AEST [2288] LOG:  startup process (PID 2289) exited with exit code 1
Jan 09 16:24:10 ip-172-31-23-196 postgresql@11-main[2281]: 2023-01-09 16:24:10.053 AEST [2288] LOG:  aborting startup due to startup process failure
Jan 09 16:24:10 ip-172-31-23-196 postgresql@11-main[2281]: 2023-01-09 16:24:10.059 AEST [2288] LOG:  database system is shut down
Jan 09 16:24:10 ip-172-31-23-196 postgresql@11-main[2281]: pg_ctl: could not start server
Jan 09 16:24:10 ip-172-31-23-196 postgresql@11-main[2281]: Examine the log output.
Jan 09 16:24:10 ip-172-31-23-196 systemd[1]: [email protected]: Can't open PID file /run/postgresql/11-main.pid (yet?) after start: No such file or directory
Jan 09 16:24:10 ip-172-31-23-196 systemd[1]: Failed to start PostgreSQL Cluster 11-main.
Jan 09 16:24:10 ip-172-31-23-196 systemd[1]: [email protected]: Unit entered failed state.
Jan 09 16:24:10 ip-172-31-23-196 systemd[1]: [email protected]: Failed with result 'resources'.

The Can't open PID file /run/postgresql/11-main.pid (yet?) after start: No such file or directory part seems like a red herring – a problem that occurs after the actual problem has already happened. (also, if I create that file, with rights and ownership for the postgres user, it get removed as expected)

So, I assume the problem is with the CONTEXT: WAL redo at 7F/200AA78 for XLOG/FPI: followed by LOG: startup process (PID 2289) exited with exit code and aborting startup due to startup process failure – something apparently errors out during 'WAL redo'.

The line Examine the log output. sounds wise, but I can't find a log in any of the places I'd expect it:

ubuntu@ip-<my-ip-here>:~$ ls /etc/postgresql/11/main
conf.d  environment  pg_ctl.conf  pg_hba.conf  pg_ident.conf  postgresql.conf  start.conf

And the postgresql.conf doesn't offer any clues on where I'd be able to find it otherwise – so I'm assuming it doesn't get to writing it, perhaps?

Any advice on how to get the server to progress beyond this point is welcome. I wouldn't mind losing whatever data it's trying to consolidate from the WAL redo – I'm mainly interested in getting the system back up and running.

Best Answer

As user @user1686 pointed out correctly, the logs for PostgreSQL on Ubuntu are to be found in /var/log/postgresql, called postgresql-11-main.log as I was interested in the main database and am running PostgreSQL 11.

In that log file the following was recorded:

2023-01-10 10:36:39.271 AEST [11564] LOG:  database system was interrupted while in recovery at 2023-01-09 16:24:10 AEST
2023-01-10 10:36:39.271 AEST [11564] HINT:  This probably means that some data is corrupted and you will have to use the last backup for recovery.
2023-01-10 10:36:39.315 AEST [11564] LOG:  could not open directory "pg_tblspc/17293/lost+found": Permission denied
2023-01-10 10:36:39.319 AEST [11564] LOG:  database system was not properly shut down; automatic recovery in progress
2023-01-10 10:36:39.323 AEST [11564] LOG:  redo starts at 7F/1FA68D0
2023-01-10 10:36:39.335 AEST [11564] FATAL:  could not extend file "pg_tblspc/17293/PG_11_201809051/16386/30550227": No space left on device

This explains why the redo fails, there is no space on the device holding pg_tblspc/17293.

After the following:

sudo ls /var/lib/postgresql/11/main/pg_tblspc -alF
...
lrwxrwxrwx  1 postgres postgres   11 Sep  4  2019 17293 -> /trace_data/

And:

ubuntu@ip-<my-ip-here>:~$ df /trace_data/
Filesystem     1K-blocks      Used Available Use% Mounted on
/dev/xvdf      206293688 196609516         0 100% /trace_data
ubuntu@ip-<my-ip-here>:~$ lsblk
NAME    MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
xvda    202:0    0  100G  0 disk
└─xvda1 202:1    0  100G  0 part /
xvdf    202:80   0  200G  0 disk /trace_data

I realised that I had to increase the size of the xvdf volume, which I was able to do through increasing the size of the volume on the VM, and then grow the file system using sudo resize2fs /dev/xvdf (as it was an ext4 file system). After that, I was able to restart the PostgreSQL cluster, and resolve my further issues.

Best Answer

Related Solutions

Related Question