Postgresql 11 – service doesn’t start

postgresql

I use PG Admin 4 with PostgreSQL 11 for my data. After some kind of crash yesterday, I am not able to start postgresql anymore. I've done some research, but it doesn't look like a localhost, IP address, or port issue.

enter image description here

The log file states (in bold what I think is relevant):

2021-03-21 06:49:36.682 AEDT [11796] LOG:  database system was shut down at 2021-03-20 14:56:05 AEDT
2021-03-21 06:49:36.682 AEDT [11796] LOG:  invalid record length at 5B1/CA7F5900: wanted 24, got 0
2021-03-21 06:49:36.682 AEDT [11796] **LOG:  invalid primary checkpoint record**
2021-03-21 06:49:36.682 AEDT [11796] **PANIC:  could not locate a valid checkpoint record**
2021-03-21 06:49:36.890 AEDT [13136] LOG:  startup process (PID 11796) was terminated by exception 0xC0000409
2021-03-21 06:49:36.890 AEDT [13136] HINT:  See C include file "ntstatus.h" for a description of the hexadecimal value.
2021-03-21 06:49:36.890 AEDT [13136] LOG:  aborting startup due to startup process failure
2021-03-21 06:49:36.898 AEDT [13136] LOG:  database system is shut down

I have checked the postgresql.conf and pg_hba.conf files. They look fine and unchanged (allowing all users, having the right IP address, and port number).

postgresql.conf

# - Connection Settings -

listen_addresses = '*'      # what IP address(es) to listen on;
                    # comma-separated list of addresses;
                    # defaults to '*'; use '*' for all
                    # (change requires restart)
port = 5432             # (change requires restart)
max_connections = 100           # (change requires restart)
#superuser_reserved_connections = 3 # (change requires restart)
#unix_socket_directories = ''   # comma-separated list of directories
                    # (change requires restart)
#unix_socket_group = ''         # (change requires restart)
#unix_socket_permissions = 0777     # begin with 0 to use octal notation
                    # (change requires restart)
#bonjour = off              # advertise server via Bonjour
                    # (change requires restart)
#bonjour_name = ''          # defaults to the computer name
                    # (change requires restart)

# - TCP Keepalives -
# see "man 7 tcp" for details

#tcp_keepalives_idle = 0        # TCP_KEEPIDLE, in seconds;
                    # 0 selects the system default
#tcp_keepalives_interval = 0        # TCP_KEEPINTVL, in seconds;
                    # 0 selects the system default
#tcp_keepalives_count = 0       # TCP_KEEPCNT;
                    # 0 selects the system default

# - Authentication -

#authentication_timeout = 1min      # 1s-600s
#password_encryption = md5      # md5 or scram-sha-256
#db_user_namespace = off

**pg_hba.conf**
# IPv4 local connections:
host    all             all             127.0.0.1/32            md5
# IPv6 local connections:
host    all             all             ::1/128                 md5

My SSD hard drive, which contains the data directory of postgres seems fine (i.e. I can read / write and access all files). In an effort to fix my starting issue, I was trying to run pg_resetwal but unsuccessful because the program did not open although the postgres service stopped.

From the Windows logs:

Waiting for server startup...
LOG:  listening on IPv6 address "::", port 5432
LOG:  listening on IPv4 address "0.0.0.0", port 5432
LOG:  redirecting log output to logging collector process
HINT:  Future log output will appear in directory "log".

pgAdmin gives me the following (obviouldy I can't start postgresql…):

could not connect to server: Connection refused (0x0000274D/10061) 
Is the server running on host "localhost" (127.0.0.1) and accepting TCP/IP 
connections on port 5432? 

Do you have any idea what could be wrong here? Many thanks!

Best Answer

I'm sorry to have to say to you that it seems like you have corruption in your WAL files. Did you remove some WAL files? If so, copy them back to the WAL directory.

If you don't have the missing WAL files (and I see archiving is disabled, so you likely don't have them), data loss is inevitable.

My suggestion:

  • copy the entire cluster somewhere else (just to be able to restore should you make things worse)
  • use pg_resetwal to try to reset your wals (see documentation here and please read the whole page so you understand that this will lead to data loss)
  • try to start again
  • if it starts, if you don't have checksums enables (it's not the default), pg_dumpall everything (see documentation here) (redirect to /dev/null if it's too big) to check for any data corruption
  • if it does not start, I'm afraid you'll have to rely on backup, making the data loss even worse

I'm so sorry for the bad news...