Postgresql – Why does WAL on postgresql takes so long to replay

postgresql

Context:

I'm running postgresql 11 using the official postgresql docker image. The configuration is the default one without any tweak. Sometimes, my docker server crashes and kills all the container with it. As a result the postgresql gets stopped in a dirty way and postgresql has to replay the WALs. It takes more than 1h to replay the logs thought the server is mostly idling. I mean when the server crashed it hardly had any writes in the logs. And when I check in pg_wal I only see 65mb of data in the directory so it shouldn't take so much time to replay.

Here's some configurations:

max_wal_size: 1gb
min_wal_size: 80gb
checkpoint_timeout: 3min

So how come it take so long to restart? I have a different server with much less databases but similar configurations. Both are mostly idle all the time but the other one quickly restart while they have the same configuration.

Best Answer

Docker shouldn't be crashing all the time. Maybe whatever is making it do that is also making it very slow when it isn't crashed. Look in the log files for the OS and for docker (and for PostgreSQL) to see what they say.

Find the pid of the startup process, and do:

strace -y -ttt -T -p <pid>

And see what it is doing. This will show you, for example, if it is getting stalled on reads or writes or other system calls. If it isn't, you can also use gdb -p <pid> to see what it is doing internally. To get the most out of that, you will have to install the debug symbols, which I don't know how to do on the official docker image.