PostgreSQL – Replica Recovery Process Stuck frequently (epoll_wait taking too much time)

master-slave-replicationpostgresql

We have Master Slave Setup with Asynchronous Streaming Replication (Replication Slots). Frequently our Replica's entire Postgres processes seems to be doing nothing (or stuck?). Since replication_slots is used for replication, this has been affecting the master with pg_xlog getting backlogged.

postgres=# select version();
                                                               version                                                                
--------------------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 9.6.11 on x86_64-pc-linux-gnu (Ubuntu 9.6.11-1.pgdg18.04+1), compiled by gcc (Ubuntu 7.3.0-27ubuntu1~18.04) 7.3.0, 64-bit
(1 row)

As per Slave postgres log, last received wal is processed and the next restartpoint is not starting. No logs after this. Just simply stuck. After restarting, replica starts to catch up, but it again gets stuck after some time.

2018-11-23 12:47:26.625 UTC [4193] LOG:  restartpoint starting: xlog
2018-11-23 12:49:48.258 UTC [4193] LOG:  restartpoint complete: wrote 10057 buffers (10.3%); 0 transaction log file(s) added, 198 removed, 0 recycled; write=130.568 s, sync=0.702 s, total=141.633 s; sync files=1874, longest=0.205 s, average=0.000 s; distance=5435211 kB, estimate=5435211 kB
2018-11-23 12:49:48.258 UTC [4193] LOG:  recovery restart point at 1143/8B16D800
2018-11-23 12:49:48.258 UTC [4193] DETAIL:  last completed transaction was at log time 2018-11-23 07:00:19.510771+00
2018-11-23 12:51:04.092 UTC [4193] LOG:  restartpoint starting: xlog
2018-11-23 12:55:34.744 UTC [4193] LOG:  restartpoint complete: wrote 3086 buffers (3.2%); 0 transaction log file(s) added, 120 removed, 212 recycled; write=269.875 s, sync=0.029 s, total=270.652 s; sync files=4955, longest=0.000 s, average=0.000 s; distance=3238852 kB, estimate=5215575 kB
2018-11-23 12:55:34.744 UTC [4193] LOG:  recovery restart point at 1144/50C5EA00
2018-11-23 12:55:34.744 UTC [4193] DETAIL:  last completed transaction was at log time 2018-11-23 07:04:51.819643+00

I analysed trace of Slave Postgres Recovery process during both active and stuck mode, and seems that during stuck mode, recovery process is doing very minimal activity (or stuck).

perf trace -p pid of Slave Recovery Process

         ? (     ?   ):  ... [continued]: epoll_wait()) = 0
     0.048 ( 0.023 ms): close(fd: 992<anon_inode:[eventpoll]>                                 ) = 0
     0.100 ( 0.022 ms): read(fd: 8<pipe:[1567713]>, buf: 0x7fff2ec83e07, count: 1             ) = -1 EAGAIN Resource temporarily unavailable
     0.146 ( 0.024 ms): epoll_create1(flags: 524288                                           ) = 992
     0.192 ( 0.025 ms): epoll_ctl(epfd: 992<anon_inode:[eventpoll]>, op: ADD, fd: 12<pipe:[1568078]>, event: 0x7fff2ec83d7c) = 0
     0.240 ( 0.023 ms): epoll_ctl(epfd: 992<anon_inode:[eventpoll]>, op: ADD, fd: 8<pipe:[1567713]>, event: 0x7fff2ec83d7c) = 0
     0.284 ( 0.020 ms): gettimeofday(tv: 0x7fff2ec83d50                                       ) = 0
     0.323 (5003.804 ms): epoll_wait(epfd: 992<anon_inode:[eventpoll]>, events: 0x55fb9372ae08, maxevents: 1, timeout: 5000) = 0
  5004.184 ( 0.023 ms): close(fd: 992<anon_inode:[eventpoll]>                                 ) = 0
  5004.233 ( 0.023 ms): read(fd: 8<pipe:[1567713]>, buf: 0x7fff2ec83e07, count: 1             ) = -1 EAGAIN Resource temporarily unavailable
  5004.281 ( 0.025 ms): epoll_create1(flags: 524288                                           ) = 992
  5004.329 ( 0.025 ms): epoll_ctl(epfd: 992<anon_inode:[eventpoll]>, op: ADD, fd: 12<pipe:[1568078]>, event: 0x7fff2ec83d7c) = 0
  5004.374 ( 0.021 ms): epoll_ctl(epfd: 992<anon_inode:[eventpoll]>, op: ADD, fd: 8<pipe:[1567713]>, event: 0x7fff2ec83d7c) = 0
  5004.416 ( 0.033 ms): gettimeofday(tv: 0x7fff2ec83d50                                       ) = 0
  5004.465 (5005.223 ms): epoll_wait(epfd: 992<anon_inode:[eventpoll]>, events: 0x55fb9372ae08, maxevents: 1, timeout: 5000) = 0
 10009.736 ( 0.025 ms): close(fd: 992<anon_inode:[eventpoll]>                                 ) = 0
 10009.788 ( 0.024 ms): read(fd: 8<pipe:[1567713]>, buf: 0x7fff2ec83e07, count: 1             ) = -1 EAGAIN Resource temporarily unavailable
 10009.838 ( 0.025 ms): epoll_create1(flags: 524288                                           ) = 992
 10009.887 ( 0.075 ms): epoll_ctl(epfd: 992<anon_inode:[eventpoll]>, op: ADD, fd: 12<pipe:[1568078]>, event: 0x7fff2ec83d7c) = 0
 10010.004 ( 0.025 ms): epoll_ctl(epfd: 992<anon_inode:[eventpoll]>, op: ADD, fd: 8<pipe:[1567713]>, event: 0x7fff2ec83d7c) = 0
 10010.114 ( 0.032 ms): gettimeofday(tv: 0x7fff2ec83d50                                       ) = 0
 10010.205 (5005.241 ms): epoll_wait(epfd: 992<anon_inode:[eventpoll]>, events: 0x55fb9372ae08, maxevents: 1, timeout: 5000) = 0
 15015.499 ( 0.024 ms): close(fd: 992<anon_inode:[eventpoll]>                                 ) = 0
 15015.554 ( 0.024 ms): read(fd: 8<pipe:[1567713]>, buf: 0x7fff2ec83e07, count: 1             ) = -1 EAGAIN Resource temporarily unavailable
 15015.604 ( 0.028 ms): epoll_create1(flags: 524288                                           ) = 992
 15015.692 ( 0.104 ms): epoll_ctl(epfd: 992<anon_inode:[eventpoll]>, op: ADD, fd: 12<pipe:[1568078]>, event: 0x7fff2ec83d7c) = 0
 15015.824 ( 0.023 ms): epoll_ctl(epfd: 992<anon_inode:[eventpoll]>, op: ADD, fd: 8<pipe:[1567713]>, event: 0x7fff2ec83d7c) = 0
 15015.871 ( 0.025 ms): gettimeofday(tv: 0x7fff2ec83d50                                       ) = 0
 15015.919 (5000.618 ms): epoll_wait(epfd: 992<anon_inode:[eventpoll]>, events: 0x55fb9372ae08, maxevents: 1, timeout: 5000) = 0
 20016.588 ( 0.022 ms): close(fd: 992<anon_inode:[eventpoll]>                                 ) = 0
 20016.641 ( 0.024 ms): read(fd: 8<pipe:[1567713]>, buf: 0x7fff2ec83e07, count: 1             ) = -1 EAGAIN Resource temporarily unavailable
 20016.691 ( 0.027 ms): epoll_create1(flags: 524288                                           ) = 992
 20016.744 ( 0.022 ms): epoll_ctl(epfd: 992<anon_inode:[eventpoll]>, op: ADD, fd: 12<pipe:[1568078]>, event: 0x7fff2ec83d7c) = 0
 20016.787 ( 0.021 ms): epoll_ctl(epfd: 992<anon_inode:[eventpoll]>, op: ADD, fd: 8<pipe:[1567713]>, event: 0x7fff2ec83d7c) = 0
 20016.831 ( 0.020 ms): gettimeofday(tv: 0x7fff2ec83d50                                       ) = 0
 20016.867 (5005.231 ms): epoll_wait(epfd: 992<anon_inode:[eventpoll]>, events: 0x55fb9372ae08, maxevents: 1, timeout: 5000) = 0
 25022.142 ( 0.023 ms): close(fd: 992<anon_inode:[eventpoll]>                                 ) = 0
 25022.196 ( 0.024 ms): read(fd: 8<pipe:[1567713]>, buf: 0x7fff2ec83e07, count: 1             ) = -1 EAGAIN Resource temporarily unavailable
 25022.246 ( 0.028 ms): epoll_create1(flags: 524288                                           ) = 992
 25022.296 ( 0.023 ms): epoll_ctl(epfd: 992<anon_inode:[eventpoll]>, op: ADD, fd: 12<pipe:[1568078]>, event: 0x7fff2ec83d7c) = 0
 25022.340 ( 0.023 ms): epoll_ctl(epfd: 992<anon_inode:[eventpoll]>, op: ADD, fd: 8<pipe:[1567713]>, event: 0x7fff2ec83d7c) = 0
 25022.388 ( 0.020 ms): gettimeofday(tv: 0x7fff2ec83d50                                       ) = 0
 25022.428 (5005.228 ms): epoll_wait(epfd: 992<anon_inode:[eventpoll]>, events: 0x55fb9372ae08, maxevents: 1, timeout: 5000) = 0
 30027.705 ( 0.025 ms): close(fd: 992<anon_inode:[eventpoll]>                                 ) = 0
 30027.761 ( 0.025 ms): read(fd: 8<pipe:[1567713]>, buf: 0x7fff2ec83e07, count: 1             ) = -1 EAGAIN Resource temporarily unavailable
 30027.811 ( 0.028 ms): epoll_create1(flags: 524288                                           ) = 992
 30027.864 ( 0.022 ms): epoll_ctl(epfd: 992<anon_inode:[eventpoll]>, op: ADD, fd: 12<pipe:[1568078]>, event: 0x7fff2ec83d7c) = 0
 30027.907 ( 0.020 ms): epoll_ctl(epfd: 992<anon_inode:[eventpoll]>, op: ADD, fd: 8<pipe:[1567713]>, event: 0x7fff2ec83d7c) = 0
 30027.949 ( 0.022 ms): gettimeofday(tv: 0x7fff2ec83d50                                       ) = 0
 30027.988 (5005.233 ms): epoll_wait(epfd: 992<anon_inode:[eventpoll]>, events: 0x55fb9372ae08, maxevents: 1, timeout: 5000) = 0
 35033.275 ( 0.025 ms): close(fd: 992<anon_inode:[eventpoll]>                                 ) = 0
 35033.333 ( 0.024 ms): read(fd: 8<pipe:[1567713]>, buf: 0x7fff2ec83e07, count: 1             ) = -1 EAGAIN Resource temporarily unavailable
 35033.382 ( 0.029 ms): epoll_create1(flags: 524288                                           ) = 992
 35033.435 ( 0.077 ms): epoll_ctl(epfd: 992<anon_inode:[eventpoll]>, op: ADD, fd: 12<pipe:[1568078]>, event: 0x7fff2ec83d7c) = 0
 35033.588 ( 0.046 ms): epoll_ctl(epfd: 992<anon_inode:[eventpoll]>, op: ADD, fd: 8<pipe:[1567713]>, event: 0x7fff2ec83d7c) = 0
 35033.663 ( 0.023 ms): gettimeofday(tv: 0x7fff2ec83d50                                       ) = 0
 35033.762 (5005.255 ms): epoll_wait(epfd: 992<anon_inode:[eventpoll]>, events: 0x55fb9372ae08, maxevents: 1, timeout: 5000) = 0
 40039.070 ( 0.026 ms): close(fd: 992<anon_inode:[eventpoll]>                                 ) = 0
 40039.129 ( 0.025 ms): read(fd: 8<pipe:[1567713]>, buf: 0x7fff2ec83e07, count: 1             ) = -1 EAGAIN Resource temporarily unavailable
 40039.179 ( 0.031 ms): epoll_create1(flags: 524288                                           ) = 992
 40039.233 ( 0.078 ms): epoll_ctl(epfd: 992<anon_inode:[eventpoll]>, op: ADD, fd: 12<pipe:[1568078]>, event: 0x7fff2ec83d7c) = 0
 40039.384 ( 0.047 ms): epoll_ctl(epfd: 992<anon_inode:[eventpoll]>, op: ADD, fd: 8<pipe:[1567713]>, event: 0x7fff2ec83d7c) = 0
 40039.464 ( 0.023 ms): gettimeofday(tv: 0x7fff2ec83d50                                       ) = 0
 40039.562 (5005.258 ms): epoll_wait(epfd: 992<anon_inode:[eventpoll]>, events: 0x55fb9372ae08, maxevents: 1, timeout: 5000) = 0
 45044.876 ( 0.025 ms): close(fd: 992<anon_inode:[eventpoll]>                                 ) = 0
 45044.933 ( 0.024 ms): read(fd: 8<pipe:[1567713]>, buf: 0x7fff2ec83e07, count: 1             ) = -1 EAGAIN Resource temporarily unavailable
 45044.984 ( 0.029 ms): epoll_create1(flags: 524288                                           ) = 992
 45045.078 ( 0.118 ms): epoll_ctl(epfd: 992<anon_inode:[eventpoll]>, op: ADD, fd: 12<pipe:[1568078]>, event: 0x7fff2ec83d7c) = 0
 45045.225 ( 0.091 ms): epoll_ctl(epfd: 992<anon_inode:[eventpoll]>, op: ADD, fd: 8<pipe:[1567713]>, event: 0x7fff2ec83d7c) = 0
 45045.350 ( 0.027 ms): gettimeofday(tv: 0x7fff2ec83d50                                       ) = 0
 45045.441 (3034.775 ms): epoll_wait(epfd: 992<anon_inode:[eventpoll]>, events: 0x55fb9372ae08, maxevents: 1, timeout: 5000) = -1 EINTR Interrupted system call
 48080.277 ( 0.028 ms): write(fd: 13<pipe:[1568078]>, buf: 0x7fff2ec83737, count: 1           ) = 1
 48080.369 ( 0.021 ms): gettimeofday(tv: 0x7fff2ec83d60                                       ) = 0
 48080.413 ( 0.024 ms): close(fd: 992<anon_inode:[eventpoll]>                                 ) = 0
 48080.468 ( 0.023 ms): read(fd: 8<pipe:[1567713]>, buf: 0x7fff2ec83e07, count: 1             ) = -1 EAGAIN Resource temporarily unavailable
 48080.515 ( 0.023 ms): gettimeofday(tv: 0x7fff2ec83e00                                       ) = 0
 48080.561 ( 0.024 ms): lseek(fd: 3</mnt/PGPOOL/postgresql/9.6/main/pg_xlog/00000001000011430000002D>, offset: 2490368, whence: SET) = 2490368
 48080.607 ( 0.036 ms): read(fd: 3</mnt/PGPOOL/postgresql/9.6/main/pg_xlog/00000001000011430000002D>, buf: 0x55fb9371ca98, count: 8192) = 8192
 48080.675 ( 0.024 ms): read(fd: 8<pipe:[1567713]>, buf: 0x7fff2ec84457, count: 1             ) = -1 EAGAIN Resource temporarily unavailable
 48080.729 ( 0.045 ms): lseek(fd: 652</mnt/PGPOOL/postgresql/9.6/main/base/36176/5155783>, whence: END) = 300539904
 48080.798 ( 0.025 ms): lseek(fd: 652</mnt/PGPOOL/postgresql/9.6/main/base/36176/5155783>, whence: END) = 300539904
 48080.895 ( 0.053 ms): kill(pid: 11330 (postgres), sig: USR1                                 ) = 0
 48081.211 ( 0.132 ms): write(fd: 652</mnt/PGPOOL/postgresql/9.6/main/base/36176/5155783>, buf: 0x7f5d2e0cc500, count: 8192) = 8192
 48081.377 ( 0.025 ms): lseek(fd: 3</mnt/PGPOOL/postgresql/9.6/main/pg_xlog/00000001000011430000002D>, offset: 2498560, whence: SET) = 2498560
 48081.425 ( 0.032 ms): read(fd: 3</mnt/PGPOOL/postgresql/9.6/main/pg_xlog/00000001000011430000002D>, buf: 0x55fb9371ca98, count: 8192) = 8192
 48081.498 ( 0.043 ms): read(fd: 8<pipe:[1567713]>, buf: 0x7fff2ec84457, count: 1             ) = -1 EAGAIN Resource temporarily unavailable
 48081.576 ( 0.024 ms): lseek(fd: 652</mnt/PGPOOL/postgresql/9.6/main/base/36176/5155783>, whence: END) = 300548096
 48081.624 ( 0.024 ms): lseek(fd: 652</mnt/PGPOOL/postgresql/9.6/main/base/36176/5155783>, whence: END) = 300548096
 48081.687 ( 0.122 ms): write(fd: 652</mnt/PGPOOL/postgresql/9.6/main/base/36176/5155783>, buf: 0x7f5d2e0ce500, count: 8192) = 8192
 48081.853 ( 0.025 ms): lseek(fd: 3</mnt/PGPOOL/postgresql/9.6/main/pg_xlog/00000001000011430000002D>, offset: 2506752, whence: SET) = 2506752
 48081.902 ( 0.041 ms): read(fd: 3</mnt/PGPOOL/postgresql/9.6/main/pg_xlog/00000001000011430000002D>, buf: 0x55fb9371ca98, count: 8192) = 8192
 48081.985 ( 0.025 ms): read(fd: 8<pipe:[1567713]>, buf: 0x7fff2ec84457, count: 1             ) = -1 EAGAIN Resource temporarily unavailable
 48082.036 ( 0.049 ms): lseek(fd: 652</mnt/PGPOOL/postgresql/9.6/main/base/36176/5155783>, whence: END) = 300556288
 48082.117 ( 0.038 ms): lseek(fd: 652</mnt/PGPOOL/postgresql/9.6/main/base/36176/5155783>, whence: END) = 300556288
 48082.193 ( 0.064 ms): write(fd: 652</mnt/PGPOOL/postgresql/9.6/main/base/36176/5155783>, buf: 0x7f5d2e0d0500, count: 8192) = 8192
 48082.286 ( 0.024 ms): lseek(fd: 3</mnt/PGPOOL/postgresql/9.6/main/pg_xlog/00000001000011430000002D>, offset: 2514944, whence: SET) = 2514944
 48082.353 ( 0.039 ms): read(fd: 3</mnt/PGPOOL/postgresql/9.6/main/pg_xlog/00000001000011430000002D>, buf: 0x55fb9371ca98, count: 8192) = 8192
 48082.419 ( 0.024 ms): read(fd: 8<pipe:[1567713]>, buf: 0x7fff2ec84457, count: 1             ) = -1 EAGAIN Resource temporarily unavailable
 48082.467 ( 0.024 ms): lseek(fd: 652</mnt/PGPOOL/postgresql/9.6/main/base/36176/5155783>, whence: END) = 300564480
 48082.540 ( 0.034 ms): lseek(fd: 652</mnt/PGPOOL/postgresql/9.6/main/base/36176/5155783>, whence: END) = 300564480
 48082.613 ( 0.057 ms): write(fd: 652</mnt/PGPOOL/postgresql/9.6/main/base/36176/5155783>, buf: 0x7f5d2e0d2500, count: 8192) = 8192
 48082.723 ( 0.044 ms): read(fd: 8<pipe:[1567713]>, buf: 0x7fff2ec84457, count: 1             ) = -1 EAGAIN Resource temporarily unavailable
 48082.801 ( 0.040 ms): lseek(fd: 652</mnt/PGPOOL/postgresql/9.6/main/base/36176/5155783>, whence: END) = 300572672
 48082.874 ( 0.025 ms): lseek(fd: 652</mnt/PGPOOL/postgresql/9.6/main/base/36176/5155783>, whence: END) = 300572672
 48082.936 ( 0.061 ms): write(fd: 652</mnt/PGPOOL/postgresql/9.6/main/base/36176/5155783>, buf: 0x7f5d2e0d4500, count: 8192) = 8192
 48083.026 ( 0.024 ms): lseek(fd: 3</mnt/PGPOOL/postgresql/9.6/main/pg_xlog/00000001000011430000002D>, offset: 2523136, whence: SET) = 2523136
 48083.073 ( 0.054 ms): read(fd: 3</mnt/PGPOOL/postgresql/9.6/main/pg_xlog/00000001000011430000002D>, buf: 0x55fb9371ca98, count: 8192) = 8192
 48083.156 ( 0.040 ms): read(fd: 8<pipe:[1567713]>, buf: 0x7fff2ec84457, count: 1             ) = -1 EAGAIN Resource temporarily unavailable
 48083.229 ( 0.033 ms): lseek(fd: 652</mnt/PGPOOL/postgresql/9.6/main/base/36176/5155783>, whence: END) = 300580864
 48083.295 ( 0.038 ms): lseek(fd: 652</mnt/PGPOOL/postgresql/9.6/main/base/36176/5155783>, whence: END) = 300580864
 48083.385 ( 0.085 ms): write(fd: 652</mnt/PGPOOL/postgresql/9.6/main/base/36176/5155783>, buf: 0x7f5d2e0d6500, count: 8192) = 8192
 48083.507 ( 0.045 ms): lseek(fd: 3</mnt/PGPOOL/postgresql/9.6/main/pg_xlog/00000001000011430000002D>, offset: 2531328, whence: SET) = 2531328
 48083.597 ( 0.039 ms): read(fd: 3</mnt/PGPOOL/postgresql/9.6/main/pg_xlog/00000001000011430000002D>, buf: 0x55fb9371ca98, count: 8192) = 8192
 48083.674 ( 0.033 ms): read(fd: 8<pipe:[1567713]>, buf: 0x7fff2ec84457, count: 1             ) = -1 EAGAIN Resource temporarily unavailable
 48083.754 ( 0.041 ms): lseek(fd: 652</mnt/PGPOOL/postgresql/9.6/main/base/36176/5155783>, whence: END) = 300589056
 48083.839 ( 0.048 ms): lseek(fd: 652</mnt/PGPOOL/postgresql/9.6/main/base/36176/5155783>, whence: END) = 300589056
 48083.955 ( 0.086 ms): write(fd: 652</mnt/PGPOOL/postgresql/9.6/main/base/36176/5155783>, buf: 0x7f5d2e0d8500, count: 8192) = 8192
 48084.078 ( 0.096 ms): lseek(fd: 3</mnt/PGPOOL/postgresql/9.6/main/pg_xlog/00000001000011430000002D>, offset: 2539520, whence: SET) = 2539520
 48084.213 ( 0.071 ms): read(fd: 3</mnt/PGPOOL/postgresql/9.6/main/pg_xlog/00000001000011430000002D>, buf: 0x55fb9371ca98, count: 8192) = 8192
 48084.355 ( 0.041 ms): read(fd: 8<pipe:[1567713]>, buf: 0x7fff2ec84457, count: 1             ) = -1 EAGAIN Resource temporarily unavailable
 48084.443 ( 0.052 ms): lseek(fd: 652</mnt/PGPOOL/postgresql/9.6/main/base/36176/5155783>, whence: END) = 300597248
 48084.528 ( 0.056 ms): lseek(fd: 652</mnt/PGPOOL/postgresql/9.6/main/base/36176/5155783>, whence: END) = 300597248
 48084.642 ( 0.103 ms): write(fd: 652</mnt/PGPOOL/postgresql/9.6/main/base/36176/5155783>, buf: 0x7f5d2e0da500, count: 8192) = 8192
 48084.809 ( 0.048 ms): lseek(fd: 3</mnt/PGPOOL/postgresql/9.6/main/pg_xlog/00000001000011430000002D>, offset: 2547712, whence: SET) = 2547712
 48084.901 ( 0.048 ms): read(fd: 3</mnt/PGPOOL/postgresql/9.6/main/pg_xlog/00000001000011430000002D>, buf: 0x55fb9371ca98, count: 8192) = 8192
 48085.015 ( 0.036 ms): read(fd: 8<pipe:[1567713]>, buf: 0x7fff2ec84457, count: 1             ) = -1 EAGAIN Resource temporarily unavailable
 48085.095 ( 0.032 ms): lseek(fd: 652</mnt/PGPOOL/postgresql/9.6/main/base/36176/5155783>, whence: END) = 300605440
 48085.164 ( 0.041 ms): lseek(fd: 652</mnt/PGPOOL/postgresql/9.6/main/base/36176/5155783>, whence: END) = 300605440
 48085.266 ( 0.103 ms): write(fd: 652</mnt/PGPOOL/postgresql/9.6/main/base/36176/5155783>, buf: 0x7f5d2e0dc500, count: 8192) = 8192
 48085.419 ( 0.047 ms): lseek(fd: 3</mnt/PGPOOL/postgresql/9.6/main/pg_xlog/00000001000011430000002D>, offset: 2555904, whence: SET) = 2555904
 48085.505 ( 0.064 ms): read(fd: 3</mnt/PGPOOL/postgresql/9.6/main/pg_xlog/00000001000011430000002D>, buf: 0x55fb9371ca98, count: 8192) = 8192
 48085.612 ( 0.036 ms): read(fd: 8<pipe:[1567713]>, buf: 0x7fff2ec84457, count: 1             ) = -1 EAGAIN Resource temporarily unavailable
 48085.707 ( 0.055 ms): lseek(fd: 652</mnt/PGPOOL/postgresql/9.6/main/base/36176/5155783>, whence: END) = 300613632
 48085.805 ( 0.052 ms): lseek(fd: 652</mnt/PGPOOL/postgresql/9.6/main/base/36176/5155783>, whence: END) = 300613632
 48085.893 ( 0.093 ms): write(fd: 652</mnt/PGPOOL/postgresql/9.6/main/base/36176/5155783>, buf: 0x7f5d2e0de500, count: 8192) = 8192
 48086.027 ( 0.048 ms): lseek(fd: 3</mnt/PGPOOL/postgresql/9.6/main/pg_xlog/00000001000011430000002D>, offset: 2564096, whence: SET) = 2564096
 48086.120 ( 0.071 ms): read(fd: 3</mnt/PGPOOL/postgresql/9.6/main/pg_xlog/00000001000011430000002D>, buf: 0x55fb9371ca98, count: 8192) = 8192
 48086.247 ( 0.048 ms): read(fd: 8<pipe:[1567713]>, buf: 0x7fff2ec84457, count: 1             ) = -1 EAGAIN Resource temporarily unavailable
 48086.342 ( 0.040 ms): lseek(fd: 652</mnt/PGPOOL/postgresql/9.6/main/base/36176/5155783>, whence: END) = 300621824
 48086.421 ( 0.041 ms): lseek(fd: 652</mnt/PGPOOL/postgresql/9.6/main/base/36176/5155783>, whence: END) = 300621824
 48086.499 ( 0.103 ms): write(fd: 652</mnt/PGPOOL/postgresql/9.6/main/base/36176/5155783>, buf: 0x7f5d2e0e0500, count: 8192) = 8192
 48086.641 ( 0.053 ms): lseek(fd: 3</mnt/PGPOOL/postgresql/9.6/main/pg_xlog/00000001000011430000002D>, offset: 2572288, whence: SET) = 2572288
 48086.740 ( 0.060 ms): read(fd: 3</mnt/PGPOOL/postgresql/9.6/main/pg_xlog/00000001000011430000002D>, buf: 0x55fb9371ca98, count: 8192) = 8192
 48086.871 ( 0.041 ms): read(fd: 8<pipe:[1567713]>, buf: 0x7fff2ec84457, count: 1             ) = -1 EAGAIN Resource temporarily unavailable
 48086.966 ( 0.026 ms): lseek(fd: 652</mnt/PGPOOL/postgresql/9.6/main/base/36176/5155783>, whence: END) = 300630016
 48087.025 ( 0.031 ms): lseek(fd: 652</mnt/PGPOOL/postgresql/9.6/main/base/36176/5155783>, whence: END) = 300630016
 48087.135 ( 0.084 ms): write(fd: 652</mnt/PGPOOL/postgresql/9.6/main/base/36176/5155783>, buf: 0x7f5d2e0e2500, count: 8192) = 8192
 48087.281 ( 0.055 ms): lseek(fd: 3</mnt/PGPOOL/postgresql/9.6/main/pg_xlog/00000001000011430000002D>, offset: 2580480, whence: SET) = 2580480
 48087.392 ( 0.061 ms): read(fd: 3</mnt/PGPOOL/postgresql/9.6/main/pg_xlog/00000001000011430000002D>, buf: 0x55fb9371ca98, count: 8192) = 8192
 48087.516 ( 0.040 ms): read(fd: 8<pipe:[1567713]>, buf: 0x7fff2ec84457, count: 1             ) = -1 EAGAIN Resource temporarily unavailable
 48087.594 ( 0.043 ms): lseek(fd: 652</mnt/PGPOOL/postgresql/9.6/main/base/36176/5155783>, whence: END) = 300638208
 48087.669 ( 0.048 ms): lseek(fd: 652</mnt/PGPOOL/postgresql/9.6/main/base/36176/5155783>, whence: END) = 300638208
 48087.768 ( 0.102 ms): write(fd: 652</mnt/PGPOOL/postgresql/9.6/main/base/36176/5155783>, buf: 0x7f5d2e0e4500, count: 8192) = 8192
 48087.915 ( 0.057 ms): lseek(fd: 3</mnt/PGPOOL/postgresql/9.6/main/pg_xlog/00000001000011430000002D>, offset: 2588672, whence: SET) = 2588672
 48088.026 ( 0.059 ms): read(fd: 3</mnt/PGPOOL/postgresql/9.6/main/pg_xlog/00000001000011430000002D>, buf: 0x55fb9371ca98, count: 8192) = 8192
 48088.156 ( 0.041 ms): read(fd: 8<pipe:[1567713]>, buf: 0x7fff2ec84457, count: 1             ) = -1 EAGAIN Resource temporarily unavailable
 48088.244 ( 0.032 ms): lseek(fd: 652</mnt/PGPOOL/postgresql/9.6/main/base/36176/5155783>, whence: END) = 300646400
 48088.321 ( 0.054 ms): lseek(fd: 652</mnt/PGPOOL/postgresql/9.6/main/base/36176/5155783>, whence: END) = 300646400
 48088.413 ( 0.102 ms): write(fd: 652</mnt/PGPOOL/postgresql/9.6/main/base/36176/5155783>, buf: 0x7f5d2e0e6500, count: 8192) = 8192
 48088.554 ( 0.046 ms): lseek(fd: 3</mnt/PGPOOL/postgresql/9.6/main/pg_xlog/00000001000011430000002D>, offset: 2596864, whence: SET) = 2596864
 48088.655 ( 0.061 ms): read(fd: 3</mnt/PGPOOL/postgresql/9.6/main/pg_xlog/00000001000011430000002D>, buf: 0x55fb9371ca98, count: 8192) = 8192
 48088.779 ( 0.041 ms): read(fd: 8<pipe:[1567713]>, buf: 0x7fff2ec84457, count: 1             ) = -1 EAGAIN Resource temporarily unavailable
 48088.865 ( 0.056 ms): lseek(fd: 652</mnt/PGPOOL/postgresql/9.6/main/base/36176/5155783>, whence: END) = 300654592
 48088.954 ( 0.036 ms): lseek(fd: 652</mnt/PGPOOL/postgresql/9.6/main/base/36176/5155783>, whence: END) = 300654592
 48089.049 ( 0.050 ms): write(fd: 652</mnt/PGPOOL/postgresql/9.6/main/base/36176/5155783>, buf: 0x7f5d2e0e8500, count: 8192) = 8192
 48089.153 ( 0.037 ms): read(fd: 8<pipe:[1567713]>, buf: 0x7fff2ec84457, count: 1             ) = -1 EAGAIN Resource temporarily unavailable
 48089.261 ( 0.049 ms): lseek(fd: 652</mnt/PGPOOL/postgresql/9.6/main/base/36176/5155783>, whence: END) = 300662784
 48089.355 ( 0.071 ms): lseek(fd: 652</mnt/PGPOOL/postgresql/9.6/main/base/36176/5155783>, whence: END) = 300662784
 48089.463 ( 0.083 ms): write(fd: 652</mnt/PGPOOL/postgresql/9.6/main/base/36176/5155783>, buf: 0x7f5d2e0ea500, count: 8192) = 8192
 48089.616 ( 0.054 ms): lseek(fd: 3</mnt/PGPOOL/postgresql/9.6/main/pg_xlog/00000001000011430000002D>, offset: 2605056, whence: SET) = 2605056
 48089.724 ( 0.064 ms): read(fd: 3</mnt/PGPOOL/postgresql/9.6/main/pg_xlog/00000001000011430000002D>, buf: 0x55fb9371ca98, count: 8192) = 8192
 48089.847 ( 0.044 ms): read(fd: 8<pipe:[1567713]>, buf: 0x7fff2ec84457, count: 1             ) = -1 EAGAIN Resource temporarily unavailable
 48089.951 ( 0.051 ms): lseek(fd: 652</mnt/PGPOOL/postgresql/9.6/main/base/36176/5155783>, whence: END) = 300670976
 48090.046 ( 0.044 ms): lseek(fd: 652</mnt/PGPOOL/postgresql/9.6/main/base/36176/5155783>, whence: END) = 300670976
 48090.150 ( 0.089 ms): write(fd: 652</mnt/PGPOOL/postgresql/9.6/main/base/36176/5155783>, buf: 0x7f5d2e0ec500, count: 8192) = 8192
 48090.297 ( 0.045 ms): lseek(fd: 3</mnt/PGPOOL/postgresql/9.6/main/pg_xlog/00000001000011430000002D>, offset: 2613248, whence: SET) = 2613248
 48090.393 ( 0.039 ms): read(fd: 3</mnt/PGPOOL/postgresql/9.6/main/pg_xlog/00000001000011430000002D>, buf: 0x55fb9371ca98, count: 8192) = 8192
 48090.483 ( 0.038 ms): read(fd: 8<pipe:[1567713]>, buf: 0x7fff2ec84457, count: 1             ) = -1 EAGAIN Resource temporarily unavailable
 48090.574 ( 0.056 ms): lseek(fd: 652</mnt/PGPOOL/postgresql/9.6/main/base/36176/5155783>, whence: END) = 300679168
 48090.672 ( 0.046 ms): lseek(fd: 652</mnt/PGPOOL/postgresql/9.6/main/base/36176/5155783>, whence: END) = 300679168
 48090.786 ( 0.182 ms): write(fd: 652</mnt/PGPOOL/postgresql/9.6/main/base/36176/5155783>, buf: 0x7f5d2e0ee500, count: 8192) = 8192
 48091.030 ( 0.059 ms): epoll_create1(flags: 524288                                           ) = 992
 48091.135 ( 0.056 ms): epoll_ctl(epfd: 992<anon_inode:[eventpoll]>, op: ADD, fd: 12<pipe:[1568078]>, event: 0x7fff2ec83d7c) = 0
 48091.235 ( 0.056 ms): epoll_ctl(epfd: 992<anon_inode:[eventpoll]>, op: ADD, fd: 8<pipe:[1567713]>, event: 0x7fff2ec83d7c) = 0
 48091.336 ( 0.051 ms): gettimeofday(tv: 0x7fff2ec83d50                                       ) = 0
 48091.431 ( 0.055 ms): epoll_wait(epfd: 992<anon_inode:[eventpoll]>, events: 0x55fb9372ae08, maxevents: 1, timeout: 5000) = 1
 48091.542 ( 0.034 ms): read(fd: 12<pipe:[1568078]>, buf: 0x7fff2ec83d70, count: 16           ) = 1
 48091.629 ( 0.044 ms): gettimeofday(tv: 0x7fff2ec83d60                                       ) = 0
 48091.717 (5005.261 ms): epoll_wait(epfd: 992<anon_inode:[eventpoll]>, events: 0x55fb9372ae08, maxevents: 1, timeout: 5000) = 0
 53097.027 ( 0.023 ms): close(fd: 992<anon_inode:[eventpoll]>                                 ) = 0
 53097.079 ( 0.024 ms): read(fd: 8<pipe:[1567713]>, buf: 0x7fff2ec83e07, count: 1             ) = -1 EAGAIN Resource temporarily unavailable
 53097.127 ( 0.026 ms): epoll_create1(flags: 524288                                           ) = 992
 53097.177 ( 0.025 ms): epoll_ctl(epfd: 992<anon_inode:[eventpoll]>, op: ADD, fd: 12<pipe:[1568078]>, event: 0x7fff2ec83d7c) = 0
 53097.222 ( 0.021 ms): epoll_ctl(epfd: 992<anon_inode:[eventpoll]>, op: ADD, fd: 8<pipe:[1567713]>, event: 0x7fff2ec83d7c) = 0
 53097.264 ( 0.033 ms): gettimeofday(tv: 0x7fff2ec83d50                                       ) = 0
 53097.313 (5005.226 ms): epoll_wait(epfd: 992<anon_inode:[eventpoll]>, events: 0x55fb9372ae08, maxevents: 1, timeout: 5000) = 0
 58102.587 ( 0.022 ms): close(fd: 992<anon_inode:[eventpoll]>                                 ) = 0
 58102.638 ( 0.030 ms): read(fd: 8<pipe:[1567713]>, buf: 0x7fff2ec83e07, count: 1             ) = -1 EAGAIN Resource temporarily unavailable
 58102.690 ( 0.024 ms): epoll_create1(flags: 524288                                           ) = 992
 58102.746 ( 0.025 ms): epoll_ctl(epfd: 992<anon_inode:[eventpoll]>, op: ADD, fd: 12<pipe:[1568078]>, event: 0x7fff2ec83d7c) = 0
 58102.795 ( 0.033 ms): epoll_ctl(epfd: 992<anon_inode:[eventpoll]>, op: ADD, fd: 8<pipe:[1567713]>, event: 0x7fff2ec83d7c) = 0
 58102.850 ( 0.022 ms): gettimeofday(tv: 0x7fff2ec83d50                                       ) = 0
 58102.905 (5005.237 ms): epoll_wait(epfd: 992<anon_inode:[eventpoll]>, events: 0x55fb9372ae08, maxevents: 1, timeout: 5000) = 0
 63108.190 ( 0.023 ms): close(fd: 992<anon_inode:[eventpoll]>                                 ) = 0
 63108.243 ( 0.025 ms): read(fd: 8<pipe:[1567713]>, buf: 0x7fff2ec83e07, count: 1             ) = -1 EAGAIN Resource temporarily unavailable
 63108.293 ( 0.026 ms): epoll_create1(flags: 524288                                           ) = 992
 63108.342 ( 0.027 ms): epoll_ctl(epfd: 992<anon_inode:[eventpoll]>, op: ADD, fd: 12<pipe:[1568078]>, event: 0x7fff2ec83d7c) = 0
 63108.389 ( 0.021 ms): epoll_ctl(epfd: 992<anon_inode:[eventpoll]>, op: ADD, fd: 8<pipe:[1567713]>, event: 0x7fff2ec83d7c) = 0
 63108.431 ( 0.032 ms): gettimeofday(tv: 0x7fff2ec83d50                                       ) = 0

Seems like epoll_wait is taking too much time of each operation. Is there anything needs to be done to address that? Or any other detailed stacktrace needs to be analysed?

Infrastructure

  • AWS EC2 r4.large
  • ZFS FileSystem
  • RAID0 – 8 x 17GB
  • Number of databases – 130+
  • Heavy ETL Process

Best Answer

I think you have misdiagnosed your problem. Even if replay is stuck, the replica should still receive WAL from the master, advancing the slot allowing WAL on the master to be recycled. It is pg_xlog on the replica, not on the master, which will bloat with the accumulated WAL files waiting for replay.

The trace you show is typical behavior of a replica which is idle because it is all caught up and the master is also idle and so there is no new WAL arriving.

Whatever is causing your problem on the master, it is unlikely to be this replica.

Related Question