Postgresql – Postgres Streaming Replication lagging, using lots of CPU and little I/O

postgresql, replication

We have 2 dedicated DB servers. One is the master and the other is its streaming slave.
Postgres version 9.0.17.

We have similar hardware on both servers, using 15k RPM SAS drives in RAID 10 for the database cluster. The only difference is that the master has 128 GB of RAM while the slave has 64 GB.

The issue we are facing is that our streaming slave is lagging behind.

On the streaming slave we have

postgres=# select pg_last_xlog_receive_location();
 pg_last_xlog_receive_location
-------------------------------
 1F7D/DD3BA000


postgres=# select pg_last_xlog_replay_location();
 pg_last_xlog_replay_location
------------------------------
 1F75/5BE08980

The difference between the WAL receive and replay locations is constantly increasing.
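To put a number on it, here is a rough sketch of the lag in bytes, assuming the pre-9.3 LSN layout with the default 16 MB WAL segments (255 segments per xlogid); on 9.2+ pg_xlog_location_diff() would do this directly:

-- Approximate replay lag in bytes on the standby (pre-9.3 LSN arithmetic)
SELECT ( ('x' || lpad(split_part(receive, '/', 1), 16, '0'))::bit(64)::bigint
       - ('x' || lpad(split_part(replay,  '/', 1), 16, '0'))::bit(64)::bigint
       ) * 255 * 16 * 1024 * 1024
     + ( ('x' || lpad(split_part(receive, '/', 2), 16, '0'))::bit(64)::bigint
       - ('x' || lpad(split_part(replay,  '/', 2), 16, '0'))::bit(64)::bigint
       ) AS replay_lag_bytes
FROM (SELECT pg_last_xlog_receive_location()::text AS receive,
             pg_last_xlog_replay_location()::text  AS replay) AS s;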

Please let me know how this issue can be resolved.

The output of perf top -u postgres is:

98.30%  postgres          [.] DropRelFileNodeBuffers
0.20%  postgres          [.] 0x000000000008b05b
0.14%  [kernel]          [k] copy_user_generic_string
0.13%  postgres          [.] hash_seq_search
0.07%  postgres          [.] hash_search_with_hash_value
0.05%  [kernel]          [k] _spin_lock
0.05%  [kernel]          [k] apic_timer_interrupt
0.03%  [kernel]          [k] do_timer
0.03%  postgres          [.] RememberFsyncRequest
0.03%  [kernel]          [k] __sb_start_write
0.02%  [jbd2]            [k] do_get_write_access
0.02%  [kernel]          [k] unroll_tree_refs
0.02%  [cciss]           [k] do_cciss_request
0.02%  [ext4]            [k] ext4_check_dir_entry
0.02%  [kernel]          [k] path_walk
0.02%  [kernel]          [k] scheduler_tick
0.02%  postgres          [.] LWLockAcquire
0.02%  [kernel]          [k] dyntick_save_progress_counter
0.02%  [kernel]          [k] _spin_lock_irqsave
0.02%  [kernel]          [k] audit_syscall_entry
0.02%  [ext4]            [k] ext4_journal_start_sb
0.02%  [kernel]          [k] radix_tree_range_tag_if_tagged
0.01%  [kernel]          [k] sys_getppid
0.01%  [kernel]          [k] native_read_tsc
0.01%  libc-2.12.so      [.] _IO_str_init_static_internal
0.01%  [kernel]          [k] __getblk
0.01%  [kernel]          [k] run_timer_softirq
0.01%  [kernel]          [k] system_call
0.01%  postgres          [.] __errno_location@plt
0.01%  libc-2.12.so      [.] __GI___libc_lseek64
0.01%  postgres          [.] XLogReadBufferExtended
0.01%  [kernel]          [k] memcpy
0.01%  [kernel]          [k] irq_work_run
0.01%  postgres          [.] fsm_set_avail
0.01%  [kernel]          [k] kmem_cache_free
0.01%  [kernel]          [k] do_unlinkat
0.01%  libc-2.12.so      [.] vfprintf
0.01%  postgres          [.] hash_any
0.01%  [kernel]          [k] sysret_check
0.01%  [kernel]          [k] mutex_lock
0.01%  [kernel]          [k] printk_tick
0.01%  postgres          [.] elog_start
0.01%  [kernel]          [k] __percpu_counter_add
0.01%  [kernel]          [k] file_update_time
0.01%  [kernel]          [k] enqueue_task_fair
0.01%  [kernel]          [k] blk_recount_segments
0.01%  [kernel]          [k] rcu_check_callbacks
0.01%  [kernel]          [k] put_pid
0.01%  postgres          [.] mdunlink
0.01%  postgres          [.] LWLockRelease
0.01%  [kernel]          [k] current_kernel_time
0.01%  [kernel]          [k] perf_event_task_tick
0.01%  [kernel]          [k] native_apic_mem_write

Output of gdb for the postgres startup process:

(gdb) thread apply all backtrace
Thread 1 (Thread 0x7fd06d28f7c0 (LWP 26949)):
#0  0x00000000005ec320 in DropRelFileNodeBuffers ()
#1  0x0000000000603115 in smgrdounlink ()
#2  0x000000000047d992 in ?? ()
#3  0x000000000048eca9 in StartupXLOG ()
#4  0x0000000000490c58 in StartupProcessMain ()
#5  0x00000000004a6faf in AuxiliaryProcessMain ()
#6  0x00000000005d31c3 in ?? ()
#7  0x00000000005d7372 in PostmasterMain ()
#8  0x000000000057e320 in main ()

Best Answer

Current PostgreSQL releases (at least up to and including 9.4) have single-threaded WAL recovery.

This means that replay of the write-ahead log happens in a single recovery process, which can benefit much less from I/O concurrency than a normally running master. As a result, WAL replay can lag behind when the replica and master have similar hardware, even though you'd expect the master (which is generally also under more load) to be the slower server.

(It'd be great to improve this, but nobody has wanted it enough to do the work yet, and nobody seems keen to fund it at the moment.)

That said, your case of very high CPU use doesn't really fit a problem of low I/O concurrency. In your situation I'd attach gdb and see what the replay process is doing, or use perf top to examine what the system as a whole is up to, then dig deeper with perf once I had some clues. I'd also look closely at iotop, vmstat, iostat, the PostgreSQL logs, dmesg, etc.
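For example, it's worth ruling out a receive-side (network or walreceiver) bottleneck first. A quick check, running the first query on the master and the second on the standby:

-- On the master: where WAL generation currently stands
SELECT pg_current_xlog_location();

-- On the standby: the last WAL location received from the master
SELECT pg_last_xlog_receive_location();

If those two stay close together while pg_last_xlog_replay_location() keeps falling behind, the bottleneck is replay itself rather than WAL transport.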

After investigation

Profiles revealed that most time was being spent in DropRelFileNodeBuffers.

That function does a linear scan through shared_buffers whenever a relfilenode is deleted, which is caused by TRUNCATE, DROP TABLE, CLUSTER, DROP INDEX, and so on. This must be done during WAL replay as well as on the main node.
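For illustration, these are the kinds of statements that drop relfilenodes and therefore trigger that scan during replay (all object names below are made up):

TRUNCATE TABLE staging_events;            -- assigns a new relfilenode and drops the old one
DROP TABLE IF EXISTS import_tmp;          -- drops the table's relfilenode
DROP INDEX IF EXISTS idx_old_stats;       -- drops the index's relfilenode
CLUSTER big_table USING big_table_pkey;   -- rewrites the table into a new relfilenode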

So this suggests that:

  • Your shared_buffers is probably very big; and
  • You're probably doing lots of operations that delete relfilenodes

Reducing shared_buffers on the replica may well help.
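As a quick sanity check on the standby (a sketch only, assuming the default 8 kB block size), this shows how many buffers each dropped relfilenode forces DropRelFileNodeBuffers to scan:

SHOW shared_buffers;

-- pg_settings reports shared_buffers in units of blocks,
-- i.e. the number of buffers scanned per dropped relfilenode
SELECT setting::bigint                        AS buffers_scanned_per_drop,
       pg_size_pretty(setting::bigint * 8192) AS shared_buffers_size
FROM pg_settings
WHERE name = 'shared_buffers';

Lowering shared_buffers in the replica's postgresql.conf (and restarting) shortens each of those scans proportionally.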