We have two dedicated database servers: one is the master and the other is its streaming-replication slave.
PostgreSQL version 9.0.17.
Both servers have similar hardware, using 15k RPM SAS drives in RAID 10 for the database cluster. The only difference is that the master has 128 GB of RAM while the slave has 64 GB.
The issue we are facing is that our streaming slave is lagging behind the master.
On the streaming slave we have:
postgres=# select pg_last_xlog_receive_location();
pg_last_xlog_receive_location
-------------------------------
1F7D/DD3BA000
postgres=# select pg_last_xlog_replay_location();
pg_last_xlog_replay_location
------------------------------
1F75/5BE08980
The difference between the WAL received and the WAL replayed is growing constantly.
How can this issue be resolved?
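For reference, the gap between those two positions can be converted to bytes. On 9.2+ you could use pg_xlog_location_diff(); on 9.0 the arithmetic has to be done by hand. A sketch (the helper name is made up, and it assumes the pre-9.3 encoding in which each logical xlog file, the part before the slash, holds 0xFF000000 bytes because the last 16 MB segment of each file is skipped):

```python
# Convert the two 'X/Y' LSNs above into a byte lag (pre-9.3 rules).
XLOG_FILE_SIZE_90 = 0xFF000000  # bytes of WAL per logical xlog file on 9.0

def lsn_to_bytes(lsn):
    """Hypothetical helper: 'X/Y' LSN string -> absolute byte position."""
    hi, lo = (int(part, 16) for part in lsn.split("/"))
    return hi * XLOG_FILE_SIZE_90 + lo

received = lsn_to_bytes("1F7D/DD3BA000")
replayed = lsn_to_bytes("1F75/5BE08980")
lag = received - replayed
print(lag, round(lag / 1024**3, 1))  # roughly 34 GiB of unreplayed WAL
```

So the replica is tens of gigabytes behind, not a few segments, which matches the report that the gap keeps growing.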
The output of perf top -u postgres is:
98.30% postgres [.] DropRelFileNodeBuffers
0.20% postgres [.] 0x000000000008b05b
0.14% [kernel] [k] copy_user_generic_string
0.13% postgres [.] hash_seq_search
0.07% postgres [.] hash_search_with_hash_value
0.05% [kernel] [k] _spin_lock
0.05% [kernel] [k] apic_timer_interrupt
0.03% [kernel] [k] do_timer
0.03% postgres [.] RememberFsyncRequest
0.03% [kernel] [k] __sb_start_write
0.02% [jbd2] [k] do_get_write_access
0.02% [kernel] [k] unroll_tree_refs
0.02% [cciss] [k] do_cciss_request
0.02% [ext4] [k] ext4_check_dir_entry
0.02% [kernel] [k] path_walk
0.02% [kernel] [k] scheduler_tick
0.02% postgres [.] LWLockAcquire
0.02% [kernel] [k] dyntick_save_progress_counter
0.02% [kernel] [k] _spin_lock_irqsave
0.02% [kernel] [k] audit_syscall_entry
0.02% [ext4] [k] ext4_journal_start_sb
0.02% [kernel] [k] radix_tree_range_tag_if_tagged
0.01% [kernel] [k] sys_getppid
0.01% [kernel] [k] native_read_tsc
0.01% libc-2.12.so [.] _IO_str_init_static_internal
0.01% [kernel] [k] __getblk
0.01% [kernel] [k] run_timer_softirq
0.01% [kernel] [k] system_call
0.01% postgres [.] __errno_location@plt
0.01% libc-2.12.so [.] __GI___libc_lseek64
0.01% postgres [.] XLogReadBufferExtended
0.01% [kernel] [k] memcpy
0.01% [kernel] [k] irq_work_run
0.01% postgres [.] fsm_set_avail
0.01% [kernel] [k] kmem_cache_free
0.01% [kernel] [k] do_unlinkat
0.01% libc-2.12.so [.] vfprintf
0.01% postgres [.] hash_any
0.01% [kernel] [k] sysret_check
0.01% [kernel] [k] mutex_lock
0.01% [kernel] [k] printk_tick
0.01% postgres [.] elog_start
0.01% [kernel] [k] __percpu_counter_add
0.01% [kernel] [k] file_update_time
0.01% [kernel] [k] enqueue_task_fair
0.01% [kernel] [k] blk_recount_segments
0.01% [kernel] [k] rcu_check_callbacks
0.01% [kernel] [k] put_pid
0.01% postgres [.] mdunlink
0.01% postgres [.] LWLockRelease
0.01% [kernel] [k] current_kernel_time
0.01% [kernel] [k] perf_event_task_tick
0.01% [kernel] [k] native_apic_mem_write
Output of gdb for the postgres startup process:
(gdb) thread apply all backtrace
Thread 1 (Thread 0x7fd06d28f7c0 (LWP 26949)):
#0 0x00000000005ec320 in DropRelFileNodeBuffers ()
#1 0x0000000000603115 in smgrdounlink ()
#2 0x000000000047d992 in ?? ()
#3 0x000000000048eca9 in StartupXLOG ()
#4 0x0000000000490c58 in StartupProcessMain ()
#5 0x00000000004a6faf in AuxiliaryProcessMain ()
#6 0x00000000005d31c3 in ?? ()
#7 0x00000000005d7372 in PostmasterMain ()
#8 0x000000000057e320 in main ()
Best Answer
Current PostgreSQL releases (true at least in PostgreSQL 9.4 and older) have single-threaded WAL recovery.
This means that replay of the write-ahead log occurs in only one recovery worker, which can benefit far less from I/O concurrency than a normally running master. As a result, WAL replay can lag behind even where the replica and master have similar hardware, and even when you'd expect the master (which is generally also under more load) to be the slower server.
(It'd be great to improve this, but nobody has wanted it enough to do the work yet, and nobody seems keen to fund it at the moment.)
That said, your case with very high CPU use doesn't seem to fit issues of low I/O concurrency. In your situation I'd attach gdb and see what the replay process was doing, or use perf top to examine what the system as a whole is up to, then dig deeper with perf once I had some clues. I'd also look closely at iotop, vmstat, iostat, the PostgreSQL logs, dmesg, etc.

After investigation
Profiles revealed that most time was being spent in DropRelFileNodeBuffers. That function does a linear scan through shared_buffers whenever a relfilenode is deleted - caused by TRUNCATE, DROP TABLE, CLUSTER, DROP INDEX, etc. This must be done during WAL replay, just as on the master. So this suggests that:

shared_buffers is probably very big; and

reducing shared_buffers on the replica may well help.
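To see why the scan dominates, here is a toy model (illustration only, not the real C code - names and numbers are made up) of what DropRelFileNodeBuffers effectively does: every dropped relfilenode forces one pass over the entire buffer array, so replaying N drop/truncate records costs N x (shared_buffers / 8 kB) inspections, no matter how few buffers actually belong to the dropped relations:

```python
BLOCK_SIZE = 8192  # PostgreSQL page size

def drop_rel_file_node_buffers(buffers, relfilenode):
    """Invalidate every buffer belonging to relfilenode via a full linear scan."""
    scanned = 0
    for buf in buffers:
        scanned += 1                       # every buffer descriptor is inspected
        if buf.get("relfilenode") == relfilenode:
            buf.clear()                    # few (if any) usually match

    return scanned

# Scaled-down demo: 128 MB of shared_buffers -> 16384 buffer descriptors.
n_buffers = 128 * 1024**2 // BLOCK_SIZE
buffers = [{} for _ in range(n_buffers)]   # empty dicts: nothing matches

# Replaying 100 TRUNCATE/DROP WAL records rescans the whole array each time.
total = sum(drop_rel_file_node_buffers(buffers, rfn) for rfn in range(100))
print(total)  # 100 * 16384 = 1638400 buffer inspections
```

With a realistic 64 GB shared_buffers that would be about 8.4 million descriptors scanned per dropped relation, all in the single recovery process - which is why shrinking shared_buffers on the replica reduces the replay cost directly.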