Postgres Latency Issues – Memory Compaction on EC2 (Ubuntu 14.04)

Tags: amazon ec2, linux, memory, postgresql, ubuntu

We recently upgraded the EC2 instance that hosts our Postgres database to an i2.8xlarge with 244GB of memory, mainly to make use of the large amount of ephemeral storage it comes with. Since upgrading, we've been having latency issues in Postgres that appear to be caused by memory compaction in the Linux kernel.

We're running PostgreSQL 9.3 on a recent Ubuntu 14.04 kernel with the following config (hopefully the relevant subset):

max_connections = 1000
effective_cache_size = '220GB'
shared_buffers = '24GB'
work_mem = '25MB'
maintenance_work_mem = '1024MB'
fsync = off
full_page_writes = on
synchronous_commit = off

We have transparent huge pages (THP) completely disabled on this server: /sys/kernel/mm/transparent_hugepage/enabled and /sys/kernel/mm/transparent_hugepage/defrag are both set to never, and /sys/kernel/mm/transparent_hugepage/khugepaged/defrag is set to 0. We're fairly sure THP isn't causing any of the issues, because the thp_* stats and the nr_anon_transparent_hugepages stat in /proc/vmstat never increment.
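For anyone wanting to repeat the THP check, here is a minimal Python sketch. It is not the tooling we actually used; the helper names (active_thp_value, thp_counters) are made up for illustration, and it assumes the usual sysfs convention of marking the active choice with square brackets.

```python
from pathlib import Path

THP_DIR = Path("/sys/kernel/mm/transparent_hugepage")
VMSTAT = Path("/proc/vmstat")


def active_thp_value(text):
    """Return the active choice from e.g. 'always madvise [never]'.

    Assumption: the active setting is the token wrapped in brackets.
    Files that hold a bare value (such as khugepaged/defrag) are
    returned stripped of whitespace.
    """
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return text.strip()


def thp_counters(vmstat_text):
    """Extract the THP-related counters from /proc/vmstat content."""
    counters = {}
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue
        name, value = parts
        if name.startswith("thp_") or name == "nr_anon_transparent_hugepages":
            counters[name] = int(value)
    return counters


if __name__ == "__main__":
    # Print the three settings mentioned above, where they exist.
    for rel in ("enabled", "defrag", "khugepaged/defrag"):
        path = THP_DIR / rel
        if path.exists():
            print(rel, "=", active_thp_value(path.read_text()))
    # Print the THP counters; run twice and compare to see if they move.
    if VMSTAT.exists():
        for name, value in sorted(thp_counters(VMSTAT.read_text()).items()):
            print(name, value)
```

Running it twice a few minutes apart and comparing the counter values is enough to confirm whether the THP stats are moving at all.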

Our issue is that we see constant memory compaction events (both failures and successes) in /proc/vmstat, with all the compact_* stats incrementing frequently. Some of these cause pretty severe stalls that impact our application, and the stalls get worse over time (presumably as memory fragmentation gets worse). We're also tracking the stats from /sys/kernel/debug/extfrag/unusable_index, and we often see a flurry of movement between the different page orders around the stall-causing events.
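The kind of tracking described above can be sketched roughly as follows. This is an illustration rather than our actual monitoring: the function names are hypothetical, and the unusable_index parser assumes each line looks like "Node 0, zone Normal 0.000 0.125 ..." with one value per page order. Reading that file needs debugfs mounted and usually root, so the sketch only parses text handed to it.

```python
import time
from pathlib import Path

VMSTAT = Path("/proc/vmstat")


def compact_counters(vmstat_text):
    """Extract the compact_* counters from /proc/vmstat content."""
    counters = {}
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0].startswith("compact_"):
            counters[parts[0]] = int(parts[1])
    return counters


def counter_deltas(before, after):
    """Return only the counters that changed between two samples."""
    deltas = {}
    for name, value in after.items():
        change = value - before.get(name, 0)
        if change:
            deltas[name] = change
    return deltas


def parse_unusable_index(text):
    """Map (node, zone) to the list of per-order index values.

    Assumption: lines have the form 'Node N, zone NAME v0 v1 ...'.
    """
    result = {}
    for line in text.splitlines():
        parts = line.replace(",", " ").split()
        if len(parts) < 5 or parts[0] != "Node" or parts[2] != "zone":
            continue
        result[(int(parts[1]), parts[3])] = [float(v) for v in parts[4:]]
    return result


def watch(interval=1.0, samples=2):
    """Print the compact_* deltas between consecutive samples."""
    previous = compact_counters(VMSTAT.read_text())
    for _ in range(samples - 1):
        time.sleep(interval)
        current = compact_counters(VMSTAT.read_text())
        print(time.strftime("%H:%M:%S"), counter_deltas(previous, current))
        previous = current


if __name__ == "__main__":
    if VMSTAT.exists():
        watch(interval=1.0, samples=2)
```

Lining up the timestamps of large deltas with application latency spikes is what points the finger at compaction rather than at Postgres itself.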

We're wondering whether this is just down to some combination of Postgres version, Linux kernel version and the large amount of memory involved. Most of the memory usage is obviously file cache, so Linux might be doing things with it that Postgres isn't happy about. So far we haven't come up with anything better than assuming that a more recent version of Postgres (9.4 or 9.5) might avoid the issue altogether for some reason.

$ uname -a
Linux db-04 3.13.0-91-generic #138-Ubuntu SMP Fri Jun 24 17:00:34 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
$ dpkg -l postgresql-9.3
postgresql-9.3     9.3.13-1.pgdg14.04+1

We also tried reducing effective_cache_size on the instance to 160GB to see if that would reduce memory pressure, but it didn't change much (if anything, it mostly seemed to make the stalling worse).

Have memory compaction stalls with Postgres been raised before, or does anyone have experience dealing with them?

Best Answer

As dezso mentioned in the question comments, this did seem to be an issue with the 3.13 kernel in Ubuntu Trusty (possibly only the more recent versions of it). We switched to the Xenial HWE 4.4 kernel on Trusty and the problem seems to have gone away: compaction stalls are now very small and no longer cause any trouble.
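If you want to flag hosts that are still on the older kernel series, a rough Python sketch could look like this. The 4.4 threshold is only the series that worked for us, not a verified minimum fixed version, and the function names are hypothetical.

```python
import platform
import re

# Assumption: 4.4 is simply the kernel series we moved to, not a
# confirmed first-fixed version.
WORKING_SERIES = (4, 4)


def kernel_major_minor(release):
    """Parse '3.13.0-91-generic' into (3, 13)."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    return int(match.group(1)), int(match.group(2))


def predates_working_series(release, threshold=WORKING_SERIES):
    """True if the release is older than the series that worked for us."""
    return kernel_major_minor(release) < threshold


if __name__ == "__main__":
    release = platform.release()
    try:
        print(release, "predates 4.4:", predates_working_series(release))
    except ValueError as exc:
        print(exc)
```

With the kernel from the question, predates_working_series("3.13.0-91-generic") is true, while a 4.4.x release string is not flagged.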