RHEL7 – Addressing Memory Fragmentation Issues

linuxmemoryrhel

I have a long-time working server application, which should work trouble-free for months. After moving an appliance to RHEL7, system started to suffer from memory fragmentation after ~2-3 days of usual load. There are a lot of "page allocation failure" messages from kernel, indicating the inability to allocate the 4 order pages in Normal Zone (while there are lots of low order pages) for almost each process. Here's an example:

kernel: [85531.010995] sh: page allocation failure: order:4, mode:0x2040d0
kernel: [85531.011000] CPU: 1 PID: 20846 Comm: sh Not tainted 3.10.0-693.el7.AV1.x86_64 #1
kernel: [85531.011002] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/21/2015
kernel: [85531.011003]  00000000002040d0 00000000d00413f4 ffff8800070ffa18 ffffffff816a3e1d
kernel: [85531.011006]  ffff8800070ffaa8 ffffffff81188d00 0000000000000000 ffff88023ffd8000
kernel: [85531.011008]  0000000000000004 00000000002040d0 ffff8800070ffaa8 00000000d00413f4
kernel: [85531.011010] Call Trace:
kernel: [85531.011018]  [<ffffffff816a3e1d>] dump_stack+0x19/0x1b
kernel: [85531.011023]  [<ffffffff81188d00>] warn_alloc_failed+0x110/0x180
kernel: [85531.011026]  [<ffffffff8169fe1a>] __alloc_pages_slowpath+0x6b6/0x724
kernel: [85531.011028]  [<ffffffff8118d275>] __alloc_pages_nodemask+0x405/0x420
kernel: [85531.011031]  [<ffffffff811d15f8>] alloc_pages_current+0x98/0x110
kernel: [85531.011035]  [<ffffffff811dc36c>] new_slab+0x2fc/0x310
kernel: [85531.011037]  [<ffffffff811ddbfc>] ___slab_alloc+0x3ac/0x4f0
kernel: [85531.011042]  [<ffffffff810850be>] ? copy_process+0x18e/0x19a0
kernel: [85531.011044]  [<ffffffff810850be>] ? copy_process+0x18e/0x19a0
kernel: [85531.011046]  [<ffffffff816a117e>] __slab_alloc+0x40/0x5c
kernel: [85531.011049]  [<ffffffff811e00cb>] kmem_cache_alloc_node+0x8b/0x200
kernel: [85531.011051]  [<ffffffff810850be>] copy_process+0x18e/0x19a0
kernel: [85531.011053]  [<ffffffff81086a81>] do_fork+0x91/0x320
kernel: [85531.011056]  [<ffffffff81086d96>] SyS_clone+0x16/0x20
kernel: [85531.011059]  [<ffffffff816b5259>] stub_clone+0x69/0x90
kernel: [85531.011061]  [<ffffffff816b4f09>] ? system_call_fastpath+0x16/0x1b
kernel: [85531.011062] Mem-Info:
kernel: [85531.011066] active_anon:1145227 inactive_anon:278512 isolated_anon:0
kernel: [85531.011066]  active_file:181319 inactive_file:185784 isolated_file:0
kernel: [85531.011066]  unevictable:2695 dirty:4333 writeback:0 unstable:0
kernel: [85531.011066]  slab_reclaimable:45889 slab_unreclaimable:54798
kernel: [85531.011066]  mapped:79471 shmem:52418 pagetables:11994 bounce:0
kernel: [85531.011066]  free:33850 free_pcp:0 free_cma:0
kernel: [85531.011069] Node 0 DMA free:15868kB min:132kB low:164kB high:196kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:8kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
kernel: [85531.011073] lowmem_reserve[]: 0 2809 7800 7800
kernel: [85531.011076] Node 0 DMA32 free:53892kB min:24292kB low:30364kB high:36436kB active_anon:1622080kB inactive_anon:516652kB active_file:203244kB inactive_file:212104kB unevictable:2312kB isolated(anon):0kB isolated(file):0kB present:3129280kB managed:2878656kB mlocked:2312kB dirty:6236kB writeback:0kB mapped:115972kB shmem:79808kB slab_reclaimable:77740kB slab_unreclaimable:90500kB kernel_stack:13680kB pagetables:17624kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
kernel: [85531.011080] lowmem_reserve[]: 0 0 4990 4990
kernel: [85531.011082] Node 0 Normal free:65640kB min:43152kB low:53940kB high:64728kB active_anon:2958828kB inactive_anon:597396kB active_file:522032kB inactive_file:531032kB unevictable:8468kB isolated(anon):0kB isolated(file):0kB present:5242880kB managed:5110372kB mlocked:8464kB dirty:11096kB writeback:0kB mapped:201912kB shmem:129864kB slab_reclaimable:105816kB slab_unreclaimable:128684kB kernel_stack:19936kB pagetables:30352kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
kernel: [85531.011085] lowmem_reserve[]: 0 0 0 0
kernel: [85531.011087] Node 0 DMA: 1*4kB (U) 1*8kB (U) 1*16kB (U) 1*32kB (U) 1*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15868kB
kernel: [85531.011095] Node 0 DMA32: 2946*4kB (UEM) 1995*8kB (UEM) 1241*16kB (UEM) 186*32kB (UEM) 9*64kB (U) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 54128kB
kernel: [85531.011102] Node 0 Normal: 16005*4kB (UEM) 248*8kB (UEM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 66004kB
kernel: [85531.011108] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
kernel: [85531.011109] 428930 total pagecache pages
kernel: [85531.011110] 8261 pages in swap cache
kernel: [85531.011111] Swap cache stats: add 51264, delete 43003, find 2892763/2894481
kernel: [85531.011112] Free swap  = 5078128kB
kernel: [85531.011113] Total swap = 5242876kB
kernel: [85531.011114] 2097038 pages RAM
kernel: [85531.011114] 0 pages HighMem/MovableOnly
kernel: [85531.011115] 95804 pages reserved
kernel: [85531.011116] SLUB: Unable to allocate memory on node -1 (gfp=0xd0)
kernel: [85531.011118]   cache: task_struct, object size: 45024, buffer size: 45024, default order: 4, min order: 4
kernel: [85531.011119]   node 0: slabs: 2114, objs: 2114, free: 0

Thus, I have some questions:

What can affect the memory fragmentation on the system?
Is it possible to determine what process is causing the fragmentation (e.g. what process is using 4 order pages most)?
And, of course, how can I tune the system to avoid the memory fragmentation?

UPD:

I've found out that CONFIG_COMPACTION option can help in my case, but cannot find how to enable it or check its current state. So, how can I check/enable it?

Everything worked fine before on RHEL6 and RHEL5.

# uname -a
Linux <hostname> 3.10.0-693.21.1.el7.AV1.x86_64 #1 SMP Thursday April 5, 2018 09:26:08 MDT x86_64 x86_64 x86_64 GNU/Linux

That is a VM running on ESXi 6.5

UPD1: The system is suffering from lack of order 4 pages again right now. The kernel message shows that at the moment of allocation there was enough pages in DMA32 zone, but 0 in Normal zone.

[82794.805373] Node 0 DMA: 1*4kB (U) 0*8kB 1*16kB (U) 1*32kB (U) 1*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15860kB
[82794.805384] Node 0 DMA32: 4528*4kB (UEM) 2604*8kB (UEM) 1544*16kB (UEM) 142*32kB (UE) 19*64kB (EM) 3*128kB (U) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 69792kB
[82794.805393] Node 0 Normal: 17041*4kB (UEM) 183*8kB (UEM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 69628kB

Is it possible to somehow make system allocate in DMA32? I'm not an expert in that sphere, so any information is appreciated.

UPD2 I've tried to play with kernel parameters such as vm.swappiness and vm.dirty_ratio but it only postponed failures occurrence. Also, increasing the amount of memory didn't help.

UPD3 Dropping kernel caches with echo 3 > /proc/sys/vm/drop_caches helps to avoid "page allocation failures" for a while. But I understand that it is not a permanent solution since it affects performance.

Best Answer

Trying to tune kernel to avoid page allocation failures I got to this values:

vm.swappiness=10
vm.dirty_ratio=20
vm.vfs_cache_pressure=400

With that configuration frequency of failures' occurrence decreases to minimum. Also, it was found that one of the high-loaded long-running processes leaks memory, which could be the reason for fragmentation as well..

Further investigations

Following my gut feeling that zram was behind this behavious, I setted up a VM with similar spec as your machine: 4 GB RAM and 2 GB zram swap, no swap file.

I have loaded the VM with heavy weight applications and got the following state:

huygens@ubuntu:~$ smem -wt -K ~/vmlinuz-3.2.0-38-generic.unpacked -R 4096M
Area                           Used      Cache   Noncache 
firmware/hardware            130717          0     130717 
kernel image                  13951          0      13951 
kernel dynamic memory       1063520     922172     141348 
userspace memory            2534684     257136    2277548 
free memory                  451432     451432          0 
----------------------------------------------------------
                            4194304    1630740    2563564 
huygens@ubuntu:~$ free -m
             total       used       free     shared    buffers     cached
Mem:          3954       3528        426          0         79        858
-/+ buffers/cache:       2589       1365
Swap:         1977          0       1977

As you can see free reports 858 MB cache memory and that is also what smem seems to report within the cached kernel dynamic memory.

Then I further stressed the system using Chromium Browser. At the beginning, it was only have 83 MB of swap used. But then after a few more tabs opened, the swap switch quickly to almost it's maximum and I experienced OOM! zram has really a dangerous side where wrongly configured (too big sizes) it can quickly hit you back like a trebuchet-like mechanism.

At that time I had the following outputs:

huygens@ubuntu:~$ smem -wt -K ~/vmlinuz-3.2.0-38-generic.unpacked -R 4096M
Area                           Used      Cache   Noncache 
firmware/hardware            130717          0     130717 
kernel image                  13951          0      13951 
kernel dynamic memory       1355344     124072    1231272 
userspace memory             961004      36456     924548 
free memory                 1733288    1733288          0 
----------------------------------------------------------
                            4194304    1893816    2300488 
huygens@ubuntu:~$ free -m
             total       used       free     shared    buffers     cached
Mem:          3954       2256       1698          0          4        132
-/+ buffers/cache:       2118       1835
Swap:         1977       1750        227

See how the kernel dynamic memory (columns cache and non-cache) look like inverted? It is because in the first case, the kernel had "cached" memory such as reported by free but then it had swap memory held by zram which smem does not know how to compute (check smem source code, zram occupation is not reported in /proc/meminfo, this it is not computed by smem which does simple "total kernel mem" - "type of memory reported by meminfo that I know are cache", what it does not know is that in the computed total kernel mem it has added the size of the swap which is in RAM!)

When I was in this state, I activated a hard disk swap and turned off the zram swap and I reset the zram devices: echo 1 > /sys/block/zram0/reset.

After that the noncache kernel memory melted like snow in summer and returned to "normal" value.

Conclusion

smem does not know about zram (yet) maybe because it is still staging and thus not part of /proc/meminfo which reports global parameters (like (in)active pages size, total memory) and then only report on a few specific parameters. smem identified a few of this specific parameters as "cache", sum them up and compare that to total memory. Because of that zram used memory gets counted in the noncache column.

Note: by the way, in modern kernel, meminfo reports also the shared memory consumed. smem does not take that yet into account, so even without zram the output of smem is to consider carefully esp. if you use application that make big use of shared memory.

References used:

Best Answer

Related Solutions

Linux – Defragging RAM / OOM failure

Linux – “kernel dynamic memory” as reported by smem

Further investigations

Conclusion

Related Question