Linux Debian – Permanent Swapping with Lots of Free Memory

cgroups, debian, linux, lxc, swap

We have a Linux server running Debian with kernel 4.0.5 (Debian kernel package 4.0.0-2), 32G RAM installed and 16G of swap configured. The system uses lxc containers for compartmentalisation, but that shouldn't matter here. The issue exists both inside and outside of the different containers.

Here's a typical free -h:

              total        used        free      shared  buff/cache   available
Mem:            28G        2.1G         25G         15M        936M         26G
Swap:           15G        1.4G         14G

/proc/meminfo has

Committed_AS:   12951172 kB

So there is plenty of free memory, even if everything that has been committed were actually used at once. Nevertheless, the system keeps paging out even actively running processes.
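
For reference, all of these figures come straight from /proc/meminfo, so a quick way to compare the commit charge against what is actually available is the following (output will obviously differ on your system):

grep -E 'MemTotal|MemAvailable|SwapTotal|SwapFree|CommitLimit|Committed_AS' /proc/meminfo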

This is most notable with Gitlab, a Rails application using Unicorn: newly forked Unicorn workers are swapped out immediately. When a request comes in, the worker has to be read back from disk at ~1400 kB/s (data from iotop) and runs into the timeout (currently 30 s, so that stuck workers get restarted in time; no normal request should take more than 5 s) before it is fully loaded into memory, so it gets killed right away. Note that this is just an example; I have seen the same happen to redis, amavis, postgres, mysql, java (openjdk) and others.
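
To see which processes actually have pages in swap, reading VmSwap from /proc/<pid>/status works well; for example (just a generic sketch, nothing specific to this setup):

# list processes with non-zero swap usage, largest first (values in kB)
for f in /proc/[0-9]*/status; do
    awk '/^Name:/ {n=$2} /^VmSwap:/ {if ($2 > 0) print $2, n}' "$f"
done | sort -rn | head -20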

The system is otherwise in a low-load situation with about 5% CPU utilization and a loadavg around 2 (on 8 cores).

What we tried (in no particular order):

  1. swapoff -a: aborts with about 800M still swapped
  2. Reducing swappiness (in steps) using sysctl vm.swappiness=NN. This seems to have no impact at all; we went all the way down to 0 and the behaviour stayed exactly the same
  3. Stopping non-essential services (Gitlab, a Jetty-based webapp…), freeing ca. 8G of committed-but-not-mapped memory and bringing Committed_AS down to about 5G. No change at all.
  4. Clearing system caches using sync && echo 3 > /proc/sys/vm/drop_caches. This frees up memory, but does nothing to the swap situation.
  5. Combinations of the above (the individual commands are sketched below)
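
For completeness, this is roughly what steps 1, 2 and 4 looked like on the command line (a sketch, run as root; we checked /proc/swaps in between to see whether anything changed):

swapoff -a                                   # step 1: aborted with ~800M still swapped
swapon -a                                    # re-enable swap afterwards
sysctl vm.swappiness=0                       # step 2: no visible effect on the behaviour
sync && echo 3 > /proc/sys/vm/drop_caches    # step 4: frees caches, swap usage unchanged
cat /proc/swaps                              # check swap usage after each step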

Restarting the machine to completely disable swap via fstab as a test is not really an option: some services have availability requirements and need planned downtime rather than ad-hoc "poking around"… and we don't actually want to disable swap as a fallback anyway.

I don't see why any swapping is occurring here at all. Any ideas what may be going on?


This problem has existed for a while now, but it first showed up during a period of high IO load (a long-running background data processing task), so I can't pinpoint a specific triggering event. That task has been finished for several days and the problem persists, hence this question.

Best Answer

Remember how I said:

The system uses lxc containers for compartmentalisation, but that shouldn't matter here.

Well, turns out it did matter. Or rather, the cgroups at the heart of lxc matter.

The host machine only sees reboots for kernel upgrades. So, what were the last kernels used? 3.19, replaced two months ago by 4.0.5 and yesterday by 4.1.3. And what happened yesterday? Processes getting OOM-killed left, right and center. Checking /var/log/kern.log, the affected processes were in cgroups with a 512M memory limit. Wait, 512M? That can't be right (when the expected requirement is around 4G!). As it turns out, this is exactly what we configured in the lxc configs when we set all this up months ago.
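
To see what the containers' cgroups are actually limited to, you can read the memory controller directly. This assumes the cgroup-v1 layout used here, with the containers sitting under /sys/fs/cgroup/memory/lxc/:

# show the configured limit and current usage for every lxc memory cgroup
for d in /sys/fs/cgroup/memory/lxc/*/; do
    printf '%s\tlimit=%s\tusage=%s\n' "$(basename "$d")" \
        "$(cat "$d"memory.limit_in_bytes)" \
        "$(cat "$d"memory.usage_in_bytes)"
done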

So, what happened is this: kernel 3.19 effectively ignored the cgroup memory limit; 4.0.5 paged the cgroup out whenever it needed more memory than allowed (this is the core issue of this question); and only 4.1.3 actually performs a full OOM-killer sweep of the cgroup.
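
If you want to confirm that a particular container is the one bumping into its limit, the cgroup exposes counters for that as well (again cgroup v1; box1 is just an example container name, and total_swap only shows up if swap accounting is enabled):

cat /sys/fs/cgroup/memory/lxc/box1/memory.failcnt    # how often the limit was hit
grep -E '^total_(rss|cache|swap) ' /sys/fs/cgroup/memory/lxc/box1/memory.stat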

The swappiness of the host system had no influence on this, since the host never came anywhere near running out of physical memory.

The solution:

For a temporary change you can modify the cgroup directly. For an lxc container named box1 the cgroup is called lxc/box1, and you can execute (as root on the host machine):

$ echo 8G > /sys/fs/cgroup/memory/lxc/box1/memory.limit_in_bytes
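
The change takes effect immediately; reading the file back should show the new limit in bytes (8G = 8589934592):

$ cat /sys/fs/cgroup/memory/lxc/box1/memory.limit_in_bytes
8589934592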

The permanent solution is to configure the container correctly in /var/lib/lxc/...

lxc.cgroup.memory.limit_in_bytes = 8G
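
The config change only takes effect on the container's next start; on a running container the same value can be read (or set) with lxc-cgroup, for example:

lxc-cgroup -n box1 memory.limit_in_bytes        # read the current limit
lxc-cgroup -n box1 memory.limit_in_bytes 8G     # set it without restarting the container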

Moral of the story: always check your configuration, even if you think it can't possibly be the issue (and even when it takes a kernel bug or behaviour change for the misconfiguration to actually bite).