Linux Debian – Permanent Swapping with Lots of Free Memory

cgroups, debian, linux, lxc, swap

We have a Linux server running Debian with kernel 4.0.5 (Debian kernel package 4.0.0-2), 32G RAM installed and 16G of swap configured. The system uses lxc containers for compartmentalisation, but that shouldn't matter here. The issue exists both inside and outside of the different containers.

Here's a typical free -h:

              total        used        free      shared  buff/cache   available
Mem:            28G        2.1G         25G         15M        936M         26G
Swap:           15G        1.4G         14G

/proc/meminfo has

Committed_AS:   12951172 kB

So there is plenty of free memory, even if everything that has been committed were actually used at once. Nevertheless, the system keeps paging out even actively running processes.
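
For reference, all of these figures come straight from /proc/meminfo, so a quick way to compare the commit charge against what is actually available is the following (output will obviously differ on your system):

grep -E 'MemTotal|MemAvailable|SwapTotal|SwapFree|CommitLimit|Committed_AS' /proc/meminfo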

This is most notable with Gitlab, a Rails application using Unicorn: newly forked Unicorn workers are swapped out immediately. When a request comes in, the worker has to be read back from disk at ~1400 kB/s (data from iotop) and runs into the timeout (currently 30 s, so that stuck workers get restarted in time; no normal request should take more than 5 s) before it is fully loaded into memory, so it gets killed right away. Note that this is just an example; I have seen the same happen to redis, amavis, postgres, mysql, java (openjdk) and others.
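
To see which processes actually have pages in swap, reading VmSwap from /proc/<pid>/status works well; for example (just a generic sketch, nothing specific to this setup):

# list processes with non-zero swap usage, largest first (values in kB)
for f in /proc/[0-9]*/status; do
    awk '/^Name:/ {n=$2} /^VmSwap:/ {if ($2 > 0) print $2, n}' "$f"
done | sort -rn | head -20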

The system is otherwise in a low-load situation with about 5% CPU utilization and a loadavg around 2 (on 8 cores).

What we tried (in no particular order):

  1. swapoff -a: aborts with about 800M still swapped
  2. Reducing swappiness (in steps) using sysctl vm.swappiness=NN. This seems to have no impact at all; we went all the way down to 0 and the behaviour stayed exactly the same
  3. Stopping non-essential services (Gitlab, a Jetty-based webapp…), freeing ca. 8G of committed-but-not-mapped memory and bringing Committed_AS down to about 5G. No change at all.
  4. Clearing system caches using sync && echo 3 > /proc/sys/vm/drop_caches. This frees up memory, but does nothing to the swap situation.
  5. Combinations of the above (the individual commands are sketched below)
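
For completeness, this is roughly what steps 1, 2 and 4 looked like on the command line (a sketch, run as root; we checked /proc/swaps in between to see whether anything changed):

swapoff -a                                   # step 1: aborted with ~800M still swapped
swapon -a                                    # re-enable swap afterwards
sysctl vm.swappiness=0                       # step 2: no visible effect on the behaviour
sync && echo 3 > /proc/sys/vm/drop_caches    # step 4: frees caches, swap usage unchanged
cat /proc/swaps                              # check swap usage after each step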

Restarting the machine to completely disable swap via fstab as a test is not really an option: some services have availability requirements and need planned downtime rather than ad-hoc "poking around"… and we don't actually want to disable swap as a fallback anyway.

I don't see why any swapping is occurring here at all. Any ideas what may be going on?


This problem has existed for a while now, but it first showed up during a period of high IO load (a long-running background data processing task), so I can't pinpoint a specific triggering event. That task has been finished for several days and the problem persists, hence this question.

Best Answer

Remember how I said:

The system uses lxc containers for compartmentalisation, but that shouldn't matter here.

Well, turns out it did matter. Or rather, the cgroups at the heart of lxc matter.

The host machine only sees reboots for kernel upgrades. So, what were the last kernels used? 3.19, replaced two months ago by 4.0.5 and yesterday by 4.1.3. And what happened yesterday? Processes getting OOM-killed left, right and center. Checking /var/log/kern.log, the affected processes were in cgroups with a 512M memory limit. Wait, 512M? That can't be right (when the expected requirement is around 4G!). As it turns out, this is exactly what we configured in the lxc configs when we set all this up months ago.
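
To see what the containers' cgroups are actually limited to, you can read the memory controller directly. This assumes the cgroup-v1 layout used here, with the containers sitting under /sys/fs/cgroup/memory/lxc/:

# show the configured limit and current usage for every lxc memory cgroup
for d in /sys/fs/cgroup/memory/lxc/*/; do
    printf '%s\tlimit=%s\tusage=%s\n' "$(basename "$d")" \
        "$(cat "$d"memory.limit_in_bytes)" \
        "$(cat "$d"memory.usage_in_bytes)"
done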

So, what happened is this: kernel 3.19 effectively ignored the cgroup memory limit; 4.0.5 paged the cgroup out whenever it needed more memory than allowed (this is the core issue of this question); and only 4.1.3 actually performs a full OOM-killer sweep of the cgroup.
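
If you want to confirm that a particular container is the one bumping into its limit, the cgroup exposes counters for that as well (again cgroup v1; box1 is just an example container name, and total_swap only shows up if swap accounting is enabled):

cat /sys/fs/cgroup/memory/lxc/box1/memory.failcnt    # how often the limit was hit
grep -E '^total_(rss|cache|swap) ' /sys/fs/cgroup/memory/lxc/box1/memory.stat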

The swappiness of the host system had no influence on this, since the host never came anywhere near running out of physical memory.

The solution:

For a temporary change you can modify the cgroup directly. For an lxc container named box1 the cgroup is called lxc/box1, and you can execute (as root on the host machine):

$ echo 8G > /sys/fs/cgroup/memory/lxc/box1/memory.limit_in_bytes
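
The change takes effect immediately; reading the file back should show the new limit in bytes (8G = 8589934592):

$ cat /sys/fs/cgroup/memory/lxc/box1/memory.limit_in_bytes
8589934592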

The permanent solution is to configure the container correctly in /var/lib/lxc/...

lxc.cgroup.memory.limit_in_bytes = 8G
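
The config change only takes effect on the container's next start; on a running container the same value can be read (or set) with lxc-cgroup, for example:

lxc-cgroup -n box1 memory.limit_in_bytes        # read the current limit
lxc-cgroup -n box1 memory.limit_in_bytes 8G     # set it without restarting the container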

Moral of the story: always check your configuration, even if you think it can't possibly be the issue (and even when it takes a kernel bug or behaviour change for the misconfiguration to actually bite).