Linux Memory – Why Atop Shows Swapping with Free Memory

Tags: atop, linux, memory, swap

Why did atop show that I was swapping out over 20,000 pages – over 80 megabytes – when I had gigabytes of free memory?

I have not noticed a performance problem with this. I simply wish to take the opportunity to increase my knowledge :-).

atop refreshes every ten seconds. Each refresh shows the activity since the last refresh.

MEM | tot     7.7G | free    3.7G | cache 608.3M | buff   19.1M | slab  264.6M |
SWP | tot     2.0G | free    1.4G |              | vmcom  13.4G | vmlim   5.8G |
PAG | scan  167264 | steal 109112 | stall      0 | swin       0 | swout  23834 |

                                "swout" is non-zero and coloured in red  ^

Kernel meminfo:

$ head -n 5 /proc/meminfo
MemTotal:        8042664 kB
MemFree:         3563088 kB
MemAvailable:    3922092 kB
Buffers:           20484 kB
Cached:           775308 kB

Kernel version:

$ uname -r
5.0.16-200.fc29.x86_64

  1. It is not clear that this would be affected by vm.swappiness. That setting balances cache reclaim vs. swapping. However, there is plenty of free memory, so why would the kernel need to reclaim any memory in the first place?

  2. As you can see, this is a small system. It does not use NUMA: I checked /proc/zoneinfo and there is only one node, "Node 0", so this is not caused by NUMA.

  3. Related questions and answers mention the idea of "opportunistic swapping", "when the system has nothing better to do", "which may provide a benefit if there's a memory shortage later", etc. I do not find these ideas credible, because they contradict the kernel documentation. See Does Linux perform "opportunistic swapping", or is it a myth?

  4. There are no limits on RAM usage set using systemd resource-control features. I.e. I think all systemd units have their memory limits set to "infinity".

    $ systemctl show '*' | \
        grep -E '(Memory|Swap).*(Max|Limit|High)' | \
        grep -v infinity
    $
    
  5. Edit: I suspect this is related to transparent huge pages. I note that virtual machines use transparent huge pages to allocate the guest memory efficiently; they are the only user programs that use huge pages on my system. (A quick check for this is sketched just after this list.)

    There is a similar-looking question: Can kswapd be active if free memory well exceeds pages_high watermark? It is asking about RHEL 6, which enables huge pages for all applications.
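
To check the transparent-huge-page suspicion (and, in passing, the single-node observation from item 2), the standard sysfs/procfs interfaces can be read directly; a minimal sketch:

$ # The active THP mode is the bracketed value; AnonHugePages shows how much
$ # anonymous memory is currently backed by transparent huge pages.
$ cat /sys/kernel/mm/transparent_hugepage/enabled
$ grep AnonHugePages /proc/meminfo
$ # Item 2 double-check: /proc/zoneinfo prints one "Node N, zone ..." header per zone.
$ grep '^Node' /proc/zoneinfo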

I am not sure exactly how to reproduce this result.

It happened when starting a VM. I use libvirt to run VMs. By default, VM disk reads are cached using the host page cache (cache mode "Hypervisor default", which means "Writeback").
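
For reference, the effective cache mode can be read from the libvirt domain XML; a sketch, assuming the domain is the debian9 guest that appears in the cgroup path further down:

$ # The cache= attribute on each disk's <driver> element is the effective mode;
$ # when it is absent, the hypervisor default (writeback for QEMU) applies.
$ virsh dumpxml debian9 | grep -E '<(driver|source) '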

I tried stopping the VM, dropping the image file from the page cache with POSIX_FADV_DONTNEED, and trying again, but the same swapping did not happen that time.
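
The cache-dropping step can be done from a shell; a sketch, assuming GNU coreutils and a hypothetical image path:

$ # Write back any dirty pages of the image first, then advise the kernel to
$ # drop its cached pages (dd's nocache flag issues POSIX_FADV_DONTNEED).
$ sync /var/lib/libvirt/images/debian9.qcow2
$ dd if=/var/lib/libvirt/images/debian9.qcow2 iflag=nocache count=0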

Then I tried again with a different VM, and it happened briefly. I captured the vmstat output below. I think atop showed a different, higher figure for "swout", but I did not capture it.

$ vmstat 10
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0 770168 5034300  28024 936256    0    2    21    86   60  101 22  9 68  1  0
 0  0 770168 5033852  28048 935904    0    0     0     8  372  633  1  1 97  0  0
 1  0 770168 4974048  28252 948152    3    0  1137   194 1859 2559 11  7 79  3  0
 0  1 770168 4557968  28308 1037512    0    0  8974    45 3595 6218 16  6 57 21  0
 6  3 770168 4317800  28324 1111408    0    0  7200   609 6820 6793 12  5 38 44  0
 0  4 770168 4125100  28348 1182624    0    0  6900   269 5054 3674 74  3  8 15  0
 0  5 770168 3898200  27840 1259768    2    0  9421   630 4694 5465 11  6 11 71  0
 1  3 770168 3760316  27672 1300540    0    0  9294   897 3308 4135  5  4 28 63  0
 0  1 770168 3531332  27032 1356236    0    0 10532   155 3140 4949  8  5 63 25  0
 0  0 783772 3381556  27032 1320296    0 1390  7320  4210 4450 5112 17  5 43 35  0
 0  0 783772 3446284  27056 1335116    0    0   239   441  995 1782  4  2 92  2  0
 0  0 783772 3459688  27076 1335372    0    0     3   410  728 1037  2  2 95  1  0

I also checked for a cgroup memory limit on the VM, on the off-chance that libvirt had bypassed systemd, and inflicted swapping on itself by mistake:

$ cd /sys/fs/cgroup/memory/machine.slice/machine-qemu\x2d5\x2ddebian9.scope
$ find -type d  # there were no sub-directories here
$ grep -H . *limit_in_bytes
memory.kmem.limit_in_bytes:9223372036854771712
memory.kmem.tcp.limit_in_bytes:9223372036854771712
memory.limit_in_bytes:9223372036854771712
memory.memsw.limit_in_bytes:9223372036854771712
memory.soft_limit_in_bytes:9223372036854771712
$ cd ../..
$ find -name "*limit_in_bytes" -exec grep -H -v 9223372036854771712 \{\} \;
$

Best Answer

I was pondering over a similar question -- you saw my thread about kswapd and zone watermarks -- and the answer in my case (and probably in yours as well) is memory fragmentation.

When memory is fragmented enough, higher-order allocations will fail, and this (depending on a number of additional factors) will either lead to direct reclaim, or will wake kswapd, which will attempt zone reclaim/compaction. You can find some additional details in my thread.

Another thing that may escape attention when dealing with such problems is memory zoning. That is, you may have enough memory overall (and it might even contain enough contiguous chunks), but it may be restricted to DMA32 (if you're on a 64-bit architecture). Some people tend to dismiss DMA32 as "small" (probably because they are used to 32-bit thinking), but 4 GB is not really "small".
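
Both fragmentation and per-zone shortage can be eyeballed from /proc/buddyinfo and /proc/zoneinfo; a sketch of what to look at (field layout varies a little between kernel versions):

$ # One line per zone (DMA, DMA32, Normal on typical x86-64); the columns are
$ # counts of free blocks of order 0, 1, 2, ... i.e. 4 KiB, 8 KiB, 16 KiB, ...
$ # Zeros in the higher-order columns mean that zone is fragmented.
$ cat /proc/buddyinfo
$ # Per-zone free pages and the min/low/high watermarks kswapd works against.
$ grep -E 'Node|pages free|min |low |high ' /proc/zoneinfo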

You have two ways of finding out for sure what's going on in your case. One is analyzing stats -- you can set up jobs to take periodic snapshots of /proc/buddyinfo, /proc/zoneinfo, /proc/vmstat etc., and try to make sense out of what you're seeing.
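
A minimal sketch of such a snapshot job (the log location and the 10-second interval, chosen to line up with atop's refresh, are arbitrary):

$ # Append a time-stamped copy of each counter file every 10 seconds; stop with
$ # Ctrl-C and compare the snapshots taken around the moment swout goes non-zero.
$ while sleep 10; do
      ts=$(date +%s)
      for f in buddyinfo zoneinfo vmstat; do
          { echo "=== $ts ==="; cat "/proc/$f"; } >> "/tmp/$f.log"
      done
  done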

The other way is more direct and reliable if you get it to work: you need to capture the codepaths that lead to swapout events, and you can do it using tracepoints the kernel is instrumented with (in particular, there are numerous vmscan events).

But getting it to work may be challenging, as low-level instrumentation doesn't always work the way it's supposed to out of the box. In my case, we had to spend some time setting up the ftrace infrastructure only to find out in the end that the function_graph probe we needed wasn't working for some reason. The next tool we tried was perf, and it wasn't successful on the first attempt either. But once you eventually manage to capture the events of interest, they are likely to lead you to the answer much faster than any global counters.
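
For the record, a sketch of the perf approach (the event names below are standard vmscan tracepoints; perf list shows what your kernel actually exposes):

$ # See which vmscan tracepoints this kernel provides.
$ sudo perf list 'vmscan:*'
$ # Record them system-wide with call graphs while reproducing the swapout,
$ # then inspect the captured stacks.
$ sudo perf record -a -g \
    -e vmscan:mm_vmscan_wakeup_kswapd \
    -e vmscan:mm_vmscan_kswapd_wake \
    -e vmscan:mm_vmscan_direct_reclaim_begin \
    -- sleep 60
$ sudo perf script | less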

Best regards, Nikolai
