Linux Memory – Why Atop Shows Swapping with Free Memory

Tags: atop, linux, memory, swap

Why did atop show that I was swapping out over 20,000 pages – over 80 megabytes – when I had gigabytes of free memory?

I have not noticed a performance problem with this. I simply wish to take the opportunity to increase my knowledge :-).

atop refreshes every ten seconds. Each refresh shows the activity since the last refresh.

MEM | tot     7.7G | free    3.7G | cache 608.3M | buff   19.1M | slab  264.6M |
SWP | tot     2.0G | free    1.4G |              | vmcom  13.4G | vmlim   5.8G |
PAG | scan  167264 | steal 109112 | stall      0 | swin       0 | swout  23834 |

                                "swout" is non-zero and coloured in red  ^

Kernel meminfo:

$ head -n 5 /proc/meminfo
MemTotal:        8042664 kB
MemFree:         3563088 kB
MemAvailable:    3922092 kB
Buffers:           20484 kB
Cached:           775308 kB

Kernel version:

$ uname -r
5.0.16-200.fc29.x86_64

  1. It is not clear that this would be affected by vm.swappiness. That setting balances cache reclaim vs. swapping. However, there is plenty of free memory, so why would the kernel need to reclaim any memory in the first place?

  2. As you can see, this is a small system. It does not use NUMA: I checked /proc/zoneinfo and there is only one node, "Node 0", so this is not caused by NUMA.

  3. Related questions and answers mention the idea of "opportunistic swapping", "when the system has nothing better to do", "which may provide a benefit if there's a memory shortage later", etc. I do not find these ideas credible, because they contradict the kernel documentation. See Does Linux perform "opportunistic swapping", or is it a myth?

  4. There are no limits on RAM usage set using systemd resource-control features. I.e. I think all systemd units have their memory limits set to "infinity".

    $ systemctl show '*' | \
        grep -E '(Memory|Swap).*(Max|Limit|High)' | \
        grep -v infinity
    $
    
  5. Edit: I suspect this is related to transparent huge pages. I note that virtual machines use transparent huge pages to allocate the guest memory efficiently; they are the only user programs that use huge pages on my system. (A quick check for this is sketched just after this list.)

    There is a similar-looking question: Can kswapd be active if free memory well exceeds pages_high watermark? It is asking about RHEL 6, which enables huge pages for all applications.
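
To check the transparent-huge-page suspicion (and, in passing, the single-node observation from item 2), the standard sysfs/procfs interfaces can be read directly; a minimal sketch:

$ # The active THP mode is the bracketed value; AnonHugePages shows how much
$ # anonymous memory is currently backed by transparent huge pages.
$ cat /sys/kernel/mm/transparent_hugepage/enabled
$ grep AnonHugePages /proc/meminfo
$ # Item 2 double-check: /proc/zoneinfo prints one "Node N, zone ..." header per zone.
$ grep '^Node' /proc/zoneinfo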

I am not sure exactly how to reproduce this result.

It happened when starting a VM. I use libvirt to run VMs. By default, VM disk reads are cached using the host page cache (cache mode "Hypervisor default", which means "Writeback").
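
For reference, the effective cache mode can be read from the libvirt domain XML; a sketch, assuming the domain is the debian9 guest that appears in the cgroup path further down:

$ # The cache= attribute on each disk's <driver> element is the effective mode;
$ # when it is absent, the hypervisor default (writeback for QEMU) applies.
$ virsh dumpxml debian9 | grep -E '<(driver|source) '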

I tried stopping the VM, dropping the image file from the page cache with POSIX_FADV_DONTNEED, and trying again, but the same swapping did not happen that time.
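
The cache-dropping step can be done from a shell; a sketch, assuming GNU coreutils and a hypothetical image path:

$ # Write back any dirty pages of the image first, then advise the kernel to
$ # drop its cached pages (dd's nocache flag issues POSIX_FADV_DONTNEED).
$ sync /var/lib/libvirt/images/debian9.qcow2
$ dd if=/var/lib/libvirt/images/debian9.qcow2 iflag=nocache count=0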

Then I tried again with a different VM, and it happened briefly. I captured the vmstat output below. I think atop showed a different, higher figure for "swout", but I did not capture it.

$ vmstat 10
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0 770168 5034300  28024 936256    0    2    21    86   60  101 22  9 68  1  0
 0  0 770168 5033852  28048 935904    0    0     0     8  372  633  1  1 97  0  0
 1  0 770168 4974048  28252 948152    3    0  1137   194 1859 2559 11  7 79  3  0
 0  1 770168 4557968  28308 1037512    0    0  8974    45 3595 6218 16  6 57 21  0
 6  3 770168 4317800  28324 1111408    0    0  7200   609 6820 6793 12  5 38 44  0
 0  4 770168 4125100  28348 1182624    0    0  6900   269 5054 3674 74  3  8 15  0
 0  5 770168 3898200  27840 1259768    2    0  9421   630 4694 5465 11  6 11 71  0
 1  3 770168 3760316  27672 1300540    0    0  9294   897 3308 4135  5  4 28 63  0
 0  1 770168 3531332  27032 1356236    0    0 10532   155 3140 4949  8  5 63 25  0
 0  0 783772 3381556  27032 1320296    0 1390  7320  4210 4450 5112 17  5 43 35  0
 0  0 783772 3446284  27056 1335116    0    0   239   441  995 1782  4  2 92  2  0
 0  0 783772 3459688  27076 1335372    0    0     3   410  728 1037  2  2 95  1  0

I also checked for a cgroup memory limit on the VM, on the off-chance that libvirt had bypassed systemd, and inflicted swapping on itself by mistake:

$ cd /sys/fs/cgroup/memory/machine.slice/machine-qemu\x2d5\x2ddebian9.scope
$ find -type d  # there were no sub-directories here
$ grep -H . *limit_in_bytes
memory.kmem.limit_in_bytes:9223372036854771712
memory.kmem.tcp.limit_in_bytes:9223372036854771712
memory.limit_in_bytes:9223372036854771712
memory.memsw.limit_in_bytes:9223372036854771712
memory.soft_limit_in_bytes:9223372036854771712
$ cd ../..
$ find -name "*limit_in_bytes" -exec grep -H -v 9223372036854771712 \{\} \;
$

Best Answer

I was pondering over a similar question -- you saw my thread about kswapd and zone watermarks -- and the answer in my case (and probably in yours as well) is memory fragmentation.

When memory is fragmented enough, higher-order allocations will fail, and this (depending on a number of additional factors) will either lead to direct reclaim, or will wake kswapd, which will attempt zone reclaim/compaction. You can find some additional details in my thread.

Another thing that may escape attention when dealing with such problems is memory zoning. That is, you may have enough memory overall (and it might even contain enough contiguous chunks), but it may be restricted to DMA32 (if you're on a 64-bit architecture). Some people tend to dismiss DMA32 as "small" (probably because they are used to 32-bit thinking), but 4 GB is not really "small".
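
Both fragmentation and per-zone shortage can be eyeballed from /proc/buddyinfo and /proc/zoneinfo; a sketch of what to look at (field layout varies a little between kernel versions):

$ # One line per zone (DMA, DMA32, Normal on typical x86-64); the columns are
$ # counts of free blocks of order 0, 1, 2, ... i.e. 4 KiB, 8 KiB, 16 KiB, ...
$ # Zeros in the higher-order columns mean that zone is fragmented.
$ cat /proc/buddyinfo
$ # Per-zone free pages and the min/low/high watermarks kswapd works against.
$ grep -E 'Node|pages free|min |low |high ' /proc/zoneinfo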

You have two ways of finding out for sure what's going on in your case. One is analyzing stats -- you can set up jobs to take periodic snapshots of /proc/buddyinfo, /proc/zoneinfo, /proc/vmstat etc., and try to make sense out of what you're seeing.
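
A minimal sketch of such a snapshot job (the log location and the 10-second interval, chosen to line up with atop's refresh, are arbitrary):

$ # Append a time-stamped copy of each counter file every 10 seconds; stop with
$ # Ctrl-C and compare the snapshots taken around the moment swout goes non-zero.
$ while sleep 10; do
      ts=$(date +%s)
      for f in buddyinfo zoneinfo vmstat; do
          { echo "=== $ts ==="; cat "/proc/$f"; } >> "/tmp/$f.log"
      done
  done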

The other way is more direct and reliable if you get it to work: you need to capture the codepaths that lead to swapout events, and you can do it using tracepoints the kernel is instrumented with (in particular, there are numerous vmscan events).

But getting it to work may be challenging, as low-level instrumentation doesn't always work the way it's supposed to out of the box. In my case, we had to spend some time setting up the ftrace infrastructure only to find out in the end that the function_graph probe we needed wasn't working for some reason. The next tool we tried was perf, and it wasn't successful on the first attempt either. But once you eventually manage to capture the events of interest, they are likely to lead you to the answer much faster than any global counters.
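
For the record, a sketch of the perf approach (the event names below are standard vmscan tracepoints; perf list shows what your kernel actually exposes):

$ # See which vmscan tracepoints this kernel provides.
$ sudo perf list 'vmscan:*'
$ # Record them system-wide with call graphs while reproducing the swapout,
$ # then inspect the captured stacks.
$ sudo perf record -a -g \
    -e vmscan:mm_vmscan_wakeup_kswapd \
    -e vmscan:mm_vmscan_kswapd_wake \
    -e vmscan:mm_vmscan_direct_reclaim_begin \
    -- sleep 60
$ sudo perf script | less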

Best regards, Nikolai
