Linux – Does Opportunistic Swapping Occur or Is It a Myth?

linux memory swap

Suppose a program asks for some memory, but there is not enough free memory left. There are several different ways Linux could respond. One response is to select some other used memory, which has not been accessed recently, and move this inactive memory to swap.

However, I see many articles and comments that go beyond this. They say even when there is a large amount of free memory, Linux will sometimes decide to write inactive memory to swap. Writing to swap in advance means that when we eventually want to use this memory, we do not have to wait for a disk write. They say this is a deliberate strategy to optimize performance.

Are they right? Or is it a myth? Cite your source(s).

Please understand this question using the following definitions:

  • free memory – the "free" memory displayed by the free command. This is the MemFree value from /proc/meminfo. /proc/meminfo is a virtual text file provided by the kernel. See proc(5), or RHEL docs.
  • even when there is a large amount of free memory – for the purpose of argument, imagine there is more than 10% free memory.
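To make the definition concrete, here is a minimal sketch of how the "free memory" percentage could be computed from /proc/meminfo. The function names and the sample text are my own, for illustration only; the sample figures loosely mirror the 7GB-of-16GB case mentioned below and are not real output.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (mostly kB)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])
    return values


def free_fraction(meminfo):
    """Fraction of total memory reported as MemFree."""
    return meminfo["MemFree"] / meminfo["MemTotal"]


# Hypothetical sample, used only when /proc/meminfo cannot be read.
SAMPLE = """\
MemTotal:       16000000 kB
MemFree:         7000000 kB
MemAvailable:    9000000 kB
"""

if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as f:
            text = f.read()
    except OSError:
        text = SAMPLE
    info = parse_meminfo(text)
    print(f"MemFree is {free_fraction(info):.1%} of MemTotal")
```

Under the definition above, "a large amount of free memory" means this fraction is greater than 0.10.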

References

Here are some search terms: linux "opportunistic swapping" OR (swap "when the system has nothing better to do" OR "when it has nothing better to do" OR "when the system is idle" OR "during idle time")

In the second-highest result on Google, a StackExchange user asks "Why use swap when there is more than enough free space in RAM?", and copies the results of the free command showing about 20% free memory. In response to this specific question, I see this answer is highly voted:

Linux starts swapping before the RAM is filled up. This is done to
improve performance and responsiveness:

  • Performance is increased because sometimes RAM is better used for disk cache than to store program memory. So it's better to swap out a program that's been inactive for a while, and instead keep often-used files in cache.

  • Responsiveness is improved by swapping pages out when the system is idle, rather than when the memory is full and some program is running and requesting more RAM to complete a task.

Swapping does slow the system down, of course — but the alternative to
swapping isn't not swapping, it's having more RAM or using less RAM.

The first result on Google has been marked as a duplicate of the question above :-). In this case, the asker copied details showing 7GB MemFree, out of 16GB. The question has an accepted and upvoted answer of its own:

Swapping only when there is no free memory is only the case if you set swappiness to 0. Otherwise, during idle time, the kernel will swap memory. In doing this the data is not removed from memory, but rather a copy is made in the swap partition.

This means that, should the situation arise that memory is depleted, it does not have to write to disk then and there. In this case the kernel can just overwrite the memory pages which have already been swapped, for which it knows that it has a copy of the data.

The swappiness parameter basically just controls how much it does this.

The first quote does not explicitly claim that the swapped data is also retained in memory. But that seems like the approach you would prefer, if you are swapping even when 20% of memory is free and your reason for doing so is to improve performance.

As far as I know, Linux does support keeping a copy of the same data in both main memory and swap space.

I also noticed the common claim that "opportunistic swapping" happens "during idle time". I understand it's supposed to help reassure me that this feature is generally good for performance. I don't include this in my definition above, because I think it already has enough details to make a nice clear question. I don't want to make this more complicated than it needs to be.

Original motivation

atop shows `swout` (swapping) when I have gigabytes of free memory. Why?

There are a couple of reports like this, of Linux writing to swap when there is plenty of free memory. "Opportunistic swapping" might explain these reports. At the same time, at least one alternative cause was suggested. As a first step in looking at possible causes: Does Linux ever perform "opportunistic swapping" as defined above?

In the example I reported, the question has now been answered. The cause was not opportunistic swapping.

Best Answer

Linux does not do "opportunistic swapping" as defined in this question.


The following primary references do not mention the concept at all:

  1. Understanding the Linux Virtual Memory Manager. An online book by Mel Gorman. Written in 2003, just before the release of Linux 2.6.0.
  2. Documentation/admin-guide/sysctl/vm.rst. This is the primary documentation of the tunable settings of Linux virtual memory management.

More specifically:

10.6 Pageout Daemon (kswapd)

Historically kswapd used to wake up every 10 seconds but now it is only woken by the physical page allocator when the pages_low number of free pages in a zone is reached. [...] Under extreme memory pressure, processes will do the work of kswapd synchronously. [...] kswapd keeps freeing pages until the pages_high watermark is reached.

Based on the above, we would not expect any swapping when the number of free pages is higher than the "high watermark".

Secondly, this tells us the purpose of kswapd is to make more free pages.

When kswapd writes a memory page to swap, it immediately frees the memory page. kswapd does not keep a copy of the swapped page in memory.

Linux 2.6 uses the "rmap" to free the page. In Linux 2.4, the story was more complex. When a page was shared by multiple processes, kswapd was not able to free it immediately. This is ancient history. All of the linked posts are about Linux 2.6 or above.

swappiness

This control is used to define how aggressive the kernel will swap memory pages. Higher values will increase aggressiveness, lower values decrease the amount of swap. A value of 0 instructs the kernel not to initiate swap until the amount of free and file-backed pages is less than the high water mark in a zone.

This quote describes a special case: setting swappiness to 0. In this case, we should additionally not expect any swapping until the number of cache pages has fallen to the high watermark. In other words, the kernel will try to discard almost all file cache before it starts swapping. This might cause massive slowdowns: you need to have some file cache, because it holds the code of all your running programs :-)

What are the watermarks?

The above quotes raise the question: How large are the "watermark" memory reservations on my system? Answer: on a "small" system, the default zone watermarks might be as high as 3% of memory. This is due to the calculation of the "min" watermark. On larger systems the watermarks will be a smaller proportion, approaching 0.3% of memory.

So if the question is about a system with more than 10% free memory, the exact details of this watermark logic are not significant.

The watermarks for each individual "zone" are shown in /proc/zoneinfo, as documented in proc(5). An extract from my zoneinfo:

Node 0, zone    DMA32
  pages free     304988
        min      7250
        low      9062
        high     10874
        spanned  1044480
        present  888973
        managed  872457
        protection: (0, 0, 4424, 4424, 4424)
...
Node 0, zone   Normal
  pages free     11977
        min      9611
        low      12013
        high     14415
        spanned  1173504
        present  1173504
        managed  1134236
        protection: (0, 0, 0, 0, 0)

The current "watermarks" are min, low, and high. If a program ever asks for enough memory to reduce free below min, the program enters "direct reclaim". The program is made to wait while the kernel frees up memory.

We want to avoid direct reclaim if possible. So if free would dip below the low watermark, the kernel wakes kswapd. kswapd frees memory by swapping and/or dropping caches, until free is above high again.
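The watermark logic described above can be sketched as a small parser and classifier. This is a simplified illustration, not the kernel's actual logic: the function names and state labels are my own, and it ignores the lowmem_reserve protection. It uses the free/min/low/high lines from my zoneinfo extract.

```python
import re

# The relevant lines from the zoneinfo extract shown above.
ZONEINFO_EXTRACT = """\
Node 0, zone    DMA32
  pages free     304988
        min      7250
        low      9062
        high     10874
Node 0, zone   Normal
  pages free     11977
        min      9611
        low      12013
        high     14415
"""


def parse_zones(text):
    """Return {(node, zone_name): {"free", "min", "low", "high"}} from zoneinfo text."""
    zones = {}
    current = None
    for line in text.splitlines():
        header = re.match(r"Node\s+(\d+),\s+zone\s+(\S+)", line)
        if header:
            current = {}
            zones[(int(header.group(1)), header.group(2))] = current
            continue
        if current is None:
            continue
        fields = line.split()
        if len(fields) == 3 and fields[:2] == ["pages", "free"]:
            current["free"] = int(fields[2])
        elif (len(fields) == 2 and fields[0] in ("min", "low", "high")
              and fields[0] not in current):
            # Keep only the first occurrence: the zone-level watermark.
            current[fields[0]] = int(fields[1])
    return zones


def classify(zone):
    """Map a zone's free page count onto the states described in the text."""
    free = zone["free"]
    if free < zone["min"]:
        return "direct-reclaim"    # the allocating program must wait
    if free < zone["low"]:
        return "wake-kswapd"       # background reclaim is started
    if free < zone["high"]:
        return "kswapd-continues"  # kswapd, if running, keeps going
    return "no-reclaim"            # nothing to do


if __name__ == "__main__":
    for (node, name), zone in parse_zones(ZONEINFO_EXTRACT).items():
        print(f"node {node} zone {name}: {classify(zone)}")
```

On the extract above, DMA32 is far above its high watermark, while Normal (11977 free pages against a low watermark of 12013) is just below low, so kswapd would be expected to run for that zone.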


Additional qualification: kswapd will also run to protect the full lowmem_reserve amount, for kernel lowmem and DMA usage. The default lowmem_reserve is about 1/256 of the first 4GiB of RAM (DMA32 zone), so it is usually around 16MiB.

Linux code commits

mm: scale kswapd watermarks in proportion to memory

[...]

watermark_scale_factor:

This factor controls the aggressiveness of kswapd. It defines the amount of memory left in a node/system before kswapd is woken up and how much memory needs to be free before kswapd goes back to sleep.

The unit is in fractions of 10,000. The default value of 10 means the distances between watermarks are 0.1% of the available memory in the node/system. The maximum value is 1000, or 10% of memory.

A high rate of threads entering direct reclaim (allocstall) or kswapd going to sleep prematurely (kswapd_low_wmark_hit_quickly) can indicate that the number of free pages kswapd maintains for latency reasons is too small for the allocation bursts occurring in the system. This knob can then be used to tune kswapd aggressiveness accordingly.
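As a worked example of the "fractions of 10,000" unit, the sketch below computes the distance between watermarks for a zone. One caveat is not stated in the quote: as far as I can tell from the kernel's watermark setup code, the distance actually used is the larger of a quarter of min and the scale-factor share of the zone's managed pages. Treat that as my reading of the code. The function names are my own.

```python
def watermark_gap(min_pages, managed_pages, scale_factor=10):
    """Distance in pages between min and low, and between low and high.

    Assumption: the kernel uses the larger of min/4 and
    managed_pages * scale_factor / 10000 (integer arithmetic).
    """
    quarter_of_min = min_pages // 4
    scaled = managed_pages * scale_factor // 10000
    return max(quarter_of_min, scaled)


def watermarks(min_pages, managed_pages, scale_factor=10):
    """Derive the low and high watermarks from min under the assumption above."""
    gap = watermark_gap(min_pages, managed_pages, scale_factor)
    return {"min": min_pages, "low": min_pages + gap, "high": min_pages + 2 * gap}


if __name__ == "__main__":
    # min and managed values taken from the zoneinfo extract earlier in this answer.
    print("DMA32 :", watermarks(7250, 872457))
    print("Normal:", watermarks(9611, 1134236))
    # The documented maximum setting of 1000 (10% of memory).
    print("Normal, factor 1000:", watermarks(9611, 1134236, scale_factor=1000))
```

With the default factor of 10, this reproduces the low and high values in my zoneinfo extract (9062 and 10874 for DMA32, 12013 and 14415 for Normal). On this system the min/4 term is the larger one, so the default scale factor is not what determines the gaps.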

proc: meminfo: estimate available memory more conservatively

The MemAvailable item in /proc/meminfo is to give users a hint of how much memory is allocatable without causing swapping, so it excludes the zones' low watermarks as unavailable to userspace.

However, for a userspace allocation, kswapd will actually reclaim until the free pages hit a combination of the high watermark and the page allocator's lowmem protection that keeps a certain amount of DMA and DMA32 memory from userspace as well.

Subtract the full amount we know to be unavailable to userspace from the number of free pages when calculating MemAvailable.

Linux code

It is sometimes claimed that changing swappiness to 0 will effectively disable "opportunistic swapping". This provides an interesting avenue of investigation. If there is something called "opportunistic swapping", and it can be tuned by swappiness, then we could chase it down by finding all the call-chains that read vm_swappiness. Note we can reduce our search space by assuming CONFIG_MEMCG is not set (i.e. "memory cgroups" are disabled). The call chain goes:

shrink_node_memcg is commented "This is a basic per-node page freer. Used by both kswapd and direct reclaim". I.e. this function increases the number of free pages. It is not trying to duplicate pages to swap so they can be freed at a much later time. But even if we discount that:

The above chain is called from three different functions, shown below. As expected, we can divide the call-sites into direct reclaim vs. kswapd. It would not make sense to perform "opportunistic swapping" in direct reclaim.

  1. /*
     * This is the direct reclaim path, for page-allocating processes.  We only
     * try to reclaim pages from zones which will satisfy the caller's allocation
     * request.
     *
     * If a zone is deemed to be full of pinned pages then just give it a light
     * scan then give up on it.
     */
    static void shrink_zones
    
  2. /*
     * kswapd shrinks a node of pages that are at or below the highest usable
     * zone that is currently unbalanced.
     *
     * Returns true if kswapd scanned at least the requested number of pages to
     * reclaim or if the lack of progress was due to pages under writeback.
     * This is used to determine if the scanning priority needs to be raised.
     */
    static bool kswapd_shrink_node
    
  3. /*
     * For kswapd, balance_pgdat() will reclaim pages across a node from zones
     * that are eligible for use by the caller until at least one zone is
     * balanced.
     *
     * Returns the order kswapd finished reclaiming at.
     *
     * kswapd scans the zones in the highmem->normal->dma direction.  It skips
     * zones which have free_pages > high_wmark_pages(zone), but once a zone is
     * found to have free_pages <= high_wmark_pages(zone), any page in that zone
     * or lower is eligible for reclaim until at least one usable zone is
     * balanced.
     */
    static int balance_pgdat
    

So, presumably the claim is that kswapd is woken up somehow, even when all memory allocations are being satisfied immediately from free memory. I looked through the uses of wake_up_interruptible(&pgdat->kswapd_wait), and I am not seeing any wakeups like this.
