shared Memory used (mostly) by tmpfs (Shmem in /proc/meminfo, available on kernels 2.6.32, displayed as zero if not available)
So the manpage definition of Shared is not as helpful as it could be :(. If the tmpfs use does not reflect this high value of Shared, then the value must represent some process(es) "who did mmap() with MAP_SHARED|MAP_ANONYMOUS" (or System V shared memory).
6G of shared memory on an 8G system is still a lot. Seriously, you don't want that, at least not on a desktop.
It's weird that it seems to contribute to "buff/cache" as well. But I did a quick test with python and that's just how it works.
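The quick test mentioned above can be reproduced with something like the following sketch (assumptions: a Linux kernel exposing the Shmem field in /proc/meminfo, and CPython's mmap, which uses MAP_SHARED by default and adds MAP_ANONYMOUS when the fileno is -1):

```python
# Map 64 MiB of MAP_SHARED|MAP_ANONYMOUS memory, touch every page,
# and watch the Shmem counter in /proc/meminfo grow. (Linux only.)
import mmap

def read_shmem_kib():
    """Return the Shmem value from /proc/meminfo, in KiB."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("Shmem:"):
                return int(line.split()[1])
    raise RuntimeError("Shmem not found (kernel too old?)")

SIZE = 64 * 1024 * 1024  # 64 MiB

before = read_shmem_kib()
m = mmap.mmap(-1, SIZE)  # anonymous mapping; MAP_SHARED is the default
m.write(b"x" * SIZE)     # touch every page so it is actually allocated
after = read_shmem_kib()

print(f"Shmem grew by about {(after - before) // 1024} MiB")
m.close()
```

While the mapping is held, the same pages also show up under "buff/cache" in free, because shared anonymous memory is implemented on top of tmpfs internally.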
To show the processes with the most shared memory, use top -o SHR -n 1.
System V shared memory
Finally it's possible you have some horrible legacy software that uses System V shared memory segments. If they get leaked, they won't show up in top :(.
You can list them with ipcs -m -t. Hopefully the most recently created one is still in use. Take the shmid number and e.g.
$ ipcs -m -t

------ Shared Memory Attach/Detach/Change Times --------
shmid      owner      attached             detached             changed
3538944    alan       Apr 30 20:35:15      Apr 30 20:35:15      Apr 30 16:07:41
3145729    alan       Apr 30 20:35:15      Apr 30 20:35:15      Apr 30 15:04:09
4587522    alan       Apr 30 20:37:38      Not set              Apr 30 20:37:38
$ sudo grep 4587522 /proc/*/maps
Then the numbers shown in the /proc paths are the PIDs of the processes using the SHM segment. (So you could e.g. grep the output of ps for that PID.)
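The grep step above works because, in /proc/&lt;pid&gt;/maps output, a System V segment appears as a "/SYSV&lt;key&gt;" pseudo-file whose inode column is the shmid. A hedged sketch of the same matching, over maps-format lines (the sample pids and addresses here are made up for illustration):

```python
# Given maps-file contents per pid, find which pids have a SysV shared
# memory segment with the given shmid attached.
# maps line format: addr perms offset dev inode pathname
def pids_using_shmid(maps_by_pid, shmid):
    """maps_by_pid: dict of pid -> contents of /proc/<pid>/maps as text."""
    pids = []
    for pid, text in maps_by_pid.items():
        for line in text.splitlines():
            fields = line.split()
            if len(fields) >= 6 and fields[4] == str(shmid) and fields[5].startswith("/SYSV"):
                pids.append(pid)
                break
    return pids

sample = {
    1234: "7f0000000000-7f0000400000 rw-s 00000000 00:01 4587522 /SYSV0000000a (deleted)",
    5678: "7f1000000000-7f1000001000 r-xp 00000000 08:01 131072 /usr/lib/libc.so.6",
}
print(pids_using_shmid(sample, 4587522))  # -> [1234]
```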
Apparent contradictions
Xorg has 8G mapped, even though you don't have separate video card RAM. It only has 150M resident. It's not that the rest is swapped out, because you don't have enough swap space.
The SHM segments shown by ipcs are all attached to two processes. So none of them have leaked, and they should all show up in the SHR column of top (double-counted even). It's ok if the number of pages used is less than the size of the memory segment; that just means there are pages that haven't been used. But free says we have 6GB of allocated shared memory to account for, and we can't find that.
Linux does not do "opportunistic swapping" as defined in this question.
The following primary references do not mention the concept at all:
- Understanding the Linux Virtual Memory Manager. An online book by Mel Gorman. Written in 2003, just before the release of Linux 2.6.0.
- Documentation/admin-guide/sysctl/vm.rst. This is the primary documentation of the tunable settings of Linux virtual memory management.
More specifically:
10.6 Pageout Daemon (kswapd)
Historically kswapd used to wake up every 10 seconds but now it is only woken by the physical page allocator when the pages_low number of free pages in a zone is reached. [...] Under extreme memory pressure, processes will do the work of kswapd synchronously. [...] kswapd keeps freeing pages until the pages_high watermark is reached.
Based on the above, we would not expect any swapping when the number of free pages is higher than the "high watermark".
Secondly, this tells us the purpose of kswapd is to make more free pages.
When kswapd writes a memory page to swap, it immediately frees the memory page. kswapd does not keep a copy of the swapped page in memory.
Linux 2.6 uses the "rmap" to free the page. In Linux 2.4, the story was more complex. When a page was shared by multiple processes, kswapd was not able to free it immediately. This is ancient history. All of the linked posts are about Linux 2.6 or above.
swappiness
This control is used to define how aggressive the kernel will swap
memory pages. Higher values will increase aggressiveness, lower values
decrease the amount of swap. A value of 0 instructs the kernel not to
initiate swap until the amount of free and file-backed pages is less
than the high water mark in a zone.
This quote describes a special case: if you configure the swappiness value to be 0. In this case, we should additionally not expect any swapping until the number of cache pages has fallen to the high watermark. In other words, the kernel will try to discard almost all file cache before it starts swapping. (This might cause massive slowdowns. You need to have some file cache! The file cache is used to hold the code of all your running programs :-)
What are the watermarks?
The above quotes raise the question: How large are the "watermark" memory reservations on my system? Answer: on a "small" system, the default zone watermarks might be as high as 3% of memory. This is due to the calculation of the "min" watermark. On larger systems the watermarks will be a smaller proportion, approaching 0.3% of memory.
So if the question is about a system with more than 10% free memory, the exact details of this watermark logic are not significant.
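The scaling of the "min" watermark can be sketched numerically. My assumption here is the calculation in mm/page_alloc.c (init_per_zone_wmark_min), which sets min_free_kbytes to roughly 4 * sqrt(lowmem_kbytes), clamped to a fixed range; the exact clamp values have varied across kernel versions, and the percentages quoted above also fold in per-zone effects on top of this:

```python
# Hypothetical reimplementation of the min_free_kbytes heuristic:
# 4 * sqrt(lowmem in KiB), clamped to [128, 65536] KiB.
import math

def min_free_kbytes(ram_kib):
    v = int(4 * math.sqrt(ram_kib))
    return max(128, min(v, 65536))

for ram_mib in (512, 4096, 65536):
    kib = ram_mib * 1024
    print(f"{ram_mib:6d} MiB RAM -> min_free_kbytes = {min_free_kbytes(kib)}")
```

Because the value grows with the square root of RAM, the reserved fraction shrinks as systems get larger, which is why the watermarks only matter in relative terms on small machines.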
The watermarks for each individual "zone" are shown in /proc/zoneinfo, as documented in proc(5). An extract from my zoneinfo:
Node 0, zone    DMA32
  pages free     304988
        min      7250
        low      9062
        high     10874
        spanned  1044480
        present  888973
        managed  872457
        protection: (0, 0, 4424, 4424, 4424)
...
Node 0, zone   Normal
  pages free     11977
        min      9611
        low      12013
        high     14415
        spanned  1173504
        present  1173504
        managed  1134236
        protection: (0, 0, 0, 0, 0)
The current "watermarks" are min, low, and high. If a program ever asks for enough memory to reduce free below min, the program enters "direct reclaim". The program is made to wait while the kernel frees up memory.
We want to avoid direct reclaim if possible. So if free would dip below the low watermark, the kernel wakes kswapd. kswapd frees memory by swapping and/or dropping caches, until free is above high again.
Additional qualification: kswapd will also run to protect the full lowmem_reserve amount, for kernel lowmem and DMA usage. The default lowmem_reserve is about 1/256 of the first 4GiB of RAM (DMA32 zone), so it is usually around 16MiB.
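The zoneinfo text above can be compared against these rules mechanically. A small parsing sketch, tested here against the Normal zone from the extract rather than a live system (field names as in /proc/zoneinfo on recent kernels):

```python
# Parse zoneinfo-format text into {zone: {counter: pages}}.
def parse_zoneinfo(text):
    zones = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Node"):
            # e.g. "Node 0, zone   Normal"
            current = line.split("zone")[1].strip()
            zones[current] = {}
        elif current is not None:
            parts = line.replace("pages free", "free").split()
            if len(parts) == 2 and parts[1].isdigit():
                zones[current][parts[0]] = int(parts[1])
    return zones

sample = """\
Node 0, zone   Normal
  pages free     11977
        min      9611
        low      12013
        high     14415
"""
zones = parse_zoneinfo(sample)
z = zones["Normal"]
# free (11977) is below the low watermark (12013) but above min (9611):
# kswapd would be woken, but no direct reclaim yet.
print(z["free"] < z["low"], z["free"] > z["min"])
```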
Linux code commits
mm: scale kswapd watermarks in proportion to memory
[...]
watermark_scale_factor:
This factor controls the aggressiveness of kswapd. It defines the
amount of memory left in a node/system before kswapd is woken up and
how much memory needs to be free before kswapd goes back to sleep.
The unit is in fractions of 10,000. The default value of 10 means the
distances between watermarks are 0.1% of the available memory in the
node/system. The maximum value is 1000, or 10% of memory.
A high rate of threads entering direct reclaim (allocstall) or kswapd
going to sleep prematurely (kswapd_low_wmark_hit_quickly) can indicate
that the number of free pages kswapd maintains for latency reasons is
too small for the allocation bursts occurring in the system. This knob
can then be used to tune kswapd aggressiveness accordingly.
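The arithmetic in the quoted text can be sketched as follows. My assumption is that, since this commit, __setup_per_zone_wmarks computes the gap between successive watermarks as the larger of min_wmark/4 and managed_pages * watermark_scale_factor / 10000:

```python
# Gap between successive zone watermarks (assumed formula, see lead-in).
def watermark_gap_pages(min_wmark, managed_pages, watermark_scale_factor=10):
    return max(min_wmark >> 2, managed_pages * watermark_scale_factor // 10000)

# Values from the Normal zone in the /proc/zoneinfo extract earlier:
gap = watermark_gap_pages(9611, 1134236)
print("low  =", 9611 + gap)      # matches the 12013 shown by zoneinfo
print("high =", 9611 + 2 * gap)  # matches the 14415 shown by zoneinfo
```

Plugging in the DMA32 zone's values (min 7250, managed 872457) reproduces its low of 9062 and high of 10874 as well, so on this system the min/4 floor dominates the default 0.1% scale factor.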
proc: meminfo: estimate available memory more conservatively
The MemAvailable item in /proc/meminfo is to give users a hint of how much memory is allocatable without causing swapping, so it excludes the zones' low watermarks as unavailable to userspace.
However, for a userspace allocation, kswapd will actually reclaim until the free pages hit a combination of the high watermark and the page allocator's lowmem protection that keeps a certain amount of DMA and DMA32 memory from userspace as well.
Subtract the full amount we know to be unavailable to userspace from
the number of free pages when calculating MemAvailable.
Linux code
It is sometimes claimed that changing swappiness to 0 will effectively disable "opportunistic swapping". This provides an interesting avenue of investigation. If there is something called "opportunistic swapping", and it can be tuned by swappiness, then we could chase it down by finding all the call-chains that read vm_swappiness. Note we can reduce our search space by assuming CONFIG_MEMCG is not set (i.e. "memory cgroups" are disabled). The call chain goes:
shrink_node_memcg is commented "This is a basic per-node page freer. Used by both kswapd and direct reclaim". I.e. this function increases the number of free pages. It is not trying to duplicate pages to swap so they can be freed at a much later time. But even if we discount that:
The above chain is called from three different functions, shown below. As expected, we can divide the call-sites into direct reclaim vs. kswapd. It would not make sense to perform "opportunistic swapping" in direct reclaim.
-
/*
* This is the direct reclaim path, for page-allocating processes. We only
* try to reclaim pages from zones which will satisfy the caller's allocation
* request.
*
* If a zone is deemed to be full of pinned pages then just give it a light
* scan then give up on it.
*/
static void shrink_zones
-
* kswapd shrinks a node of pages that are at or below the highest usable
* zone that is currently unbalanced.
*
* Returns true if kswapd scanned at least the requested number of pages to
* reclaim or if the lack of progress was due to pages under writeback.
* This is used to determine if the scanning priority needs to be raised.
*/
static bool kswapd_shrink_node
-
* For kswapd, balance_pgdat() will reclaim pages across a node from zones
* that are eligible for use by the caller until at least one zone is
* balanced.
*
* Returns the order kswapd finished reclaiming at.
*
* kswapd scans the zones in the highmem->normal->dma direction. It skips
* zones which have free_pages > high_wmark_pages(zone), but once a zone is
* found to have free_pages <= high_wmark_pages(zone), any page in that zone
* or lower is eligible for reclaim until at least one usable zone is
* balanced.
*/
static int balance_pgdat
So, presumably the claim is that kswapd is woken up somehow, even when all memory allocations are being satisfied immediately from free memory. I looked through the uses of wake_up_interruptible(&pgdat->kswapd_wait), and I am not seeing any wakeups like this.
Best Answer
Swap space is not necessarily used by specific processes.
Files stored on tmpfs-based file systems might be using it (tmpfs first uses RAM as a back-end but, so as not to waste RAM, can page out to the swap area blocks that are not actively used). Check the output of: