shared Memory used (mostly) by tmpfs (Shmem in /proc/meminfo, available on kernels 2.6.32, displayed as zero if not available)
So the manpage definition of Shared is not as helpful as it could be :(. If the tmpfs use does not reflect this high value of Shared, then the value must represent some process(es) "who did mmap() with MAP_SHARED|MAP_ANONYMOUS" (or System V shared memory).
6G of shared memory on an 8G system is still a lot. Seriously, you don't want that, at least not on a desktop.
It's weird that it seems to contribute to "buff/cache" as well. But I did a quick test with python and that's just how it works.
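The quick test mentioned above can be reproduced with something like the following sketch (assumptions: a Linux kernel exposing the Shmem field in /proc/meminfo, and CPython's mmap, which uses MAP_SHARED by default and adds MAP_ANONYMOUS when the fileno is -1):

```python
# Map 64 MiB of MAP_SHARED|MAP_ANONYMOUS memory, touch every page,
# and watch the Shmem counter in /proc/meminfo grow. (Linux only.)
import mmap

def read_shmem_kib():
    """Return the Shmem value from /proc/meminfo, in KiB."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("Shmem:"):
                return int(line.split()[1])
    raise RuntimeError("Shmem not found (kernel too old?)")

SIZE = 64 * 1024 * 1024  # 64 MiB

before = read_shmem_kib()
m = mmap.mmap(-1, SIZE)  # anonymous mapping; MAP_SHARED is the default
m.write(b"x" * SIZE)     # touch every page so it is actually allocated
after = read_shmem_kib()

print(f"Shmem grew by about {(after - before) // 1024} MiB")
m.close()
```

While the mapping is held, the same pages also show up under "buff/cache" in free, because shared anonymous memory is implemented on top of tmpfs internally.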
To show the processes with the most shared memory, use top -o SHR -n 1.
System V shared memory
Finally it's possible you have some horrible legacy software that uses System V shared memory segments. If they get leaked, they won't show up in top :(.
You can list them with ipcs -m -t. Hopefully the most recently created one is still in use. Take the shmid number and e.g.
$ ipcs -m -t

------ Shared Memory Attach/Detach/Change Times --------
shmid      owner      attached             detached             changed
3538944    alan       Apr 30 20:35:15      Apr 30 20:35:15      Apr 30 16:07:41
3145729    alan       Apr 30 20:35:15      Apr 30 20:35:15      Apr 30 15:04:09
4587522    alan       Apr 30 20:37:38      Not set              Apr 30 20:37:38
$ sudo grep 4587522 /proc/*/maps
Then the numbers shown in the /proc paths are the PIDs of the processes using the SHM segment. (So you could e.g. grep the output of ps for that PID.)
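The grep step above works because, in /proc/&lt;pid&gt;/maps output, a System V segment appears as a "/SYSV&lt;key&gt;" pseudo-file whose inode column is the shmid. A hedged sketch of the same matching, over maps-format lines (the sample pids and addresses here are made up for illustration):

```python
# Given maps-file contents per pid, find which pids have a SysV shared
# memory segment with the given shmid attached.
# maps line format: addr perms offset dev inode pathname
def pids_using_shmid(maps_by_pid, shmid):
    """maps_by_pid: dict of pid -> contents of /proc/<pid>/maps as text."""
    pids = []
    for pid, text in maps_by_pid.items():
        for line in text.splitlines():
            fields = line.split()
            if len(fields) >= 6 and fields[4] == str(shmid) and fields[5].startswith("/SYSV"):
                pids.append(pid)
                break
    return pids

sample = {
    1234: "7f0000000000-7f0000400000 rw-s 00000000 00:01 4587522 /SYSV0000000a (deleted)",
    5678: "7f1000000000-7f1000001000 r-xp 00000000 08:01 131072 /usr/lib/libc.so.6",
}
print(pids_using_shmid(sample, 4587522))  # -> [1234]
```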
Apparent contradictions
Xorg has 8G mapped, even though you don't have separate video card RAM. It only has 150M resident. It's not that the rest is swapped out, because you don't have enough swap space.
The SHM segments shown by ipcs are all attached to two processes. So none of them have leaked, and they should all show up in the SHR column of top (double-counted even). It's ok if the number of pages used is less than the size of the memory segment; that just means there are pages that haven't been used. But free says we have 6GB of allocated shared memory to account for, and we can't find that.
Linux does not do "opportunistic swapping" as defined in this question.
The following primary references do not mention the concept at all:
- Understanding the Linux Virtual Memory Manager. An online book by Mel Gorman. Written in 2003, just before the release of Linux 2.6.0.
- Documentation/admin-guide/sysctl/vm.rst. This is the primary documentation of the tunable settings of Linux virtual memory management.
More specifically:
10.6 Pageout Daemon (kswapd)
Historically kswapd used to wake up every 10 seconds but now it is only woken by the physical page allocator when the pages_low number of free pages in a zone is reached. [...] Under extreme memory pressure, processes will do the work of kswapd synchronously. [...] kswapd keeps freeing pages until the pages_high watermark is reached.
Based on the above, we would not expect any swapping when the number of free pages is higher than the "high watermark".
Secondly, this tells us the purpose of kswapd is to make more free pages.
When kswapd writes a memory page to swap, it immediately frees the memory page. kswapd does not keep a copy of the swapped page in memory.
Linux 2.6 uses the "rmap" to free the page. In Linux 2.4, the story was more complex. When a page was shared by multiple processes, kswapd was not able to free it immediately. This is ancient history. All of the linked posts are about Linux 2.6 or above.
swappiness
This control is used to define how aggressive the kernel will swap
memory pages. Higher values will increase aggressiveness, lower values
decrease the amount of swap. A value of 0 instructs the kernel not to
initiate swap until the amount of free and file-backed pages is less
than the high water mark in a zone.
This quote describes a special case: if you configure the swappiness value to be 0. In this case, we should additionally not expect any swapping until the number of cache pages has fallen to the high watermark. In other words, the kernel will try to discard almost all file cache before it starts swapping. (This might cause massive slowdowns. You need to have some file cache! The file cache is used to hold the code of all your running programs :-)
What are the watermarks?
The above quotes raise the question: How large are the "watermark" memory reservations on my system? Answer: on a "small" system, the default zone watermarks might be as high as 3% of memory. This is due to the calculation of the "min" watermark. On larger systems the watermarks will be a smaller proportion, approaching 0.3% of memory.
So if the question is about a system with more than 10% free memory, the exact details of this watermark logic are not significant.
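The scaling of the "min" watermark can be sketched numerically. My assumption here is the calculation in mm/page_alloc.c (init_per_zone_wmark_min), which sets min_free_kbytes to roughly 4 * sqrt(lowmem_kbytes), clamped to a fixed range; the exact clamp values have varied across kernel versions, and the percentages quoted above also fold in per-zone effects on top of this:

```python
# Hypothetical reimplementation of the min_free_kbytes heuristic:
# 4 * sqrt(lowmem in KiB), clamped to [128, 65536] KiB.
import math

def min_free_kbytes(ram_kib):
    v = int(4 * math.sqrt(ram_kib))
    return max(128, min(v, 65536))

for ram_mib in (512, 4096, 65536):
    kib = ram_mib * 1024
    print(f"{ram_mib:6d} MiB RAM -> min_free_kbytes = {min_free_kbytes(kib)}")
```

Because the value grows with the square root of RAM, the reserved fraction shrinks as systems get larger, which is why the watermarks only matter in relative terms on small machines.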
The watermarks for each individual "zone" are shown in /proc/zoneinfo, as documented in proc(5). An extract from my zoneinfo:
Node 0, zone    DMA32
  pages free     304988
        min      7250
        low      9062
        high     10874
        spanned  1044480
        present  888973
        managed  872457
        protection: (0, 0, 4424, 4424, 4424)
...
Node 0, zone   Normal
  pages free     11977
        min      9611
        low      12013
        high     14415
        spanned  1173504
        present  1173504
        managed  1134236
        protection: (0, 0, 0, 0, 0)
The current "watermarks" are min, low, and high. If a program ever asks for enough memory to reduce free below min, the program enters "direct reclaim". The program is made to wait while the kernel frees up memory.
We want to avoid direct reclaim if possible. So if free would dip below the low watermark, the kernel wakes kswapd. kswapd frees memory by swapping and/or dropping caches, until free is above high again.
Additional qualification: kswapd will also run to protect the full lowmem_reserve amount, for kernel lowmem and DMA usage. The default lowmem_reserve is about 1/256 of the first 4GiB of RAM (DMA32 zone), so it is usually around 16MiB.
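The zoneinfo text above can be compared against these rules mechanically. A small parsing sketch, tested here against the Normal zone from the extract rather than a live system (field names as in /proc/zoneinfo on recent kernels):

```python
# Parse zoneinfo-format text into {zone: {counter: pages}}.
def parse_zoneinfo(text):
    zones = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Node"):
            # e.g. "Node 0, zone   Normal"
            current = line.split("zone")[1].strip()
            zones[current] = {}
        elif current is not None:
            parts = line.replace("pages free", "free").split()
            if len(parts) == 2 and parts[1].isdigit():
                zones[current][parts[0]] = int(parts[1])
    return zones

sample = """\
Node 0, zone   Normal
  pages free     11977
        min      9611
        low      12013
        high     14415
"""
zones = parse_zoneinfo(sample)
z = zones["Normal"]
# free (11977) is below the low watermark (12013) but above min (9611):
# kswapd would be woken, but no direct reclaim yet.
print(z["free"] < z["low"], z["free"] > z["min"])
```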
Linux code commits
mm: scale kswapd watermarks in proportion to memory
[...]
watermark_scale_factor:
This factor controls the aggressiveness of kswapd. It defines the
amount of memory left in a node/system before kswapd is woken up and
how much memory needs to be free before kswapd goes back to sleep.
The unit is in fractions of 10,000. The default value of 10 means the
distances between watermarks are 0.1% of the available memory in the
node/system. The maximum value is 1000, or 10% of memory.
A high rate of threads entering direct reclaim (allocstall) or kswapd
going to sleep prematurely (kswapd_low_wmark_hit_quickly) can indicate
that the number of free pages kswapd maintains for latency reasons is
too small for the allocation bursts occurring in the system. This knob
can then be used to tune kswapd aggressiveness accordingly.
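The arithmetic in the quoted text can be sketched as follows. My assumption is that, since this commit, __setup_per_zone_wmarks computes the gap between successive watermarks as the larger of min_wmark/4 and managed_pages * watermark_scale_factor / 10000:

```python
# Gap between successive zone watermarks (assumed formula, see lead-in).
def watermark_gap_pages(min_wmark, managed_pages, watermark_scale_factor=10):
    return max(min_wmark >> 2, managed_pages * watermark_scale_factor // 10000)

# Values from the Normal zone in the /proc/zoneinfo extract earlier:
gap = watermark_gap_pages(9611, 1134236)
print("low  =", 9611 + gap)      # matches the 12013 shown by zoneinfo
print("high =", 9611 + 2 * gap)  # matches the 14415 shown by zoneinfo
```

Plugging in the DMA32 zone's values (min 7250, managed 872457) reproduces its low of 9062 and high of 10874 as well, so on this system the min/4 floor dominates the default 0.1% scale factor.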
proc: meminfo: estimate available memory more conservatively
The MemAvailable item in /proc/meminfo is to give users a hint of how much memory is allocatable without causing swapping, so it excludes the zones' low watermarks as unavailable to userspace.
However, for a userspace allocation, kswapd will actually reclaim until the free pages hit a combination of the high watermark and the page allocator's lowmem protection that keeps a certain amount of DMA and DMA32 memory from userspace as well.
Subtract the full amount we know to be unavailable to userspace from
the number of free pages when calculating MemAvailable.
Linux code
It is sometimes claimed that changing swappiness to 0 will effectively disable "opportunistic swapping". This provides an interesting avenue of investigation. If there is something called "opportunistic swapping", and it can be tuned by swappiness, then we could chase it down by finding all the call-chains that read vm_swappiness. Note we can reduce our search space by assuming CONFIG_MEMCG is not set (i.e. "memory cgroups" are disabled). The call chain goes:
shrink_node_memcg is commented "This is a basic per-node page freer. Used by both kswapd and direct reclaim". I.e. this function increases the number of free pages. It is not trying to duplicate pages to swap so they can be freed at a much later time. But even if we discount that:
The above chain is called from three different functions, shown below. As expected, we can divide the call-sites into direct reclaim vs. kswapd. It would not make sense to perform "opportunistic swapping" in direct reclaim.
-
/*
* This is the direct reclaim path, for page-allocating processes. We only
* try to reclaim pages from zones which will satisfy the caller's allocation
* request.
*
* If a zone is deemed to be full of pinned pages then just give it a light
* scan then give up on it.
*/
static void shrink_zones
-
* kswapd shrinks a node of pages that are at or below the highest usable
* zone that is currently unbalanced.
*
* Returns true if kswapd scanned at least the requested number of pages to
* reclaim or if the lack of progress was due to pages under writeback.
* This is used to determine if the scanning priority needs to be raised.
*/
static bool kswapd_shrink_node
-
* For kswapd, balance_pgdat() will reclaim pages across a node from zones
* that are eligible for use by the caller until at least one zone is
* balanced.
*
* Returns the order kswapd finished reclaiming at.
*
* kswapd scans the zones in the highmem->normal->dma direction. It skips
* zones which have free_pages > high_wmark_pages(zone), but once a zone is
* found to have free_pages <= high_wmark_pages(zone), any page in that zone
* or lower is eligible for reclaim until at least one usable zone is
* balanced.
*/
static int balance_pgdat
So, presumably the claim is that kswapd is woken up somehow, even when all memory allocations are being satisfied immediately from free memory. I looked through the uses of wake_up_interruptible(&pgdat->kswapd_wait), and I am not seeing any wakeups like this.
Best Answer
Swap space is not necessarily used by specific processes.
Files stored on tmpfs-based file systems might be using it (tmpfs first uses RAM as a back-end but, so as not to waste RAM, can page out to the swap area blocks that are not actively used). Check the output of: