Judging from another of your posts, I guess you are using zram, so that will be my assumption here.
I ran an experiment: I installed zram, consumed a lot of memory, and got the same smem output as you. smem does not take zram into account; it only uses /proc/meminfo to compute its values. If you read through the code, you will see that the RAM occupied by zram ends up counted under the noncache column of the kernel dynamic memory line.
Further investigations
Following my gut feeling that zram was behind this behaviour, I set up a VM with specs similar to your machine: 4 GB RAM and 2 GB zram swap, no swap file.
I loaded the VM with heavyweight applications and got the following state:
huygens@ubuntu:~$ smem -wt -K ~/vmlinuz-3.2.0-38-generic.unpacked -R 4096M
Area                           Used      Cache   Noncache
firmware/hardware            130717          0     130717
kernel image                  13951          0      13951
kernel dynamic memory       1063520     922172     141348
userspace memory            2534684     257136    2277548
free memory                  451432     451432          0
----------------------------------------------------------
                            4194304    1630740    2563564
huygens@ubuntu:~$ free -m
             total       used       free     shared    buffers     cached
Mem:          3954       3528        426          0         79        858
-/+ buffers/cache:       2589       1365
Swap:         1977          0       1977
As you can see, free reports 858 MB of cache memory, and that is also what smem seems to report in the cache column of kernel dynamic memory.
Then I stressed the system further using Chromium Browser. At the beginning, only 83 MB of swap was in use. But after a few more tabs were opened, swap usage quickly climbed to almost its maximum and I experienced OOM! zram really has a dangerous side: when wrongly configured (sizes too big), it can quickly hit you back like a trebuchet.
At that time I had the following outputs:
huygens@ubuntu:~$ smem -wt -K ~/vmlinuz-3.2.0-38-generic.unpacked -R 4096M
Area                           Used      Cache   Noncache
firmware/hardware            130717          0     130717
kernel image                  13951          0      13951
kernel dynamic memory       1355344     124072    1231272
userspace memory             961004      36456     924548
free memory                 1733288    1733288          0
----------------------------------------------------------
                            4194304    1893816    2300488
huygens@ubuntu:~$ free -m
             total       used       free     shared    buffers     cached
Mem:          3954       2256       1698          0          4        132
-/+ buffers/cache:       2118       1835
Swap:         1977       1750        227
See how the kernel dynamic memory columns (cache and noncache) look inverted? In the first case, the kernel had "cached" memory, as reported by free. In the second case, it had swap memory held by zram, which smem does not know how to account for. Check the smem source code: zram occupation is not reported in /proc/meminfo, thus it is not computed by smem, which simply calculates "total kernel mem" minus "the types of memory reported by meminfo that I know are cache". What it does not know is that the computed total kernel memory includes the part of swap that lives in RAM!
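To make that subtraction concrete, here is a minimal Python sketch of the kind of arithmetic described above. It is my reading of the approach, not a copy of smem's code: the choice of /proc/meminfo fields (Buffers, SReclaimable, Cached, Mapped, AnonPages) is an assumption, and the snapshot values are hypothetical round numbers in kB.

```python
def kernel_dynamic_memory(meminfo):
    """Return (used, cache, noncache) in kB for a 'kernel dynamic memory' row.

    Assumed model: everything that is neither free nor userspace is "kernel",
    and only a fixed list of meminfo fields is recognised as cache.
    """
    userspace = meminfo["AnonPages"] + meminfo["Mapped"]
    used = meminfo["MemTotal"] - meminfo["MemFree"] - userspace
    cache = (meminfo["Buffers"] + meminfo["SReclaimable"]
             + meminfo["Cached"] - meminfo["Mapped"])
    # Anything not recognised as cache, zram pages included, lands here.
    return used, cache, used - cache


# Hypothetical snapshot (kB) of a machine whose zram device is nearly full.
snapshot = {
    "MemTotal": 4000000,
    "MemFree": 1700000,
    "AnonPages": 900000,
    "Mapped": 40000,
    "Buffers": 4000,
    "Cached": 130000,
    "SReclaimable": 30000,
}

used, cache, noncache = kernel_dynamic_memory(snapshot)
print("used=%d cache=%d noncache=%d" % (used, cache, noncache))
```

Nothing in the inputs mentions zram, so whatever RAM the zram device holds can only show up in the noncache remainder.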
When I was in this state, I activated a hard disk swap, turned off the zram swap and reset the zram device: echo 1 > /sys/block/zram0/reset.
After that, the noncache kernel memory melted like snow in summer and returned to a "normal" value.
Conclusion
smem does not know about zram (yet), maybe because zram is still in staging and thus not part of /proc/meminfo, which reports global parameters (like (in)active page sizes and total memory) and otherwise only a few specific parameters. smem identifies a few of these specific parameters as "cache", sums them up and compares that to total memory. Because of that, memory used by zram gets counted in the noncache column.
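If you want to correct the noncache figure by hand, the zram driver publishes its own usage in sysfs. The sketch below assumes a newer kernel than the 3.2 one used above, where /sys/block/zram0/mm_stat exists as a single line of counters (older kernels expose separate files instead). The sample line is made up for illustration.

```python
# Field order of /sys/block/zramN/mm_stat as documented for the zram driver.
MM_STAT_FIELDS = (
    "orig_data_size", "compr_data_size", "mem_used_total", "mem_limit",
    "mem_used_max", "same_pages", "pages_compacted", "huge_pages",
)


def parse_mm_stat(text):
    """Turn one mm_stat line into a dict of integers (sizes are in bytes)."""
    values = [int(v) for v in text.split()]
    return dict(zip(MM_STAT_FIELDS, values))


def zram_used_kb(text):
    """RAM actually held by the zram device, in kB."""
    return parse_mm_stat(text)["mem_used_total"] // 1024


# Hypothetical contents of /sys/block/zram0/mm_stat.
sample = "1835008000 1126170624 1153433600 0 1153433600 1024 0 0"
print(zram_used_kb(sample), "kB held by zram")
```

On a live system you would read the file with open("/sys/block/zram0/mm_stat").read() and subtract the result from the noncache kernel figure.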
Note: by the way, in modern kernels, meminfo also reports the shared memory consumed. smem does not take that into account yet, so even without zram the output of smem should be interpreted carefully, especially if you use applications that make heavy use of shared memory.
- What is the difference between "buffers" and the other type of cache?
- Why is this distinction made so prominent? Why do some people say "buffer cache" when they talk about cached file content?
- What are Buffers used for?
- Why might we expect Buffers in particular to be larger or smaller?
1. What is the difference between "buffers" and the other type of cache?
Buffers shows the amount of page cache used for block devices. "Block devices" are the most common type of data storage device.
The kernel has to deliberately subtract this amount from the rest of the page cache when it reports Cached. See meminfo_proc_show():
cached = global_node_page_state(NR_FILE_PAGES) -
total_swapcache_pages() - i.bufferram;
...
show_val_kb(m, "MemTotal: ", i.totalram);
show_val_kb(m, "MemFree: ", i.freeram);
show_val_kb(m, "MemAvailable: ", available);
show_val_kb(m, "Buffers: ", i.bufferram);
show_val_kb(m, "Cached: ", cached);
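The subtraction above can be undone from userspace. As a small sketch (the helper names and the sample values are mine, not part of any tool), adding Buffers and SwapCached back to Cached recovers the total number of file pages the kernel started from:

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: value}, values as printed (mostly kB)."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])
    return fields


def file_pages_kb(fields):
    """Invert 'cached = NR_FILE_PAGES - swapcache - bufferram' from the excerpt."""
    return fields["Cached"] + fields.get("SwapCached", 0) + fields["Buffers"]


# Hypothetical excerpt; on a real system use open("/proc/meminfo").read().
sample = """\
MemTotal:        8043520 kB
MemFree:         4424704 kB
Buffers:           67584 kB
Cached:           969728 kB
SwapCached:            0 kB
"""

info = parse_meminfo(sample)
print("Buffers:", info["Buffers"], "kB; all file pages:", file_pages_kb(info), "kB")
```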
2. Why is this distinction made so prominent? Why do some people say "buffer cache" when they talk about cached file content?
The page cache works in units of the MMU page size, typically a minimum of 4096 bytes. This is essential for mmap(), i.e. memory-mapped file access.[1][2] It is designed to share pages of loaded program / library code between separate processes, and allow loading individual pages on demand. (It also allows unloading pages when something else needs the space and they haven't been used recently.)
[1] Memory-mapped I/O - The GNU C Library manual.
[2] mmap - Wikipedia.
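To illustrate the page granularity, here is a minimal sketch using Python's standard mmap module. The helper names and the temporary file are only for the demonstration. It maps exactly one page from the middle of a file, which is the unit the page cache works in.

```python
import mmap
import os
import tempfile


def read_page(path, page_index, nbytes=16):
    """Map a single page of the file and return its first nbytes."""
    # Offsets passed to mmap must be a multiple of the allocation granularity
    # (equal to the page size on Linux).
    offset = page_index * mmap.ALLOCATIONGRANULARITY
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), mmap.PAGESIZE,
                       access=mmap.ACCESS_READ, offset=offset) as view:
            return view[:nbytes]


def demo():
    """Write one chunk of 'A' and one page of 'B', then map only the second part."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"A" * mmap.ALLOCATIONGRANULARITY)
        tmp.write(b"B" * mmap.PAGESIZE)
        path = tmp.name
    try:
        return read_page(path, 1, 4)
    finally:
        os.unlink(path)


print(mmap.PAGESIZE, demo())
```

Only the mapped page needs to be resident in the page cache for this read; the first part of the file is never touched through the mapping.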
Early UNIX had a "buffer cache" of disk blocks, and did not have mmap(). Apparently when mmap() was first added, they added the page cache as a new layer on top. This is as messy as it sounds. Eventually, UNIX-based OS's got rid of the separate buffer cache. So now all file cache is in units of pages. Pages are looked up by (file, offset), not by location on disk. This was called "unified buffer cache", perhaps because people were more familiar with "buffer cache".[3]
[3] UBC: An Efficient Unified I/O and Memory Caching Subsystem for NetBSD
("One interesting twist that Linux adds is that the device block numbers where a page is stored on disk are cached with the page in the form of a list of buffer_head structures. When a modified page is to be written back to disk, the I/O requests can be sent to the device driver right away, without needing to read any indirect blocks to determine where the page's data should be written."[3])
In Linux 2.2 there was a separate "buffer cache" used for writes, but not for reads. "The page cache used the buffer cache to write back its data, needing an extra copy of the data, and doubling memory requirements for some write loads".[4] Let's not worry too much about the details, but this history would be one reason why Linux reports Buffers usage separately.
[4] Page replacement in Linux 2.4 memory management, Rik van Riel.
By contrast, in Linux 2.4 and above, the extra copy does not exist. "The system does disk IO directly to and from the page cache page."[4] Linux 2.4 was released in 2001.
3. What are Buffers used for?
Block devices are treated as files, and so have page cache. This is used "for filesystem metadata and the caching of raw block devices".[4] But in current versions of Linux, filesystems do not copy file contents through it, so there is no "double caching".
I think of the Buffers part of the page cache as being the Linux buffer cache. Some sources might disagree with this terminology.
How much buffer cache the filesystem uses, if any, depends on the type of filesystem. The system in the question uses ext4. ext3/ext4 use the Linux buffer cache for the journal, for directory contents, and some other metadata.
Certain file systems, including ext3, ext4, and ocfs2, use the jbd or
jbd2 layer to handle their physical block journalling, and this layer
fundamentally uses the buffer cache.
-- Email article by Ted Tso, 2013
Prior to Linux kernel version 2.4, Linux had separate page and buffer caches. Since 2.4, the page and buffer cache are unified and Buffers is raw disk blocks not represented in the page cache, i.e., not file data.
...
The buffer cache remains, however, as the kernel still needs to perform block I/O in terms of blocks, not pages. As most blocks represent file data, most of the buffer cache is represented by the page cache. But a small amount of block data isn't file backed—metadata and raw block I/O for example—and thus is solely represented by the buffer cache.
-- A pair of Quora answers by Robert Love, last updated 2013.
Both writers are Linux developers who have worked with Linux kernel memory management. The first source is more specific about technical details. The second source is a more general summary, which might be contradicted and outdated in some specifics.
It is true that filesystems may perform partial-page metadata writes, even though the cache is indexed in pages. Even user processes can perform partial-page writes when they use write() (as opposed to mmap()), at least directly to a block device. This only applies to writes, not reads. When you read through the page cache, the page cache always reads full pages.
Linus liked to rant that the buffer cache is not required in order to do block-sized writes, and that filesystems can do partial-page metadata writes even with page cache attached to their own files instead of the block device. I am sure he is right to say that ext2 does this. ext3/ext4 with its journalling system does not. It is less clear what the issues were that led to this design. The people he was ranting at got tired of explaining.
ext4_readdir() has not been changed to satisfy Linus' rant. I don't see his desired approach used in readdir() of other filesystems either. I think XFS uses the buffer cache for directories as well. bcachefs does not use the page cache for readdir() at all; it uses its own cache for btrees. I'm not sure about btrfs.
4. Why might we expect Buffers in particular to be larger or smaller?
In this case it turns out the ext4 journal size for my filesystem is 128M. So this explains why 1) my buffer cache can stabilize at slightly over 128M; 2) buffer cache does not scale proportionally with the larger amount of RAM on my laptop.
For some other possible causes, see What is the buffers column in the output from free? Note that "buffers" reported by free is actually a combination of Buffers and reclaimable kernel slab memory.
To verify that journal writes use the buffer cache, I simulated a filesystem in nice fast RAM (tmpfs), and compared the maximum buffer usage for different journal sizes.
# dd if=/dev/zero of=/tmp/t bs=1M count=1000
...
# mkfs.ext4 /tmp/t -J size=256
...
# LANG=C dumpe2fs /tmp/t | grep '^Journal size'
dumpe2fs 1.43.5 (04-Aug-2017)
Journal size: 256M
# mount /tmp/t /mnt
# cd /mnt
# free -w -m
total used free shared buffers cache available
Mem: 7855 2521 4321 285 66 947 5105
Swap: 7995 0 7995
# for i in $(seq 40000); do dd if=/dev/zero of=t bs=1k count=1 conv=sync status=none; sync t; sync -f t; done
# free -w -m
total used free shared buffers cache available
Mem: 7855 2523 3872 551 237 1223 4835
Swap: 7995 0 7995
# dd if=/dev/zero of=/tmp/t bs=1M count=1000
...
# mkfs.ext4 /tmp/t -J size=16
...
# LANG=C dumpe2fs /tmp/t | grep '^Journal size'
dumpe2fs 1.43.5 (04-Aug-2017)
Journal size: 16M
# mount /tmp/t /mnt
# cd /mnt
# free -w -m
total used free shared buffers cache available
Mem: 7855 2507 4337 285 66 943 5118
Swap: 7995 0 7995
# for i in $(seq 40000); do dd if=/dev/zero of=t bs=1k count=1 conv=sync status=none; sync t; sync -f t; done
# free -w -m
total used free shared buffers cache available
Mem: 7855 2509 4290 315 77 977 5086
Swap: 7995 0 7995
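To read the result off the four free outputs above, here is a small sketch that extracts the buffers column and prints the growth for each journal size. The helper name is mine; the numbers are copied from the transcripts.

```python
def buffers_mb(free_output):
    """Return the 'buffers' value from `free -w -m` output (header line + Mem: line)."""
    lines = free_output.strip().splitlines()
    header = lines[0].split()
    mem = lines[1].split()[1:]  # drop the leading "Mem:" label
    return int(mem[header.index("buffers")])


HEADER = "total used free shared buffers cache available\n"

# (before, after) pairs of Mem: lines, keyed by journal size in MB.
runs = {
    256: ("Mem: 7855 2521 4321 285 66 947 5105",
          "Mem: 7855 2523 3872 551 237 1223 4835"),
    16: ("Mem: 7855 2507 4337 285 66 943 5118",
         "Mem: 7855 2509 4290 315 77 977 5086"),
}

growth = {
    size: buffers_mb(HEADER + after) - buffers_mb(HEADER + before)
    for size, (before, after) in runs.items()
}

for size, delta in growth.items():
    print("journal %3dM: buffers grew by %d MB" % (size, delta))
```

With the same write workload, buffers grew by 171 MB with the 256M journal but only by 11 MB with the 16M journal, which is consistent with journal writes going through the buffer cache.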
History of this answer: How I came to look at the journal
I had found Ted Tso's email first, and was intrigued that it emphasized write caching. I would find it surprising if "dirty", unwritten data was able to reach 30% of RAM on my system. sudo atop shows that over a 10 second interval, the system in question consistently writes only 1MB. The filesystem concerned would be able to keep up with over 100 times this rate. (It's on a USB2 hard disk drive, max throughput ~20MB/s).
Using blktrace (btrace -w 10 /dev/sda) confirms that the IOs which are being cached must be writes, because there is almost no data being read. It also confirms that mysqld is the only userspace process doing IO.
I stopped the service responsible for the writes (icinga2 writing to mysql) and re-checked. I saw "buffers" drop to under 20M - I have no explanation for that - and stay there. Restarting the writer again shows "buffers" rising by ~0.1M for each 10 second interval. I observed it maintain this rate consistently, climbing back to 70M and above.
Running echo 3 | sudo tee /proc/sys/vm/drop_caches was sufficient to lower "buffers" again, to 4.5M. This proves that my accumulation of buffers is a "clean" cache, which Linux can drop immediately when required. This system is not accumulating unwritten data. (drop_caches does not perform any writeback and hence cannot drop dirty pages. If you wanted to run a test which cleaned the cache first, you would use the sync command).
The entire mysql directory is only 150M. The accumulating buffers must represent metadata blocks from mysql writes, but it surprised me to think there would be so many metadata blocks for this data.
Best Answer
That is usually what happens, not because there is an explicit preference, but because their access count is usually low. The memory subsystem maps disk blocks, physical memory and virtual addresses in a process address space to each other, and the only difference between a buffer or cache page and a process allocation is whether there is a process mapping.
Whenever there is memory pressure, the system evicts the memory pages where the last access was the longest ago, starting with those that have up-to-date copies on disk, then moving on to those where the disk mapping exists and can just be written, and then finally it starts creating new disk mappings by allocating swap space.
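As a toy model of that ordering (not the kernel's actual algorithm, which works with LRU lists rather than a global sort), the sketch below only considers pages that have been idle for a while, then orders them by how cheap they are to evict and by age. The Page type, the idle threshold and all values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Page:
    name: str
    last_access: float      # timestamp of the most recent access
    has_disk_mapping: bool  # a file block or swap slot already backs this page
    dirty: bool             # in-memory contents differ from the disk copy


def eviction_class(page):
    """0: clean copy on disk, 1: needs writing to its existing mapping, 2: needs new swap."""
    if page.has_disk_mapping and not page.dirty:
        return 0
    if page.has_disk_mapping:
        return 1
    return 2


def eviction_order(pages, now, min_idle):
    """Pages idle for at least min_idle, cheapest class first, oldest first within a class."""
    idle = [p for p in pages if now - p.last_access >= min_idle]
    return sorted(idle, key=lambda p: (eviction_class(p), p.last_access))


pages = [
    Page("clean_old", last_access=10, has_disk_mapping=True, dirty=False),
    Page("clean_hot", last_access=95, has_disk_mapping=True, dirty=False),
    Page("dirty_file", last_access=20, has_disk_mapping=True, dirty=True),
    Page("anon_idle", last_access=5, has_disk_mapping=False, dirty=True),
]

order = [p.name for p in eviction_order(pages, now=100, min_idle=30)]
print(order)
```

In this model the actively used cache page is never a candidate, while the idle process page is, even though it is the most expensive one to evict.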
In this system, it is beneficial to create swap mappings in advance, before memory ever gets tight. When the system is otherwise idle, it can then copy some pages that haven't been accessed for a while to disk but leave them in memory as well.
This is fundamentally the same as a cache page for a disk block though, except for the mapping that could cause the page to be accessed from a process, resetting the eviction timer. If the process is sleeping and has no work to do, removing this page rather than a cache page that is actively used is usually the better choice.
A lot of cache pages are accessed only once or twice, so they are good candidates for eviction most of the time without requiring special status.