Linux – 30% of RAM is “buffers”. What is it?

Tags: cache, linux, memory

How can I describe or explain "buffers" in the output of free?

$ free -h
              total        used        free      shared  buff/cache   available
Mem:           501M        146M         19M        9.7M        335M        331M
Swap:          1.0G         85M        938M

$ free -w -h
              total        used        free      shared     buffers       cache   available
Mem:           501M        146M         19M        9.7M        155M        180M        331M
Swap:          1.0G         85M        938M

I don't have any (known) problem with this system. I am just surprised and curious to see that "buffers" is almost as high as "cache" (155M vs. 180M). I thought "cache" represented the page cache of file contents, and tended to be the most significant part of "cache/buffers". I'm not sure what "buffers" are though.

For example, I compared this to my laptop, which has more RAM. On my laptop, the "buffers" figure is an order of magnitude smaller than "cache" (200M vs. 4G). If I understood what "buffers" were, then I could start to look at why they grew to such a large proportion on the smaller system.
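
(free derives these columns from fields in /proc/meminfo. To pull out just the fields involved; the values match the full dump further down:)

$ awk '/^(Buffers|Cached|SReclaimable):/' /proc/meminfo
Buffers:          159220 kB
Cached:           155536 kB
SReclaimable:      29036 kB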

From man proc (I ignore the hilariously outdated definition of "large"):

Buffers %lu

Relatively temporary storage for raw disk blocks that shouldn't get tremendously large (20MB or so).

Cached %lu

In-memory cache for files read from the disk (the page cache). Doesn't include SwapCached.


$ free -V
free from procps-ng 3.3.12

$ uname -r  # the Linux kernel version
4.9.0-6-marvell

$ systemd-detect-virt  # this is not inside a virtual machine
none

$ cat /proc/meminfo
MemTotal:         513976 kB
MemFree:           20100 kB
MemAvailable:     339304 kB
Buffers:          159220 kB
Cached:           155536 kB
SwapCached:         2420 kB
Active:           215044 kB
Inactive:         216760 kB
Active(anon):      56556 kB
Inactive(anon):    73280 kB
Active(file):     158488 kB
Inactive(file):   143480 kB
Unevictable:       10760 kB
Mlocked:           10760 kB
HighTotal:             0 kB
HighFree:              0 kB
LowTotal:         513976 kB
LowFree:           20100 kB
SwapTotal:       1048572 kB
SwapFree:         960532 kB
Dirty:               240 kB
Writeback:             0 kB
AnonPages:        126912 kB
Mapped:            40312 kB
Shmem:              9916 kB
Slab:              37580 kB
SReclaimable:      29036 kB
SUnreclaim:         8544 kB
KernelStack:        1472 kB
PageTables:         3108 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     1305560 kB
Committed_AS:    1155244 kB
VmallocTotal:     507904 kB
VmallocUsed:           0 kB
VmallocChunk:          0 kB

$ sudo slabtop --once
 Active / Total Objects (% used)    : 186139 / 212611 (87.5%)
 Active / Total Slabs (% used)      : 9115 / 9115 (100.0%)
 Active / Total Caches (% used)     : 66 / 92 (71.7%)
 Active / Total Size (% used)       : 31838.34K / 35031.49K (90.9%)
 Minimum / Average / Maximum Object : 0.02K / 0.16K / 4096.00K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME                   
 59968  57222   0%    0.06K    937       64      3748K buffer_head            
 29010  21923   0%    0.13K    967       30      3868K dentry                 
 24306  23842   0%    0.58K   4051        6     16204K ext4_inode_cache       
 22072  20576   0%    0.03K    178      124       712K kmalloc-32             
 10290   9756   0%    0.09K    245       42       980K kmalloc-96             
  9152   4582   0%    0.06K    143       64       572K kmalloc-node           
  9027   8914   0%    0.08K    177       51       708K kernfs_node_cache      
  7007   3830   0%    0.30K    539       13      2156K radix_tree_node        
  5952   4466   0%    0.03K     48      124       192K jbd2_revoke_record_s   
  5889   5870   0%    0.30K    453       13      1812K inode_cache            
  5705   4479   0%    0.02K     35      163       140K file_lock_ctx          
  3844   3464   0%    0.03K     31      124       124K anon_vma               
  3280   3032   0%    0.25K    205       16       820K kmalloc-256            
  2730   2720   0%    0.10K     70       39       280K btrfs_trans_handle     
  2025   1749   0%    0.16K     81       25       324K filp                   
  1952   1844   0%    0.12K     61       32       244K kmalloc-128            
  1826    532   0%    0.05K     22       83        88K trace_event_file       
  1392   1384   0%    0.33K    116       12       464K proc_inode_cache       
  1067   1050   0%    0.34K     97       11       388K shmem_inode_cache      
   987    768   0%    0.19K     47       21       188K kmalloc-192            
   848    757   0%    0.50K    106        8       424K kmalloc-512            
   450    448   0%    0.38K     45       10       180K ubifs_inode_slab       
   297    200   0%    0.04K      3       99        12K eventpoll_pwq          
   288    288 100%    1.00K     72        4       288K kmalloc-1024           
   288    288 100%    0.22K     16       18        64K mnt_cache              
   287    283   0%    1.05K     41        7       328K idr_layer_cache        
   240      8   0%    0.02K      1      240         4K fscrypt_info           

Best Answer

  1. What is the difference between "buffers" and the other type of cache?
  2. Why is this distinction so prominent? Why do some people say "buffer cache" when they talk about cached file content?
  3. What are Buffers used for?
  4. Why might we expect Buffers in particular to be larger or smaller?

1. What is the difference between "buffers" and the other type of cache?

Buffers shows the amount of page cache used for block devices. "Block devices" are the most common type of data storage device.

The kernel has to deliberately subtract this amount from the rest of the page cache when it reports Cached. See meminfo_proc_show() in fs/proc/meminfo.c:

cached = global_node_page_state(NR_FILE_PAGES) -
         total_swapcache_pages() - i.bufferram;
...

show_val_kb(m, "MemTotal:       ", i.totalram);
show_val_kb(m, "MemFree:        ", i.freeram);
show_val_kb(m, "MemAvailable:   ", available);
show_val_kb(m, "Buffers:        ", i.bufferram);
show_val_kb(m, "Cached:         ", cached);
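
As a rough check (a sketch: it assumes a readable disk at /dev/sda and an otherwise quiet system), reading raw blocks from a block device should increase Buffers while leaving Cached roughly unchanged:

# grep Buffers /proc/meminfo                   # note the starting value
# dd if=/dev/sda of=/dev/null bs=1M count=100  # read 100M of raw disk blocks
# grep Buffers /proc/meminfo                   # expect roughly 100M more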

2. Why is this distinction made so prominent? Why do some people say "buffer cache" when they talk about cached file content?

The page cache works in units of the MMU page size, typically a minimum of 4096 bytes. This is essential for mmap(), i.e. memory-mapped file access.[1][2] It is designed to share pages of loaded program / library code between separate processes, and to allow loading individual pages on demand (and likewise unloading pages when something else needs the space and they haven't been used recently).

[1] Memory-mapped I/O - The GNU C Library manual.
[2] mmap - Wikipedia.
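
(You can check the MMU page size on a given system; 4096 is the usual answer on x86 and most ARM machines:)

$ getconf PAGE_SIZE
4096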

Early UNIX had a "buffer cache" of disk blocks, and did not have mmap(). Apparently when mmap() was first added, the page cache was bolted on top as a new layer. This is as messy as it sounds. Eventually, UNIX-based OSes got rid of the separate buffer cache, so now all file cache is in units of pages. Pages are looked up by (file, offset), not by location on disk. This was called the "unified buffer cache", perhaps because people were more familiar with the term "buffer cache".[3]

[3] UBC: An Efficient Unified I/O and Memory Caching Subsystem for NetBSD

("One interesting twist that Linux adds is that the device block numbers where a page is stored on disk are cached with the page in the form of a list of buffer_head structures. When a modified page is to be written back to disk, the I/O requests can be sent to the device driver right away, without needing to read any indirect blocks to determine where the page's data should be written."[3])

In Linux 2.2 there was a separate "buffer cache" used for writes, but not for reads. "The page cache used the buffer cache to write back its data, needing an extra copy of the data, and doubling memory requirements for some write loads".[4] Let's not worry too much about the details, but this history would be one reason why Linux reports Buffers usage separately.

[4] Page replacement in Linux 2.4 memory management, Rik van Riel.

By contrast, in Linux 2.4 and above, the extra copy does not exist. "The system does disk IO directly to and from the page cache page."[4] Linux 2.4 was released in 2001.

3. What are Buffers used for?

Block devices are treated as files, and so have page cache. This is used "for filesystem metadata and the caching of raw block devices".[4] But in current versions of Linux, filesystems do not copy file contents through it, so there is no "double caching".

I think of the Buffers part of the page cache as being the Linux buffer cache. Some sources might disagree with this terminology.

How much buffer cache the filesystem uses, if any, depends on the type of filesystem. The system in the question uses ext4. ext3/ext4 use the Linux buffer cache for the journal, for directory contents, and some other metadata.

Certain file systems, including ext3, ext4, and ocfs2, use the jbd or jbd2 layer to handle their physical block journalling, and this layer fundamentally uses the buffer cache.

-- Email from Ted Ts'o, 2013

Prior to Linux kernel version 2.4, Linux had separate page and buffer caches. Since 2.4, the page and buffer cache are unified and Buffers is raw disk blocks not represented in the page cache—i.e., not file data.

...

The buffer cache remains, however, as the kernel still needs to perform block I/O in terms of blocks, not pages. As most blocks represent file data, most of the buffer cache is represented by the page cache. But a small amount of block data isn't file backed—metadata and raw block I/O for example—and thus is solely represented by the buffer cache.

-- A pair of Quora answers by Robert Love, last updated 2013.

Both writers are Linux developers who have worked with Linux kernel memory management. The first source is more specific about the technical details. The second source is a more general summary, which may be outdated or contradicted in some specifics.

It is true that filesystems may perform partial-page metadata writes, even though the cache is indexed in pages. Even user processes can perform partial-page writes when they use write() (as opposed to mmap()), at least directly to a block device. This only applies to writes, not reads. When you read through the page cache, the page cache always reads full pages.
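
For example, here is a sketch of a single 512-byte write() going directly to a block device, using a scratch loop device rather than a real disk (the file path and loop device name are illustrative):

# fallocate -l 1M /tmp/blk.img
# losetup --find --show /tmp/blk.img             # prints the device it allocated
/dev/loop0
# dd if=/dev/zero of=/dev/loop0 bs=512 count=1   # one sub-page write()
# losetup -d /dev/loop0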

Linus liked to rant that the buffer cache is not required in order to do block-sized writes, and that filesystems can do partial-page metadata writes even with the page cache attached to their own files instead of the block device. I am sure he is right that ext2 does this. ext3/ext4, with their journalling system, do not. It is less clear what issues led to that design; the people he was ranting at got tired of explaining.

ext4_readdir() has not been changed to satisfy Linus' rant. I don't see his desired approach used in readdir() of other filesystems either. I think XFS uses the buffer cache for directories as well. bcachefs does not use the page cache for readdir() at all; it uses its own cache for btrees. I'm not sure about btrfs.

4. Why might we expect Buffers in particular to be larger or smaller?

In this case it turns out the ext4 journal size for my filesystem is 128M. This explains why 1) my buffer cache can stabilize at slightly over 128M, and 2) the buffer cache does not scale proportionally with the larger amount of RAM on my laptop.
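
(To check this on your own ext4 filesystem, substitute your device; this is the same dumpe2fs incantation used in the experiment below:)

# LANG=C dumpe2fs /dev/sda1 | grep '^Journal size'   # device name is an example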

For some other possible causes, see What is the buffers column in the output from free? Note that "buffers" reported by free is actually a combination of Buffers and reclaimable kernel slab memory.


To verify that journal writes use the buffer cache, I created a filesystem backed by a file in nice fast RAM (tmpfs), and compared the maximum buffer usage for different journal sizes.

# dd if=/dev/zero of=/tmp/t bs=1M count=1000
...
# mkfs.ext4 /tmp/t -J size=256
...
# LANG=C dumpe2fs /tmp/t | grep '^Journal size'
dumpe2fs 1.43.5 (04-Aug-2017)
Journal size:             256M
# mount /tmp/t /mnt
# cd /mnt
# free -w -m
              total        used        free      shared     buffers       cache   available
Mem:           7855        2521        4321         285          66         947        5105
Swap:          7995           0        7995

# for i in $(seq 40000); do dd if=/dev/zero of=t bs=1k count=1 conv=sync status=none; sync t; sync -f t; done
# free -w -m
              total        used        free      shared     buffers       cache   available
Mem:           7855        2523        3872         551         237        1223        4835
Swap:          7995           0        7995

# dd if=/dev/zero of=/tmp/t bs=1M count=1000
...
# mkfs.ext4 /tmp/t -J size=16
...
# LANG=C dumpe2fs /tmp/t | grep '^Journal size'
dumpe2fs 1.43.5 (04-Aug-2017)
Journal size:             16M
# mount /tmp/t /mnt
# cd /mnt
# free -w -m
              total        used        free      shared     buffers       cache   available
Mem:           7855        2507        4337         285          66         943        5118
Swap:          7995           0        7995

# for i in $(seq 40000); do dd if=/dev/zero of=t bs=1k count=1 conv=sync status=none; sync t; sync -f t; done
# free -w -m
              total        used        free      shared     buffers       cache   available
Mem:           7855        2509        4290         315          77         977        5086
Swap:          7995           0        7995
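
Comparing the two runs: with the 256M journal, "buffers" grew from 66M to 237M (~171M); with the 16M journal, from 66M to only 77M (~11M). In both cases the growth stayed within the journal size.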

History of this answer: How I came to look at the journal

I had found Ted Ts'o's email first, and was intrigued that it emphasized write caching. I would find it surprising if "dirty", unwritten data could reach 30% of RAM on my system. sudo atop shows that over a 10-second interval, the system in question consistently writes only 1MB. The filesystem concerned would be able to keep up with over 100 times that rate. (It's on a USB2 hard disk drive, max throughput ~20MB/s.)
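
(The meminfo dump above agrees: the amount of dirty, unwritten data at any instant is tiny.)

$ grep -E '^(Dirty|Writeback):' /proc/meminfo
Dirty:               240 kB
Writeback:             0 kB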

Using blktrace (btrace -w 10 /dev/sda) confirms that the IOs being cached must be writes, because almost no data is being read. It also shows that mysqld is the only userspace process doing IO.

I stopped the service responsible for the writes (icinga2 writing to mysql) and re-checked. I saw "buffers" drop to under 20M - I have no explanation for that - and stay there. Restarting the writer showed "buffers" rising by ~0.1M for each 10-second interval. I observed it maintain this rate consistently, climbing back to 70M and above.

Running echo 3 | sudo tee /proc/sys/vm/drop_caches was sufficient to lower "buffers" again, to 4.5M. This proves that my accumulation of buffers is a "clean" cache, which Linux can drop immediately when required. This system is not accumulating unwritten data. (drop_caches does not perform any writeback and hence cannot drop dirty pages. If you wanted to run a test which cleaned the cache first, you would use the sync command).
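
(A sketch of that test sequence:)

# sync                                # write back dirty pages first
# echo 3 > /proc/sys/vm/drop_caches   # then drop clean page cache, dentries and inodes
# free -w -m                          # "buffers" should be back near its floor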

The entire mysql directory is only 150M. The accumulating buffers must represent metadata blocks from mysql writes, but it surprised me to think there would be so many metadata blocks for this data.