Linux – Writeback cache (`dirty`) seems to be limited to even less than dirty_background_ratio. What is it being limited by? How is this limit calculated?

Tags: atop, cache, linux, linux-kernel, sysctl

I have been testing Linux 4.18.16-200.fc28.x86_64. My system has 7.7G total RAM, according to free -h.

I have default values for the vm.dirty* sysctls: dirty_background_ratio is 10 and dirty_ratio is 20. Based on everything I've read, I expect Linux to begin writeout of dirty cache when it reaches 10% of RAM (0.77G), and buffered write() calls to block when dirty cache reaches 20% of RAM (1.54G).

I ran dd if=/dev/zero of=~/test bs=1M count=2000 and watched the dirty field in atop. While the dd command was running, the dirty value settled at around 0.5G. This is significantly less than the dirty background threshold (0.77G)! How can this be? What am I missing?

dirty_expire_centisecs is 3000, so I don't think that can be the cause. I even tried lowering dirty_expire_centisecs to 100, and dirty_writeback_centisecs to 10, to see if that was limiting dirty. This did not change the result.

I initially wrote these observations as part of this investigation: Why were "USB-stick stall" problems reported in 2013? Why wasn't this problem solved by the existing "No-I/O dirty throttling" code?


I understand that half-way between the two thresholds – 15% = 1.155G – write() calls start being throttled (delayed) on a curve. But no delay is added when underneath this ceiling; the processes generating dirty pages are allowed "free run".

As I understand it, the throttling aims to keep the dirty cache somewhere at or above 15%, and prevent hitting the 20% hard limit. It does not provide a guarantee for every situation. But I'm testing a simple case with one dd command; I think it should simply ratelimit the write() calls to match the writeout speed achieved by the device.

(There is not a simple guarantee because there are some complex exceptions. For example, the throttle code normally limits the delay it will impose to a maximum of 200ms. The exception is when the target ratelimit for the process is less than one page per second; in that case it will apply a strict ratelimit.)

  • Documentation/sysctl/vm.txt — Linux v4.18
  • No-I/O dirty throttling — 2011 LWN.net.
  • (dirty_background_ratio + dirty_ratio)/2 dirty data in
    total … is an amount of dirty data when we start to throttle
    processes — Jan Kara, 2013

  • Users will notice that the applications will get throttled once crossing
    the global (background + dirty)/2=15% threshold, and then balanced around
    17.5%. Before patch, the behavior is to just throttle it at 20% dirtyable
    memory

    — commit 143dfe8611a6, "writeback: IO-less balance_dirty_pages()"

  • "The memory-management subsystem will, by default, try to limit dirty pages to a maximum of 15% of the memory on the system. There is a "magical function" called balance_dirty_pages() that will, if need be, throttle processes dirtying a lot of pages in order to match the rate at which pages are being dirtied and the rate at which they can be cleaned." — Writeback and control groups, 2015 LWN.net.

  • balance_dirty_pages() in Linux 4.18.16.
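
The percentages in the references above can be put side by side in a small worked example. This is only a sketch of the arithmetic, not the kernel code: the helper name `dirty_levels` is made up for illustration, the 7.7G base is the total-RAM figure I assumed in this question, and the balance point assumes the hard limit is equal to the dirty threshold.

```python
def dirty_levels(base, background_ratio=10, dirty_ratio=20):
    """Return the levels discussed above, in the same unit as `base`.

    `dirty_levels` is a hypothetical helper, not a kernel function.
    """
    bg_thresh = base * background_ratio / 100  # background writeout starts (10%)
    thresh = base * dirty_ratio / 100          # hard limit for write() (20%)
    freerun = (thresh + bg_thresh) / 2         # throttling starts here (15%)
    setpoint = (freerun + thresh) / 2          # assumed balance point (17.5%)
    return {
        "bg_thresh": bg_thresh,
        "freerun": freerun,
        "setpoint": setpoint,
        "thresh": thresh,
    }


BASE_GIB = 7.7  # total RAM from `free -h`; assumed to be the right base
levels = dirty_levels(BASE_GIB)
for name, value in levels.items():
    print(f"{name:10s} {value:.3f}G  ({100 * value / BASE_GIB:.1f}% of base)")
```

With this base, every level is above the 0.5G that atop reported, which is what makes the observation puzzling.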

Best Answer

Look at Documentation/sysctl/vm.txt:

dirty_ratio

Contains, as a percentage of total available memory that contains free pages and reclaimable pages, the number of pages at which a process which is generating disk writes will itself start writing out dirty data.

The total available memory is not equal to total system memory.

The available memory is calculated in global_dirtyable_memory(). It is equal to the amount of free memory plus the page cache. It does not include swappable pages (i.e. anonymous memory allocations, memory which is not backed by a file).
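
As a rough illustration, a similar total can be approximated from /proc/meminfo. The sketch below assumes that the `Active(file)` and `Inactive(file)` fields are the closest user-visible counterparts of the page cache pages the kernel counts, and it ignores any further adjustments global_dirtyable_memory() makes (such as pages reserved for the kernel), so treat the result as an estimate only. The helper names and the sample numbers are made up for illustration.

```python
import re


def parse_meminfo(text):
    """Map each /proc/meminfo field name to its numeric value (kB for most fields)."""
    pattern = re.compile(r"^([^:\n]+):\s+(\d+)", re.MULTILINE)
    return {m.group(1): int(m.group(2)) for m in pattern.finditer(text)}


def estimate_dirtyable_kb(meminfo):
    """Estimate dirtyable memory: free pages plus file-backed page cache.

    Assumption: anonymous (swappable) memory is excluded, and kernel
    reserves are ignored, so this slightly overestimates the kernel's value.
    """
    return meminfo["MemFree"] + meminfo["Active(file)"] + meminfo["Inactive(file)"]


# Hypothetical sample values, not taken from the test system.
SAMPLE = """\
MemTotal:        8000000 kB
MemFree:         1000000 kB
Active(anon):    2000000 kB
Active(file):     500000 kB
Inactive(file):  1500000 kB
"""

info = parse_meminfo(SAMPLE)
dirtyable = estimate_dirtyable_kb(info)
print("dirtyable estimate:  ", dirtyable, "kB")
print("background threshold:", dirtyable * 10 // 100, "kB")
print("dirty threshold:     ", dirtyable * 20 // 100, "kB")
```

On a live system, pass `open("/proc/meminfo").read()` to `parse_meminfo()` instead of `SAMPLE`. In the sample, the thresholds are computed from 3,000,000 kB rather than from the 8,000,000 kB of total memory.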

This behaviour has applied since Linux 3.14 (2014). Before this change, swappable pages were included in the global_dirtyable_memory() total.

Example statistics while running the dd command:

$ while true; do grep -E '^(Dirty:|Writeback:|MemFree:|Cached:)' /proc/meminfo | tr '\n' ' '; echo; sleep 1; done
MemFree:         1793676 kB Cached:          1280812 kB Dirty:                 4 kB Writeback:             0 kB
MemFree:         1240728 kB Cached:          1826644 kB Dirty:            386128 kB Writeback:         67608 kB
MemFree:         1079700 kB Cached:          1983696 kB Dirty:            319812 kB Writeback:        143536 kB
MemFree:          937772 kB Cached:          2121424 kB Dirty:            312048 kB Writeback:        112520 kB
MemFree:          755776 kB Cached:          2298276 kB Dirty:            389828 kB Writeback:         68408 kB
...
MemFree:          136376 kB Cached:          2984308 kB Dirty:            485332 kB Writeback:         51300 kB
MemFree:          101340 kB Cached:          3028996 kB Dirty:            450176 kB Writeback:        119348 kB
MemFree:          122304 kB Cached:          3021836 kB Dirty:            552620 kB Writeback:          8484 kB
MemFree:          101016 kB Cached:          3053628 kB Dirty:            501128 kB Writeback:         61028 kB

The last line shows about 3,150,000 kB "available" memory (MemFree + Cached), and a total of 562,000 kB data either being written back or waiting for writeback (Dirty + Writeback). That makes it 17.8%, although the proportion seemed to fluctuate above and below that level, and was more often closer to 15%. EDIT: although these figures look closer, please do not trust this method. It is still not the right calculation, and it could give very wrong results. See the followup here.
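
The arithmetic behind the 17.8% figure can be reproduced from the sample lines. This is only a sketch of the rough MemFree + Cached method used here, which, as the EDIT says, is not the kernel's actual calculation; the function name `dirty_share` is just for illustration.

```python
import re


def dirty_share(line):
    """Return (Dirty + Writeback) / (MemFree + Cached) for one sample line."""
    kb = {name: int(value) for name, value in re.findall(r"(\w+):\s+(\d+) kB", line)}
    return (kb["Dirty"] + kb["Writeback"]) / (kb["MemFree"] + kb["Cached"])


# The second and the last sample lines from the output above.
samples = [
    "MemFree:         1240728 kB Cached:          1826644 kB Dirty:            386128 kB Writeback:         67608 kB",
    "MemFree:          101016 kB Cached:          3053628 kB Dirty:            501128 kB Writeback:         61028 kB",
]

for line in samples:
    print(f"{100 * dirty_share(line):.1f}%")  # prints 14.8%, then 17.8%
```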


I found this the hard way:

I noticed there is a tracepoint in balance_dirty_pages(), which can be used for "analyzing the dynamics of the throttling algorithms". So I used perf:

$ sudo perf list '*balance_dirty_pages'

List of pre-defined events (to be used in -e):

  writeback:balance_dirty_pages                      [Tracepoint event]
...
$ sudo perf record -e writeback:balance_dirty_pages dd if=/dev/zero of=~/test bs=1M count=2000
$ sudo perf script

It showed that dirty (measured in 4096-byte pages) was lower than I expected, because setpoint was low. I traced the code: a low setpoint meant there must be a similarly low value for freerun, which the tracepoint definition sets to (thresh + bg_thresh) / 2 ... and from there I worked my way back to global_dirtyable_memory().
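
To compare the tracepoint values with /proc/meminfo, the page counts can be converted to MiB. A minimal sketch, assuming the `perf script` lines carry their fields as `name=value` pairs and that pages are 4096 bytes as noted above. The helper names are hypothetical, and the input string is a toy example, not real perf output.

```python
import re

PAGE_SIZE = 4096  # bytes per page, as stated above


def fields(line):
    """Extract integer name=value pairs from one line of text."""
    return {name: int(value) for name, value in re.findall(r"(\w+)=(-?\d+)", line)}


def pages_to_mib(pages):
    """Convert a page count to MiB."""
    return pages * PAGE_SIZE / (1024 * 1024)


# Toy input for illustration only; real lines contain more fields.
toy = "setpoint=25600 dirty=12800"
values = fields(toy)
for name in ("setpoint", "dirty"):
    print(f"{name}: {pages_to_mib(values[name]):.0f} MiB")
```

Feeding each line of the `perf script` output through `fields()` should give values that can be compared with the Dirty figure in /proc/meminfo after dividing the kB value by 1024.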