Linux – Writeback cache (`dirty`) seems to be limited below the expected threshold where throttling starts. What is it being limited by?

cache, linux, linux-kernel, sysctl

In a previous question and answer, I showed an experiment about dirty_ratio.

Writeback cache (`dirty`) seems to be limited to even less than dirty_background_ratio. What is it being limited by? How is this limit calculated?

I thought I had solved that question by correcting my understanding of the dirty ratio calculation. But I repeated the experiment just now, and the write-back cache was limited to a lower value than I saw before. I can't work this out; what could be limiting it?

I have default values for the vm.dirty* sysctls: dirty_background_ratio is 10, and dirty_ratio is 20. The "ratios" refer to the size of the dirty page cache, aka write-back cache, as a percentage of MemFree + Cached. They are not a percentage of MemTotal; this was what confused me in the above question.
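
For reference, these values can be confirmed directly with sysctl; on my system this prints the defaults just mentioned:

$ sysctl vm.dirty_background_ratio vm.dirty_ratio
vm.dirty_background_ratio = 10
vm.dirty_ratio = 20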

These ratios mean that reaching 10% causes background writeback to start, and 20% is the maximum size of the write-back cache. Additionally, I understand the write-back cache is limited by "I/O-less dirty throttling": when the write-back cache rises above 15%, processes which generate dirty pages, e.g. with write(), are "throttled". That is, the kernel causes the process to sleep inside the write() call, so the kernel can control the size of the write-back cache by controlling the length of the sleeps. For references, see the answer to my previous question.
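
As far as I understand it, the 15% figure is simply the midpoint of the two ratios. As a rough sketch of that understanding (not the kernel's exact code), it can be derived from the sysctls like this:

$ awk -v bg="$(sysctl -n vm.dirty_background_ratio)" -v hard="$(sysctl -n vm.dirty_ratio)" 'BEGIN { print (bg + hard) / 2 }'
15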

But my observed "ratio" seems to stay distinctly lower than the 15% throttling threshold. There must be some factor I am missing! Why is this happening?

In my previous test I saw values around 15-17.5% instead.

My kernel is Linux 4.18.16-200.fc28.x86_64.

The test is as follows: I ran dd if=/dev/zero of=~/test bs=1M status=progress, and at the same time I monitored the achieved dirty ratio. I interrupted the dd command after 15GB.

$ while true; do grep -E '^(Dirty:|Writeback:|MemFree:|Cached:)' /proc/meminfo | tr '\n' ' '; echo; sleep 1; done
...
MemFree:          139852 kB Cached:          3443460 kB Dirty:            300240 kB Writeback:        135280 kB
MemFree:          145932 kB Cached:          3437220 kB Dirty:            319588 kB Writeback:        112080 kB
MemFree:          134324 kB Cached:          3448776 kB Dirty:            237612 kB Writeback:        160528 kB
MemFree:          134012 kB Cached:          3449004 kB Dirty:            169064 kB Writeback:        143256 kB
MemFree:          133760 kB Cached:          3449024 kB Dirty:            105484 kB Writeback:        119968 kB
MemFree:          133584 kB Cached:          3449032 kB Dirty:             49068 kB Writeback:        104412 kB
MemFree:          134712 kB Cached:          3449116 kB Dirty:                80 kB Writeback:         78740 kB
MemFree:          135448 kB Cached:          3449116 kB Dirty:                 8 kB Writeback:             0 kB

For example, the first line in the quoted output:

avail = 139852 + 3443460 = 3583312
dirty = 300240 + 135280 = 435520
ratio = 435520 / 3583312 = 0.122...
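
The same arithmetic can be folded into the monitoring loop itself. The following is only a variant of the loop above with the division added, still using my MemFree + Cached definition of "available":

while true; do
    # sum MemFree + Cached ("available") and Dirty + Writeback, then print the ratio
    awk '/^(MemFree|Cached|Dirty|Writeback):/ { m[$1] = $2 }
         END { avail = m["MemFree:"] + m["Cached:"]
               dirty = m["Dirty:"] + m["Writeback:"]
               printf "avail %d kB  dirty %d kB  ratio %.3f\n", avail, dirty, dirty / avail }' /proc/meminfo
    sleep 1
done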

I found one thing that was limiting it, but not enough to explain these results. I had been experimenting with setting /sys/class/bdi/*/max_ratio. The test results in the question are from running with max_ratio = 1.
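
For anyone repeating this, the setting can be inspected and restored like so (the entries under /sys/class/bdi/ are named after device major:minor numbers, and 100 is the default):

$ grep . /sys/class/bdi/*/max_ratio                 # show the current per-device limits
$ echo 100 | sudo tee /sys/class/bdi/*/max_ratio    # restore the default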

Repeating the above test with max_ratio = 100, I can achieve a higher dirty ratio, e.g. 0.142:

MemFree:   122936 kB Cached:   3012244 kB Dirty:     333224 kB Writeback: 13532 kB

The write test needs to be quite long to observe this reliably, e.g. writing 8GB. That test takes about 100 seconds; I am using a spinning hard disk.

I tried testing with 4GB, and I only saw a dirty ratio of 0.129:

MemFree:   118388 kB Cached:   2982720 kB Dirty:     249020 kB Writeback: 151556 kB

As I say, this surprises me. I have an expert source from 2013 saying that dd should have "free run" to generate dirty pages until the system hits a dirty ratio of 0.15, and it is explicitly talking about max_ratio.

Best Answer

The "ratios" refer to the size of the dirty page cache aka write-back cache, as a percentage of MemFree + Cached. They are not a percentage of MemTotal - this was what confused me in the above question.

No. This description is still inaccurate.

Cached includes all files in tmpfs, and other Shmem allocations. They are counted because they are implemented using the page cache. However, they are not a cache of any persistent storage, so they cannot simply be dropped. tmpfs pages can be swapped out, but swap-backed pages are not included in the dirty limit calculation.

I had 500-600MB of Shmem. This was about the right amount to explain why limit / 0.20 was lower than I had expected when I looked at the tracepoint again (see the answer to the previous question).

Also, Cached excludes Buffers, which can be a surprisingly large amount on certain setups.
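
So a closer, though still rough, approximation of the "available" figure I was using would subtract Shmem and add Buffers when reading /proc/meminfo. This is only a sketch of the corrections described above, not the kernel's exact calculation:

# approximate dirtyable memory as MemFree + Cached + Buffers - Shmem
awk '/^(MemFree|Cached|Buffers|Shmem|Dirty|Writeback):/ { m[$1] = $2 }
     END { avail = m["MemFree:"] + m["Cached:"] + m["Buffers:"] - m["Shmem:"]
           dirty = m["Dirty:"] + m["Writeback:"]
           printf "approx dirtyable %d kB  dirty %d kB  ratio %.3f\n", avail, dirty, dirty / avail }' /proc/meminfo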

I think I should look carefully at the implementation of global_dirtyable_memory() for my kernel version, and use the lower-level counters exposed in /proc/vmstat. Or perhaps focus on using the tracepoint instead.
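
As far as I can tell, that calculation is roughly free pages plus the file LRU pages, minus a reserve. Ignoring the reserve, the corresponding /proc/vmstat counters can be read like this, purely as a sketch rather than a substitute for reading the source:

# rough dirtyable estimate: free pages + file LRU pages (reserve ignored)
awk '/^(nr_free_pages|nr_inactive_file|nr_active_file|nr_dirty|nr_writeback) / { m[$1] = $2 }
     END { dirtyable = m["nr_free_pages"] + m["nr_inactive_file"] + m["nr_active_file"]
           dirty = m["nr_dirty"] + m["nr_writeback"]
           printf "dirtyable %d pages  dirty %d pages  ratio %.3f\n", dirtyable, dirty, dirty / dirtyable }' /proc/vmstat

/proc/vmstat also exposes nr_dirty_threshold and nr_dirty_background_threshold, which are the limits the kernel has already computed, so they can be compared against an estimate like the one above.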