After some more investigation, it looks like this issue is less kernel-related and more about how rsync and CIFS interact.
As far as I can make out, what is happening is that when rsync closes the destination file, CIFS (and probably any network filesystem) ensures the file is completely flushed and written to the remote disk before the close() syscall returns. This guarantees to any application that once the close operation completes successfully, the file has been saved in full and there is no risk of a later error causing data loss.
If this weren't done, an application could close a file and exit believing the save succeeded, and only later (perhaps due to a network problem) could the data turn out to be unwritable after all — by then it is too late for the application to do anything about it, such as asking the user whether to save the file somewhere else instead.
This requirement means that every time rsync finishes copying a file, the entire write buffer must be flushed over the network before rsync is allowed to continue reading the next file.
A workaround is to mount the CIFS share with the option cache=none, which disables this behaviour and causes all I/O to go directly to the server. This eliminates the problem and allows reads and writes to run in parallel; the drawback is somewhat lower raw throughput. In my case, network transfer speed drops from 110 MB/sec to 80 MB/sec.
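For reference, the workaround looks like this; the server, share, mount point, and username below are placeholders, not values from my setup:

```shell
# One-off mount with client-side caching disabled
# ("//server/share", "/mnt/share" and "myuser" are placeholders):
sudo mount -t cifs //server/share /mnt/share -o cache=none,username=myuser

# Or as a persistent /etc/fstab entry:
# //server/share  /mnt/share  cifs  cache=none,credentials=/etc/cifs-creds  0  0
```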
This means that if you are copying large files, performance may well be better with the default alternating read/write behaviour. With many smaller files, disabling the cache avoids a flush on every file close, so performance may increase there.
It seems rsync needs an option to close its file handles in another thread, so it can start reading the next file while the last one is still being flushed.
EDIT: I have confirmed that cache=none definitely helps when transferring lots of small files (it brings the rate from 10 MB/sec up to 80 MB/sec), but when transferring large files (1 GB+) cache=none drops the transfer from 110 MB/sec down to the same 80 MB/sec. This suggests that the slow transfer with many small files is less about source-disk seeking, and more about the cache flush triggered by every file close.
The problem is that the "USB-stick stall" article provides no evidence for its claim. There have been genuine "USB-stick stall" problems, and similar reports continue to appear. However, the thread discussed in the LWN article is not one of them! Therefore we cannot cite the article as an example, and any explanations it gives must be flawed, or at least incomplete.
Why were "USB-stick stall" problems reported in 2013? Why wasn't this problem solved by the existing "No-I/O dirty throttling" code?
To summarize the linked answer:
The problem reported to linux-kernel did not involve the entire system hanging while cached writes were flushed to a USB stick. The initial report by Artem simply complained that Linux allowed a very large amount of cached writes to build up for a slow device, which could then take "dozens of minutes" to finish.
As you say, Linus's suggested "fix" has not been applied. Current kernel versions (up to v4.20) still allow systems with large RAM to build up large amounts of dirty data in the page cache, which can take a long time to write out.
The kernel already had some code designed to avoid "USB-stick stalls": the "No-I/O dirty throttling" code, which was also described on LWN, in 2011. It throttles write() calls to control both the size of the overall writeback cache and the proportion of the writeback cache used by each specific backing device. This is a complex, carefully engineered system, which has been tweaked over time. I am sure it has some limitations, but so far I have not been able to quantify any. There have also been various bugfixes outside the dirty throttling code, for issues which prevented it from working.
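The global limits this code enforces are visible under /proc/sys/vm. A quick way to check the current values (the numbers in the comments are common defaults; distributions vary):

```shell
# Global dirty-throttling limits, expressed as percentages of "available" memory.
cat /proc/sys/vm/dirty_background_ratio   # typically 10: background writeback starts here
cat /proc/sys/vm/dirty_ratio              # typically 20: writers start blocking here
```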
WBT limits the number of submitted IO requests for each individual device. It does not limit the writeback cache, i.e. the dirty page cache.
Artem posted a followup report that writing 10GB to a server's internal disk caused the system to hang, or at least suffer extremely long delays in responding. That is consistent with the problem that WBT aims to address.
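For context, WBT is configured per device through a single sysfs knob (present on kernels built with CONFIG_BLK_WBT, v4.10 and later); "sda" below is a placeholder device name:

```shell
# 0 means WBT is disabled; any other value is the target latency in microseconds.
cat /sys/block/sda/queue/wbt_lat_usec

# Enable WBT with a 75 ms target completion latency (requires root):
echo 75000 | sudo tee /sys/block/sda/queue/wbt_lat_usec
```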
Sidenotes kept from previous versions of this answer:
The scenario described for WBT is when you are writing a large batch of data to your main disk, and at the same time you want to keep using your main disk interactively, to load programs etc.
In contrast, when people talk about a "USB-stick stall" problem, they mean writing a large batch of data to a different disk / external USB etc, and then suffering surprising delays in programs that have nothing to do with that disk. Example:
"Even things as simple as moving windows around could stutter... It wasn't CPU load, because ssh sessions to remote machines were perfectly responsive; instead it seemed that anything that might vaguely come near doing filesystem IO was extensively delayed."
The 2013 mailing list thread about the USB stick problem, mentioned per-device limits on dirty page cache as a possibility for future work.
WBT does not work with the CFQ or BFQ IO schedulers. Debian and Fedora use CFQ by default, so WBT will not help for USB sticks (nor spinning hard drives) unless you have some special configuration.
Traditionally CFQ has been used to work well with spinning hard drives. I'm not entirely sure where this leaves WBT. Maybe the main advantage of WBT is for SSDs, which are faster than spinning hard drives, but too slow to treat like RAM?
Or maybe it's an argument to use the deadline scheduler instead, and forgo the CFQ features. Ubuntu switched to deadline in version 14.04, but switched back to CFQ in version 17.04 (zesty). (I think CentOS 7.0 is too old to have WBT, but it claims to use CFQ for SATA drives, and deadline for all other drives. CentOS 7.0 also supports NVMe drives, but only shows "none" for their scheduler.)
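To see which scheduler a given device is actually using, read its queue/scheduler file; the active scheduler is shown in brackets ("sda" is a placeholder device name):

```shell
cat /sys/block/sda/queue/scheduler   # e.g. "noop deadline [cfq]"
```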
Best Answer
dirty_ratio per device
There are some settings for this, but they are not as effective as you might hope. See the bdi ("backing device") objects in sysfs. The catch is: "this setting only takes effect after we have more than (dirty_background_ratio+dirty_ratio)/2 dirty data in total. Because that is the amount of dirty data when we start to throttle processes. So if the device you'd like to limit is the only one which is currently written to, the limiting doesn't have a big effect." Further reading:
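Concretely, the per-device knobs live under /sys/class/bdi/, with one object per backing device named by its major:minor number. "8:0" below is a placeholder (it is typically sda), and writing requires root:

```shell
ls /sys/class/bdi/                    # one entry per backing device, e.g. 8:0
cat /sys/class/bdi/8:0/min_ratio      # default 0
cat /sys/class/bdi/8:0/max_ratio      # default 100, i.e. effectively unlimited

# Cap this device's share of the writeback cache at 10%:
echo 10 | sudo tee /sys/class/bdi/8:0/max_ratio
```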
For simplicity, let us ignore your 30% setting and assume the defaults: dirty_background_ratio=10 and dirty_ratio=20. In this case, processes are allowed to dirty pages without any delays, until the system as a whole reaches the 15% point.
:-/
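To see where that "freerun" threshold lands on a given system, here is a rough calculation; note that MemAvailable is only a stand-in for the "available" memory (free memory + page cache) the ratios are actually measured against:

```shell
# Throttling only begins once total dirty data exceeds
# (dirty_background_ratio + dirty_ratio) / 2 of available memory.
bg=$(cat /proc/sys/vm/dirty_background_ratio)
fg=$(cat /proc/sys/vm/dirty_ratio)
freerun_pct=$(( (bg + fg) / 2 ))      # 15 with the defaults of 10 and 20
avail_kb=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
echo "freerun: ${freerun_pct}% ~ $(( avail_kb * freerun_pct / 100 / 1024 )) MiB"
```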
This sounds similar to the "pernicious USB-stick stall problem", which LWN.net wrote an article about. Unfortunately this particular article is misleading. It was so confused that it fabricated a different problem from the one that was reported.
One possibility is that you are reproducing a more specific defect. If you can report it to the kernel developers, they might be able to analyze it and find a solution, much as the interaction with transparent hugepages was solved. You would be expected to reproduce the problem using the upstream kernel. Or talk to your paid support contact :).
Otherwise, there is a patch that can be applied to expose the internal strictlimit setting. This lets you turn max_ratio into a strict limit. The patch has not been applied to mainline. If enough people show a need for it, the patch might get applied, or it might encourage some work to remove the need for it. (mm-add-strictlimit-knob-v2.patch is still sitting in -mm.) A couple of times, people have mentioned ideas about better auto-tuning of the dirty cache, but I haven't found a lot of work on it. An appealing suggestion is to keep 5 seconds' worth of write-back cache per device; however, the speed of a device can change suddenly, e.g. depending on whether the IO pattern is random or sequential.

Analysis (but no conclusion)
These are not treated exactly the same. See the quote from the BDI doc above. "Each device is given a part of the total write-back cache that relates to its current average writeout speed."
However, this still makes it possible for the slow device to fill up the overall write-back cache, to somewhere between the 15-20% marks, if the slow device is the only one being written to.
If you start writing to a device which has less than its allowed share of the maximum writeback cache, the "dirty throttling" code should make some allowances. This would let you use some of the remaining margin, and avoid having to wait for the slow device to make room for you.
The doc suggests min_ratio and max_ratio settings were added in case your device speeds vary unpredictably, including stalling while an NFS server is unavailable.
The problem is if the dirty throttling fails to control the slow device, and it manages to fill up to (or near) the 20% hard limit.
The dirty throttling code that we're interested in was reshaped in v3.2. For an introduction, see the LWN.net article "IO-less dirty throttling". Also, following the release, Fengguang Wu presented at LinuxCon Japan. His presentation slides are very detailed and informative.
The goal was to delegate all writeback for a BDI to a dedicated thread, to allow a much better pattern of IO. But they also had to change to a less direct throttling system. At best, this makes the code harder to reason about. It has been well-tested, but I'm not sure that it covers every possible operating regime.
In fact, looking at v4.18, there is explicit fallback code for a more extreme version of your problem: when one BDI is completely non-responsive. It tries to make sure other BDIs can still make forward progress, but... they would be much more limited in how much writeback cache they can use. Performance would likely be degraded, even if there is only one writer.
You mention your system was under memory pressure. This is one example of a case which could be very challenging. When "available" memory goes down, it can put pressure on the size of the write-back cache. "dirty_ratio" is actually a percentage of "available" memory, which means free memory + page cache.
This case was noticed during the original work. There is an attempt to mitigate it. It says that "the new dirty limits are not going to avoid throttling the light dirtiers, but could limit their sleep time to 200ms."
Test case for "max_ratio"
Set up a VM / laptop / whatever which does not have an expensively large amount of RAM. Run dd if=/dev/zero bs=1M of=~/test, and watch the write cache with grep -E '^(Dirty:|Writeback:)' /proc/meminfo. You should see dirty+writeback settle around a "set point".

The set point is 17.5%, half-way between 15% and 20%. My results on Linux v4.18 are here. If you want to see an exact percentage, be aware that the ratios are not a percentage of total RAM; I suggest you use the tracepoint in balance_dirty_pages().
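A bounded version of that test can be scripted as below; count=512 keeps the write finite (the dd above runs until interrupted), and the numbers you see will depend on RAM size and disk speed:

```shell
# Stream zeros to a file in the background and sample the write cache
# once a second while it runs.
dd if=/dev/zero bs=1M count=512 of=~/test &
for i in 1 2 3 4 5; do
    grep -E '^(Dirty:|Writeback:)' /proc/meminfo
    sleep 1
done
wait          # let dd finish
rm -f ~/test  # clean up
```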
I ran this test with different values of max_ratio in the filesystem's BDI. As expected, it was not possible to limit the write-back cache below the 15% point.