After some more investigation, it looks like this issue is less kernel-related and more about how rsync and CIFS interact.
As far as I can make out, what is happening is that when rsync closes the destination file, CIFS (and probably any network filesystem) ensures the file is completely flushed and written to the remote disk before the close() syscall returns. This guarantees to any application that once the close operation completes successfully, the file has been saved in full and there is no risk of a later error causing data loss.
If this weren't done, an application could close a file and exit believing the save succeeded, and only later (perhaps due to a network problem) could the data turn out to be unwritable after all — by then it is too late for the application to do anything about it, such as asking the user whether to save the file somewhere else instead.
This requirement means that every time rsync finishes copying a file, the entire write buffer must be flushed over the network before rsync is allowed to continue reading the next file.
A workaround is to mount the CIFS share with the option cache=none, which disables this behaviour and causes all I/O to go directly to the server. This eliminates the problem and allows reads and writes to run in parallel; the drawback is somewhat lower raw throughput. In my case, network transfer speed drops from 110 MB/sec to 80 MB/sec.
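For reference, the workaround looks like this; the server, share, mount point, and username below are placeholders, not values from my setup:

```shell
# One-off mount with client-side caching disabled
# ("//server/share", "/mnt/share" and "myuser" are placeholders):
sudo mount -t cifs //server/share /mnt/share -o cache=none,username=myuser

# Or as a persistent /etc/fstab entry:
# //server/share  /mnt/share  cifs  cache=none,credentials=/etc/cifs-creds  0  0
```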
This means that if you are copying large files, performance may well be better with the default alternating read/write behaviour. With many smaller files, disabling the cache avoids a flush on every file close, so performance may increase there.
It seems rsync needs an option to close its file handles in another thread, so it can start reading the next file while the last one is still being flushed.
EDIT: I have confirmed that cache=none definitely helps when transferring lots of small files (it brings the rate from 10 MB/sec up to 80 MB/sec), but when transferring large files (1 GB+) cache=none drops the transfer from 110 MB/sec down to the same 80 MB/sec. This suggests that the slow transfer with many small files is less about source-disk seeking, and more about the cache flush triggered by every file close.
The problem is that the "USB-stick stall" article provides no evidence for its claim. There have been genuine "USB-stick stall" problems, and similar reports continue to appear. However, the thread discussed in the LWN article is not one of them! Therefore we cannot cite the article as an example, and any explanations it gives must be flawed, or at least incomplete.
Why were "USB-stick stall" problems reported in 2013? Why wasn't this problem solved by the existing "No-I/O dirty throttling" code?
To summarize the linked answer:
The problem reported to linux-kernel did not involve the entire system hanging while cached writes were flushed to a USB stick. The initial report by Artem simply complained that Linux allowed a very large amount of cached writes to build up for a slow device, which could then take "dozens of minutes" to finish.
As you say, Linus's suggested "fix" has not been applied. Current kernel versions (up to v4.20) still allow systems with large RAM to build up large amounts of dirty data in the page cache, which can take a long time to write out.
The kernel already had some code designed to avoid "USB-stick stalls": the "No-I/O dirty throttling" code, which was also described on LWN, in 2011. It throttles write() calls to control both the size of the overall writeback cache and the proportion of the writeback cache used by each specific backing device. This is a complex, carefully engineered system, which has been tweaked over time. I am sure it has some limitations, but so far I have not been able to quantify any. There have also been various bugfixes outside the dirty throttling code, for issues which prevented it from working.
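The global limits this code enforces are visible under /proc/sys/vm. A quick way to check the current values (the numbers in the comments are common defaults; distributions vary):

```shell
# Global dirty-throttling limits, expressed as percentages of "available" memory.
cat /proc/sys/vm/dirty_background_ratio   # typically 10: background writeback starts here
cat /proc/sys/vm/dirty_ratio              # typically 20: writers start blocking here
```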
WBT limits the number of submitted IO requests for each individual device. It does not limit the writeback cache, i.e. the dirty page cache.
Artem posted a followup report that writing 10GB to a server's internal disk caused the system to hang, or at least suffer extremely long delays in responding. That is consistent with the problem that WBT aims to address.
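For context, WBT is configured per device through a single sysfs knob (present on kernels built with CONFIG_BLK_WBT, v4.10 and later); "sda" below is a placeholder device name:

```shell
# 0 means WBT is disabled; any other value is the target latency in microseconds.
cat /sys/block/sda/queue/wbt_lat_usec

# Enable WBT with a 75 ms target completion latency (requires root):
echo 75000 | sudo tee /sys/block/sda/queue/wbt_lat_usec
```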
Sidenotes kept from previous versions of this answer:
The scenario described for WBT is when you are writing a large batch of data to your main disk, and at the same time you want to keep using your main disk interactively, to load programs etc.
In contrast, when people talk about a "USB-stick stall" problem, they mean writing a large batch of data to a different disk / external USB etc, and then suffering surprising delays in programs that have nothing to do with that disk. Example:
"Even things as simple as moving windows around could stutter... It wasn't CPU load, because ssh sessions to remote machines were perfectly responsive; instead it seemed that anything that might vaguely come near doing filesystem IO was extensively delayed."
The 2013 mailing list thread about the USB stick problem, mentioned per-device limits on dirty page cache as a possibility for future work.
WBT does not work with the CFQ or BFQ IO schedulers. Debian and Fedora use CFQ by default, so WBT will not help for USB sticks (nor spinning hard drives) unless you have some special configuration.
Traditionally CFQ has been used to work well with spinning hard drives. I'm not entirely sure where this leaves WBT. Maybe the main advantage of WBT is for SSDs, which are faster than spinning hard drives, but too slow to treat like RAM?
Or maybe it's an argument to use the deadline scheduler instead, and forgo the CFQ features. Ubuntu switched to deadline in version 14.04, but switched back to CFQ in version 17.04 (zesty). (I think CentOS 7.0 is too old to have WBT, but it claims to use CFQ for SATA drives, and deadline for all other drives. CentOS 7.0 also supports NVMe drives, but only shows "none" for their scheduler.)
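To see which scheduler a given device is actually using, read its queue/scheduler file; the active scheduler is shown in brackets ("sda" is a placeholder device name):

```shell
cat /sys/block/sda/queue/scheduler   # e.g. "noop deadline [cfq]"
```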
Best Answer
dirty_ratio per device
There are some settings for this, but they are not as effective as you might hope. See the bdi ("backing device") objects in sysfs. The catch is: "this setting only takes effect after we have more than (dirty_background_ratio+dirty_ratio)/2 dirty data in total. Because that is the amount of dirty data when we start to throttle processes. So if the device you'd like to limit is the only one which is currently written to, the limiting doesn't have a big effect." Further reading:
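Concretely, the per-device knobs live under /sys/class/bdi/, with one object per backing device named by its major:minor number. "8:0" below is a placeholder (it is typically sda), and writing requires root:

```shell
ls /sys/class/bdi/                    # one entry per backing device, e.g. 8:0
cat /sys/class/bdi/8:0/min_ratio      # default 0
cat /sys/class/bdi/8:0/max_ratio      # default 100, i.e. effectively unlimited

# Cap this device's share of the writeback cache at 10%:
echo 10 | sudo tee /sys/class/bdi/8:0/max_ratio
```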
For simplicity, let us ignore your 30% setting and assume the defaults: dirty_background_ratio=10 and dirty_ratio=20. In this case, processes are allowed to dirty pages without any delays, until the system as a whole reaches the 15% point.
:-/
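To see where that "freerun" threshold lands on a given system, here is a rough calculation; note that MemAvailable is only a stand-in for the "available" memory (free memory + page cache) the ratios are actually measured against:

```shell
# Throttling only begins once total dirty data exceeds
# (dirty_background_ratio + dirty_ratio) / 2 of available memory.
bg=$(cat /proc/sys/vm/dirty_background_ratio)
fg=$(cat /proc/sys/vm/dirty_ratio)
freerun_pct=$(( (bg + fg) / 2 ))      # 15 with the defaults of 10 and 20
avail_kb=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
echo "freerun: ${freerun_pct}% ~ $(( avail_kb * freerun_pct / 100 / 1024 )) MiB"
```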
This sounds similar to the "pernicious USB-stick stall problem", which LWN.net wrote an article about. Unfortunately this particular article is misleading. It was so confused that it fabricated a different problem from the one that was reported.
One possibility is that you are reproducing a more specific defect. If you can report it to the kernel developers, they might be able to analyze it and find a solution, much as the interaction with transparent hugepages was solved. You would be expected to reproduce the problem using the upstream kernel. Or talk to your paid support contact :).
Otherwise, there is a patch that can be applied to expose the internal strictlimit setting. This lets you turn max_ratio into a strict limit. The patch has not been applied to mainline. If enough people show a need for it, the patch might get applied, or it might encourage some work to remove the need for it. (mm-add-strictlimit-knob-v2.patch is still sitting in -mm.) A couple of times, people have mentioned ideas about better auto-tuning of the dirty cache, but I haven't found a lot of work on it. An appealing suggestion is to keep 5 seconds' worth of write-back cache per device; however, the speed of a device can change suddenly, e.g. depending on whether the IO pattern is random or sequential.

Analysis (but no conclusion)
These are not treated exactly the same. See the quote from the BDI doc above. "Each device is given a part of the total write-back cache that relates to its current average writeout speed."
However, this still makes it possible for the slow device to fill up the overall write-back cache, to somewhere between the 15-20% marks, if the slow device is the only one being written to.
If you start writing to a device which has less than its allowed share of the maximum writeback cache, the "dirty throttling" code should make some allowances. This would let you use some of the remaining margin, and avoid having to wait for the slow device to make room for you.
The doc suggests min_ratio and max_ratio settings were added in case your device speeds vary unpredictably, including stalling while an NFS server is unavailable.
The problem is if the dirty throttling fails to control the slow device, and it manages to fill up to (or near) the 20% hard limit.
The dirty throttling code that we're interested in was reshaped in v3.2. For an introduction, see the LWN.net article "IO-less dirty throttling". Also, following the release, Fengguang Wu presented at LinuxCon Japan. His presentation slides are very detailed and informative.
The goal was to delegate all writeback for a BDI to a dedicated thread, to allow a much better pattern of IO. But they also had to change to a less direct throttling system. At best, this makes the code harder to reason about. It has been well-tested, but I'm not sure that it covers every possible operating regime.
In fact, looking at v4.18, there is explicit fallback code for a more extreme version of your problem: when one BDI is completely non-responsive. It tries to make sure other BDIs can still make forward progress, but... they would be much more limited in how much writeback cache they can use. Performance would likely be degraded, even if there is only one writer.
You mention your system was under memory pressure. This is one example of a case which could be very challenging. When "available" memory goes down, it can put pressure on the size of the write-back cache. "dirty_ratio" is actually a percentage of "available" memory, which means free memory + page cache.
This case was noticed during the original work. There is an attempt to mitigate it. It says that "the new dirty limits are not going to avoid throttling the light dirtiers, but could limit their sleep time to 200ms."
Test case for "max_ratio"
Set up a VM / laptop / whatever which does not have an expensively large amount of RAM. Run dd if=/dev/zero bs=1M of=~/test, and watch the write cache with grep -E '^(Dirty:|Writeback:)' /proc/meminfo. You should see dirty+writeback settle around a "set point".

The set point is 17.5%, half-way between 15% and 20%. My results on Linux v4.18 are here. If you want to see an exact percentage, be aware that the ratios are not a percentage of total RAM; I suggest you use the tracepoint in balance_dirty_pages().
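A bounded version of that test can be scripted as below; count=512 keeps the write finite (the dd above runs until interrupted), and the numbers you see will depend on RAM size and disk speed:

```shell
# Stream zeros to a file in the background and sample the write cache
# once a second while it runs.
dd if=/dev/zero bs=1M count=512 of=~/test &
for i in 1 2 3 4 5; do
    grep -E '^(Dirty:|Writeback:)' /proc/meminfo
    sleep 1
done
wait          # let dd finish
rm -f ~/test  # clean up
```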
I ran this test with different values of max_ratio in the filesystem's BDI. As expected, it was not possible to limit the write-back cache below the 15% point.