dirty_ratio per device
Q: Are there any ways to "whitelist" the fast devices to have more write cache? Or to have the slow devices (or remote "devices" like //cifs/paths) use less write cache?
There are some settings for this, but they are not as effective as you might hope. See the bdi ("backing device") objects in sysfs:
linux-4.18/Documentation/ABI/testing/sysfs-class-bdi
min_ratio (read-write)
Under normal circumstances each device is given a part of the total
write-back cache that relates to its current average writeout speed
in relation to the other devices.
The 'min_ratio' parameter allows assigning a minimum percentage of
the write-back cache to a particular device. For example, this is
useful for providing a minimum QoS.
max_ratio (read-write)
Allows limiting a particular device to use not more than the given
percentage of the write-back cache. This is useful in situations
where we want to avoid one device taking all or most of the
write-back cache. For example in case of an NFS mount that is prone
to get stuck.
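As an illustration, here is how you could read and change these knobs through sysfs. This is a minimal sketch, and the device name "sdb" and the 5% value are only examples: block devices expose their bdi directory under /sys/block/&lt;dev&gt;/bdi, and the same objects (including those for non-block filesystems such as CIFS or NFS mounts) are listed under /sys/class/bdi/.

    # Current per-device shares, as a percentage of the total write-back cache
    cat /sys/block/sdb/bdi/min_ratio /sys/block/sdb/bdi/max_ratio

    # Cap a slow device at 5% of the write-back cache (needs root)
    echo 5 > /sys/block/sdb/bdi/max_ratio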
The catch is "this setting only takes effect after
we have more than (dirty_background_ratio+dirty_ratio)/2 dirty data in
total. Because that is the amount of dirty data when we start to throttle
processes. So if the device you'd like to limit is the only one which is
currently written to, the limiting doesn't have a big effect." Further reading:
- LKML post by Jan Kara (2013).
- The "test case", at the end of this answer.
- commit 5fce25a9df48 in v2.6.24. "We allow violation of bdi limits if there is a lot of room on the system. Once we hit half the total limit we start enforcing bdi limits..." This is part of the same kernel release that added the internal per-device "limits". So the "limits" have always worked like this, except for pre-releases v2.6.24-rc1 and -rc2.
For simplicity, let us ignore your 30% setting and assume the defaults: dirty_background_ratio=10 and dirty_ratio=20. In this case, processes are allowed to dirty pages without any delays, until the system as a whole reaches the 15% point.
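To see where that 15% point comes from on your own system, you can read the two sysctls and take their midpoint (a quick sketch; the values in the comments are the defaults assumed above):

    # Global dirty limits (defaults: 10 and 20)
    sysctl vm.dirty_background_ratio vm.dirty_ratio

    # Per-device limiting only starts to bite above roughly
    # (dirty_background_ratio + dirty_ratio) / 2 = (10 + 20) / 2 = 15% dirty data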
Q: The situation is reproducible; setting dirty_ratio = 1 solves the problem completely.
:-/
This sounds similar to the "pernicious USB-stick stall problem", which LWN.net wrote an article about. Unfortunately this particular article is misleading. It was so confused that it fabricated a different problem from the one that was reported.
One possibility is that you are reproducing a more specific defect. If you can report it to kernel developers, they might be able to analyze it and find a solution, as happened with the interaction with transparent hugepages. You would be expected to reproduce the problem using the upstream kernel. Or talk to your paid support contact :).
Otherwise, there is a patch that can be applied to expose the internal strictlimit setting. This lets you change max_ratio into a strict limit. The patch has not been applied to mainline. If enough people show a need for this, the patch might get applied, or it might encourage some work to remove the need for it.
My concern is that while potentially useful, the feature
might not be sufficiently useful to justify its inclusion. So we'll
end up addressing these issues by other means, then we're left
maintaining this obsolete legacy feature.
I'm thinking that unless someone can show that this is good and
complete and sufficient for a "large enough" set of issues, I'll take a
pass on the patch[1]. What do people think?
[1] Actually, I'll stick it in -mm and maintain it, so next time
someone reports an issue I can say "hey, try this".
-- Andrew Morton, 2013
mm-add-strictlimit-knob-v2.patch is still sitting in -mm. A couple of times, people mentioned ideas about better auto-tuning of the dirty cache. I haven't found a lot of work on it though. An appealing suggestion is to keep 5 seconds' worth of write-back cache per device. However the speed of a device can change suddenly, e.g. depending on whether the IO pattern is random or sequential.
Analysis (but no conclusion)
Q: I was flabbergasted to find out that kernel treated flushing pages to some slow remote CIFS box in exactly the same way as to super-fast local SSD drive.
These are not treated exactly the same. See the quote from the BDI doc above. "Each device is given a part of the total write-back cache that relates to its current average writeout speed."
However, this still makes it possible for the slow device to fill up the overall write-back cache, to somewhere between the 15-20% marks, if the slow device is the only one being written to.
If you start writing to a device which has less than its allowed share of the maximum writeback cache, the "dirty throttling" code should make some allowances. This would let you use some of the remaining margin, and avoid having to wait for the slow device to make room for you.
The doc suggests min_ratio and max_ratio settings were added in case your device speeds vary unpredictably, including stalling while an NFS server is unavailable.
The problem is if the dirty throttling fails to control the slow device, and it manages to fill up to (or near) the 20% hard limit.
The dirty throttling code that we're interested in was reshaped in v3.2. For an introduction, see the LWN.net article "IO-less dirty throttling". Also, following the release, Fengguang Wu presented at LinuxCon Japan. His presentation slides are very detailed and informative.
The goal was to delegate all writeback for a BDI to a dedicated thread, to allow a much better pattern of IO. But they also had to change to a less direct throttling system. At best, this makes the code harder to reason about. It has been well-tested, but I'm not sure that it covers every possible operating regime.
In fact, looking at v4.18, there is explicit fallback code for a more extreme version of your problem: when one BDI is completely non-responsive. It tries to make sure other BDIs can still make forward progress, but... they would be much more limited in how much writeback cache they can use. Performance would likely be degraded, even if there is only one writer.
Q: Under memory pressure, when system reclaimed most of the read cache, system stubbornly tried to flush&reclaim the dirty (write) cache. So the situation was a huge CPU iowait accompanied with an excellent local disk I/O completion time, a lot of processes in D uninterruptible wait and a totally unresponsive system. OOM killer never engaged, because there was free memory that system wasn't giving out. (I think there is also a bug with CIFS, that crawled the flushing to incredibly slow speeds. But nevermind that here.)
You mention your system was under memory pressure. This is one example of a case which could be very challenging. When "available" memory goes down, it can put pressure on the size of the write-back cache. "dirty_ratio" is actually a percentage of "available" memory, which means free memory + page cache.
This case was noticed during the original work. There is an attempt to mitigate it. It says that "the new dirty limits are not going to avoid throttling the light dirtiers, but could limit their sleep time to 200ms."
Test case for "max_ratio"
Set up a VM / laptop / whatever, which does not have an expensively large amount of RAM. Run dd if=/dev/zero bs=1M of=~/test, and watch the write cache with grep -E '^(Dirty:|Writeback:)' /proc/meminfo. You should see dirty+writeback settle around a "set point".
The set point is 17.5%, half-way between 15% and 20%. My results on Linux v4.18 are here. If you want to see an exact percentage, be aware that the ratios are not a percentage of total RAM; I suggest you use the tracepoint in balance_dirty_pages().
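For example, one way to watch these decisions (assuming root, and that debugfs/tracefs is mounted in the usual place) is to enable the writeback:balance_dirty_pages tracepoint:

    # Enable the tracepoint and stream its output
    echo 1 > /sys/kernel/debug/tracing/events/writeback/balance_dirty_pages/enable
    cat /sys/kernel/debug/tracing/trace_pipe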
I ran this test with different values of max_ratio in the filesystem's BDI. As expected, it was not possible to limit the write-back cache below the 15% point.
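If you want to repeat the test, here is a rough sketch of the steps described above (the device name sda and the output path are only examples; run the dd and the watch in separate terminals):

    # Try to cap the filesystem's device well below the 15% point
    echo 1 > /sys/block/sda/bdi/max_ratio

    # Generate dirty pages on that filesystem...
    dd if=/dev/zero bs=1M of=~/test

    # ...and watch Dirty + Writeback; in my test it still climbed to around 15%
    watch "grep -E '^(Dirty:|Writeback:)' /proc/meminfo"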
The problem is that the "USB-stick stall" article provides no evidence for its claim. There have been genuine "USB-stick stall" problems, and there continue to be some similar reports. However the thread discussed by the LWN article is not one of them! Therefore we cannot cite the article as an example. Additionally, any explanations it gives must be flawed, or at least incomplete.
Why were "USB-stick stall" problems reported in 2013? Why wasn't this problem solved by the existing "No-I/O dirty throttling" code?
To summarize the linked answer:
The problem reported to linux-kernel was not that the entire system hung while cached writes were being flushed to a USB stick. The initial report by Artem simply complained that Linux allowed a very large amount of cached writes to build up for a slow device, which could take up to "dozens of minutes" to finish.
As you say, Linus' suggested "fix" has not been applied. Current kernel versions (v4.20 and below) still allow systems with large RAM to build up large amounts of writes in the page cache, which can take a long time to write out.
The kernel already had some code designed to avoid "USB-stick stalls". This is the "No-I/O dirty throttling" code. This code was also described on LWN, in 2011. It throttles write() calls to control both the size of the overall writeback cache, and the proportion of writeback cache used for the specific backing device. This is a complex engineered system, which has been tweaked over time. I am sure it will have some limitations. So far I am not able to quantify any limitation. There have also been various bugfixes outside the dirty throttling code, for issues which prevented it from being able to work.
WBT ("writeback throttling") limits the number of submitted IO requests for each individual device. It does not limit the writeback cache, i.e. the dirty page cache.
Artem posted a followup report that writing 10GB to a server's internal disk caused the system to hang, or at least suffer extremely long delays in responding. That is consistent with the problem that WBT aims to address.
Sidenotes kept from previous versions of this answer:
The scenario described for WBT is when you are writing a large batch of data to your main disk, and at the same time you want to keep using your main disk interactively, to load programs etc.
In contrast, when people talk about a "USB-stick stall" problem, they mean writing a large batch of data to a different disk / external USB etc, and then suffering surprising delays in programs that have nothing to do with that disk. Example:
"Even things as simple as moving windows around could stutter... It wasn't CPU load, because ssh sessions to remote machines were perfectly responsive; instead it seemed that anything that might vaguely come near doing filesystem IO was extensively delayed."
The 2013 mailing list thread about the USB stick problem, mentioned per-device limits on dirty page cache as a possibility for future work.
WBT does not work with the CFQ or BFQ IO schedulers. Debian and Fedora use CFQ by default, so WBT will not help for USB sticks (nor spinning hard drives) unless you have some special configuration.
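To check whether this applies to you, look at the active scheduler for a device (sda is just an example; the scheduler shown in [brackets] is the one in use):

    cat /sys/block/sda/queue/scheduler
    # e.g.  noop deadline [cfq]  - with cfq selected, WBT stays disabled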
Traditionally CFQ has been used to work well with spinning hard drives. I'm not entirely sure where this leaves WBT. Maybe the main advantage of WBT is for SSDs, which are faster than spinning hard drives, but too slow to treat like RAM?
Or maybe it's an argument to use the deadline scheduler instead, and forgo the CFQ features. Ubuntu switched to deadline in version 14.04, but switched back to CFQ in version 17.04 (zesty). (I think CentOS 7.0 is too old to have WBT, but it claims to use CFQ for SATA drives, and deadline for all other drives. CentOS 7.0 also supports NVMe drives, but only shows "none" for their scheduler.)
Best Answer
1. The 2013 article is wrong
The "USB-stick stall" article gives you a very misleading impression. It misrepresents both the original report, and the series of responses.
Artem did not report the entire system hanging when it flushed cached writes to a USB stick. His original report only complained that running the command "sync" could take up to "dozens of minutes". This distinction is made explicit in a response by Linus Torvalds:
2. A mistake in LWN? Are you sure?
Jon Corbet had fifteen years of experience, reporting Linux kernel development on a weekly basis. I expected the article was at least close to getting it right, in some sense. So I wanted to go through the two different records, and look out for detailed points where they agree or disagree.
I read all of the original discussion, using the archives at lore.kernel.org. I think the messages are pretty clear.
I am 100% certain the article misinterprets the discussion. In comments underneath the article, at least two readers repeated the false claim in their own words, and no-one corrected them. The article continues this confusion in the third paragraph:
This could be confusion from Linus saying "the thing just comes to a screeching halt". "The thing" refers to "anything that does sync". But Corbet writes as if "the thing" meant "the entire system".
As per Linus, this is a real-world problem. But the vast majority of "things" do not call into the system-wide sync() operation.[1]
Why might Corbet confuse this with "the entire system"? I guess there have been a number of problems, and after a while it gets hard to keep them all separate in your head :-). And although LWN has described the development of per-device (and per-process) dirty throttling, in general I suppose there is not much written about such details. A lot of documents only describe the global dirty limit settings.
3. Long queues in I/O devices, created by "background" writeback
Artem posted a second report in the thread, where "the server almost stalls and other IO requests take a lot more time to complete".
This second report does not match claims about a USB-stick hang. It happened after creating a 10GB file on an internal disk. This is a different problem.
The report did not confirm whether this could be improved by changing the dirty limits. And there is a more recent analysis of cases like this. There is a significant problem when it clogs up the I/O queue of your main disk. You can suffer long delays on a disk that you constantly rely on, to load program code on-demand, save documents and app data using write() + fsync(), etc.
The patches were merged to improve this in late 2016 (Linux 4.10). This code is referred to as "writeback throttling" or WBT. Searching the web for wbt_lat_usec also finds a few more stories about this. (The initial doc writes about wb_lat_usec, but it is out of date.) Be aware that writeback throttling does not work with the CFQ or BFQ I/O schedulers. CFQ has been popular as a default I/O scheduler, including in default kernel builds up until Linux v4.20. CFQ was removed in kernel v5.0.
There were tests to illustrate the problem (and prototype solution) on both an SSD (which looked like NVMe) and a "regular hard drive". The hard drive was "not quite as bad as deeper queue depth devices, where we have hugely bursty IO".
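On a kernel with WBT (v4.10 or later), you can check and tune it per device through sysfs. A small sketch (sda is an example device; the value is a target latency in microseconds, and 0 means WBT is disabled, which is what you will also see when it is not in use, e.g. under CFQ/BFQ as noted above):

    # Current writeback-throttling target latency
    cat /sys/block/sda/queue/wbt_lat_usec

    # Set a 75 ms target (needs root)
    echo 75000 > /sys/block/sda/queue/wbt_lat_usec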
I'm not sure about the "thousands" of queued requests, but there are at least NVMe devices which can queue hundreds of requests. Most SATA hard drives allow 32 requests to be queued ("NCQ"). Of course the hard drive would take longer to complete each request.
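If you are curious about your own hardware, the relevant queue limits are visible in sysfs. A small sketch for a SATA/SCSI disk (sda is an example; the first file is the block layer's request limit, the second is the device's NCQ/SCSI queue depth):

    cat /sys/block/sda/queue/nr_requests
    cat /sys/block/sda/device/queue_depth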
4. Limitations of "no-I/O dirty throttling"?
"No-I/O dirty throttling" is quite a complex engineered system. It has also been tweaked over time. I am sure there were, and still are, some limitations inside this code.
The LWN writeup, code/patch comments, and the slides from the detailed presentation show that a large number of scenarios have been considered. This includes the notorious slow USB stick v.s. fast main drive. The test cases include the phrase "1000 concurrent dd's" (i.e. sequential writers).
So far, I do not know how to demonstrate and reproduce any limitation inside the dirty throttling code.
I have seen several descriptions of problem fixes which were outside of the dirty throttling code. The most recent fix I found was in 2014 - see subsequent sections. In the thread that LWN is reporting on, we learn:
Mel Gorman also said there were some "outstanding issues".
This passage was the only thing I could find in the reported discussion thread, that comes anywhere near backing up the LWN interpretation. I wish I understood what it was referring to :-(. Or how to demonstrate it, and why it did not seem to come up as a significant issue in the tests that Artem and Linus ran.
5. Genuine reports of "USB-stick stall" problems
Although neither Artem nor Linus reported a "USB-stick stall" that affected the whole system, we can find several reports of this elsewhere. This includes reports in recent years - well after the last known fix.
I do not know what the difference is. Maybe their test conditions were different in some way, or maybe there are some new problem(s) created in the kernel since 2013...
6. The dirty limit was calculated incorrectly [2014]
There was an interesting fix in January 2014 (applied in kernel v3.14). In the question, we said the default limit was set to 20% of memory. Actually, it is set to 20% of memory which is available for dirty page cache. For example, the kernel buffers data sent over TCP/IP network sockets. The socket buffers cannot be dropped and replaced with dirty page cache :-).
The problem was that the kernel was counting swappable memory, as if it could swap data out in favour of dirty page cache. Although this is possible in theory, the kernel is strongly biased to avoid swapping, and prefer dropping page cache instead. This problem was illustrated by - guess what - a test involving writing to a slow USB stick, and noticing that it caused stalls across the entire system :-).
See Re: [patch 0/2] mm: reduce reclaim stalls with heavy anon and dirty cache. The fix is that dirty_ratio is now treated as a proportion of file cache only.
According to the kernel developer who suffered the problem, "the trigger conditions seem quite plausible - high anon memory usage w/ heavy buffered IO and swap configured - and it's highly likely that this is happening in the wild." So this might account for some user reports around 2013 or earlier.
7. Huge page allocations blocking on IO [2011]
This was another issue: Huge pages, slow drives, and long delays (LWN.net, November 2011). This issue with huge pages should now be fixed.
Also, despite what the article says, I think most current Linux PCs do not really use huge pages. This might be changing starting with Debian 10. However, even as Debian 10 starts allocating huge pages where possible, it seems clear to me that it will not impose any delays, unless you change another setting called defrag to "always".
8. "Dirty pages reaching the end of the LRU" [pre-2013]
I have not looked into this, but I found it interesting:
If these are two different "reaching the end of the LRU" problems, then the first one sounds like it could be very bad. It sounds like when a dirty page becomes the least recently used page, any attempt to allocate memory would be delayed until that dirty page finished being written.
Whatever it means, he says the problem is now fixed.
[1] One exception: for a while, the Debian package manager dpkg used sync() to improve performance. This was removed, because of the exact problem that sync() could take an extremely long time. They switched to an approach using sync_file_range() on Linux. See Ubuntu bug #624877, comment 62.
Part of a previous attempt at answering this question - this should mostly be redundant:
I think we can explain both of Artem's reports as being consistent with the "No-I/O dirty throttling" code.
The dirty throttling code aims to allow each backing device a fair share of the "total write-back cache", "that relates to its current average writeout speed in relation to the other devices". This phrasing is from the documentation of /sys/class/bdi/.[2]
In the simplest case, only one backing device is being written to. In that case, the device's fair share is 100%. write() calls are throttled to control the overall writeback cache, and keep it at a "setpoint".
Writes start being throttled half-way between dirty_background_ratio - the point that initiates background writeout - and dirty_ratio - the hard limit on the writeback cache. By default, these are 10% and 20% of available memory.
For example, you could still fill up to 15% writing to your main disk only. You could have gigabytes of cached writes, according to how much RAM you have. At that point, write() calls will start being throttled to match the writeback speed - but that's not a problem. I expect the hang problems are for read() and fsync() calls, which get stuck behind large amounts of unrelated IO. This is the specific problem addressed by the "writeback throttling" code. Some of the WBT patch submissions include problem descriptions, showing the horrific delays this causes.
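As a very rough illustration of "gigabytes of cached writes": using the "free memory + page cache" approximation from earlier, you can estimate the 15% throttling point like this (the kernel's real calculation of dirtyable memory differs in detail, so treat this as a ballpark only):

    awk '/^MemFree:|^Cached:/ {avail += $2}
         END {printf "write() throttling starts around %.1f GiB dirty\n", avail * 0.15 / 1048576}' /proc/meminfo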
Similarly, you could fill up the 15% entirely with writes to a USB stick. Further write()s to the USB will be throttled. But the main disk would not be using any of its fair share. If you start calling write() on your main filesystem then it will not be throttled, or will at least be delayed much less. And I think the USB write()s would be throttled even more, to bring the two writers into balance.
I expect the overall writeback cache could temporarily rise above the setpoint. In some more demanding cases, you can hit the hard limit on overall writeback cache. The hard limit defaults to 20% of available memory; the configuration option is dirty_ratio / dirty_bytes. Perhaps you can hit this because a device can slow down (perhaps because of a more random I/O pattern), and dirty throttling does not recognize the change in speed immediately.
[2] You might notice this document suggests you can manually limit the proportion of writeback cache that can be used for a specific partition/filesystem. The setting is called /sys/class/bdi/*/max_ratio. Be aware that "if the device you'd like to limit is the only one which is currently written to, the limiting doesn't have a big effect."