Linux – Why were “USB-stick stall” problems reported in 2013? Why wasn’t this problem solved by the existing “No-I/O dirty throttling” code?

Tags: cache, linux

The pernicious USB-stick stall problem – LWN.net, November 2013.

Artem S. Tashkinov recently encountered a problem that will be familiar to at least some LWN readers. Plug a slow storage device (a USB stick, say, or a media player) into a Linux machine and write a lot of data to it. The entire system proceeds to just hang, possibly for minutes.

This time around, though, Artem made an interesting observation: the system would stall when running with a 64-bit kernel, but no such problem was experienced when using a 32-bit kernel on the same hardware.

The article explains that with a 64-bit kernel, the dirty page cache (writeback cache) was allowed to grow to 20% of memory by default. With a 32-bit kernel, it was effectively limited to ~180MB.
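As a quick way to see these settings on a running system, here is a minimal sketch that reads the relevant sysctls and estimates the resulting byte limits. It uses MemAvailable from /proc/meminfo as a stand-in for the kernel's own notion of "dirtyable" memory, which is only an approximation (the exact base has changed between kernel versions):

    def sysctl_int(name):
        # e.g. name = "vm/dirty_ratio"
        with open("/proc/sys/" + name) as f:
            return int(f.read())

    def meminfo_kb(field):
        with open("/proc/meminfo") as f:
            for line in f:
                if line.startswith(field + ":"):
                    return int(line.split()[1])
        raise KeyError(field)

    dirty_ratio = sysctl_int("vm/dirty_ratio")                  # default 20 (%)
    background_ratio = sysctl_int("vm/dirty_background_ratio")  # default 10 (%)
    approx_base_mb = meminfo_kb("MemAvailable") // 1024          # rough approximation only

    print(f"dirty limit      ~ {approx_base_mb * dirty_ratio // 100} MB ({dirty_ratio}%)")
    print(f"background limit ~ {approx_base_mb * background_ratio // 100} MB ({background_ratio}%)")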

Linus suggested limiting it to ~180MB on 64-bit as well; however, current Linux (v4.18) does not do this. Compare Linus's suggested patch with the current function in Linux 4.18. The biggest argument against such changes came from Dave Chinner. He pointed out that reducing buffering too much would cause filesystems to suffer from fragmentation. He also explained that "for streaming IO we typically need at least 5s of cached dirty data to even out delays."
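As a back-of-the-envelope illustration of that argument (the streaming rates below are invented examples, not figures from the thread):

    # "5s of cached dirty data" at different streaming write rates.
    def dirty_cache_needed_mb(rate_mb_per_s, seconds=5):
        return rate_mb_per_s * seconds

    for rate in (30, 100, 500):   # e.g. slow USB stick, hard drive, fast SSD
        print(f"{rate} MB/s streaming -> ~{dirty_cache_needed_mb(rate)} MB of dirty cache")
    # A fixed ~180MB cap would already be too small for the faster devices.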

I am confused. Why did the USB-stick stall cause the entire system to hang?

I am confused because I read an earlier article describing code merged in 2011 (Linux 3.2). It shows the kernel should have been controlling the dirty page cache on a per-device basis:

No-I/O dirty throttling – LWN.net, 2011

That is where Fengguang's patch set comes in. He is attempting to create a control loop capable of determining how many pages each process should be allowed to dirty at any given time. Processes exceeding their limit are simply put to sleep for a while to allow the writeback system to catch up with them.

[…]

The goal of the system is to keep the number of dirty pages at the setpoint; if things get out of line, increasing amounts of force will be applied to bring things back to where they should be.

[…]

This ratio cannot really be calculated, though, without taking the backing device (BDI) into account. A process may be dirtying pages stored on a given BDI, and the system may have a surfeit of dirty pages at the moment, but the wisdom of throttling that process depends also on how many dirty pages exist for that BDI. […] A BDI with few dirty pages can clear its backlog quickly, so it can probably afford to have a few more, even if the system is somewhat more dirty than one might like. So the patch set tweaks the calculated pos_ratio for a specific BDI using a complicated formula looking at how far that specific BDI is from its own setpoint and its observed bandwidth. The end result is a modified pos_ratio describing whether the system should be dirtying more or fewer pages backed by the given BDI, and by how much.
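The quoted description boils down to a feedback loop: a writer is put to sleep for longer, the further the global and per-BDI dirty page counts are above their setpoints, and the slower the backing device is. The sketch below is only a toy version of that idea, with invented numbers; it is not the kernel's actual pos_ratio formula.

    def throttle_pause(global_dirty, global_setpoint, global_limit,
                       bdi_dirty, bdi_setpoint, bdi_limit,
                       pages_dirtied, bdi_writeback_pages_per_sec):
        """Toy model: how long (seconds) should a writer sleep after dirtying pages?"""
        def pressure(dirty, setpoint, limit):
            if dirty <= setpoint:
                return 0.0
            return min(1.0, (dirty - setpoint) / (limit - setpoint))

        # Either the global state or this device's own state can slow the writer down.
        p = max(pressure(global_dirty, global_setpoint, global_limit),
                pressure(bdi_dirty, bdi_setpoint, bdi_limit))
        if p == 0.0:
            return 0.0
        # Sleep roughly long enough for the device to write back what was dirtied,
        # scaled up as the dirty counts approach the hard limits.
        return (pages_dirtied / bdi_writeback_pages_per_sec) * (1.0 + 4.0 * p)

    # A slow USB stick that is already over its own setpoint gets a long pause...
    print(throttle_pause(150_000, 120_000, 200_000, 9_000, 4_000, 10_000, 256, 2_500))
    # ...while a fast main disk, well under its own setpoint, sleeps far less for the same write.
    print(throttle_pause(150_000, 120_000, 200_000, 1_000, 40_000, 100_000, 256, 50_000))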

Per-device control was added even earlier than this: Smarter write throttling (LWN.net, 2007), covering "[PATCH 0/23] per device dirty throttling -v10", which was merged in Linux version 2.6.24.

Best Answer

  1. The 2013 article is wrong
  2. A mistake in LWN? Are you sure?
  3. Long queues in I/O devices, created by "background" writeback
  4. Limitations of "no-I/O dirty throttling"?
  5. Genuine reports of "USB-stick stall" problems
  6. The dirty limit was calculated incorrectly [2014]
  7. Huge page allocations blocking on IO [2011]
  8. "Dirty pages reaching the end of the LRU"? [pre-2013]

1. The 2013 article is wrong

The "USB-stick stall" article gives you a very misleading impression. It misrepresents both the original report, and the series of responses.

Artem did not report the entire system hanging when it flushed cached writes to a USB stick. His original report only complained that running the command "sync" could take up to "dozens of minutes". This distinction is made explicit in a response by Linus Torvalds:

It's actually really easy to reproduce by just taking your average USB key and trying to write to it. I just did it with a random ISO image, and it's painful. And it's not that it's painful for doing most other things in the background, but if you just happen to run anything that does "sync" (and it happens in scripts), the thing just comes to a screeching halt. For minutes.

2. A mistake in LWN? Are you sure?

Jon Corbet had fifteen years of experience reporting on Linux kernel development on a weekly basis. I expected the article to be at least close to getting it right, in some sense. So I wanted to compare the two records, and look for specific points where they agree or disagree.

I read all of the original discussion, using the archives at lore.kernel.org. I think the messages are pretty clear.

I am 100% certain the article misinterprets the discussion. In comments underneath the article, at least two readers repeated the false claim in their own words, and no-one corrected them. The article continues this confusion in the third paragraph:

All that data clogs up the I/O queues, possibly delaying other operations. And, as soon as somebody calls sync(), things stop until that entire queue is written.

This could be confusion arising from Linus saying "the thing just comes to a screeching halt". "The thing" refers to "anything that does sync", but Corbet writes as if "the thing" meant "the entire system".

As per Linus, this is a real-world problem. But the vast majority of "things" do not call into the system-wide sync() operation.[1]
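To make the distinction concrete: fsync() waits only for one file, and syncfs() for one filesystem, while sync() waits for dirty data headed to every device in the system, including a slow USB stick you are not otherwise touching. A small sketch (the file path is just an example):

    import os

    fd = os.open("/tmp/example-data.bin", os.O_WRONLY | os.O_CREAT, 0o644)
    os.write(fd, b"x" * 4096)

    os.fsync(fd)    # waits for this one file's data and metadata only
    os.close(fd)

    # os.sync()     # by contrast, this asks for writeback of *everything*,
                    # and is the call that "comes to a screeching halt"
                    # behind a slow USB stick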

Why might Corbet confuse this with "the entire system"? I guess there have been a number of problems, and after a while it gets hard to keep them all separate in your head :-). And although LWN has described the development of per-device (and per-process) dirty throttling, in general I suppose there is not much written about such details. A lot of documents only describe the global dirty limit settings.

3. Long queues in I/O devices, created by "background" writeback

Artem posted a second report in the thread, where "the server almost stalls and other IO requests take a lot more time to complete".

This second report does not match claims about a USB-stick hang. It happened after creating a 10GB file on an internal disk. This is a different problem.

The report did not confirm whether this could be improved by changing the dirty limits. And there is a more recent analysis of cases like this: the significant problem is when writeback clogs up the I/O queue of your main disk. You can then suffer long delays on a disk that you constantly rely on, to load program code on-demand, save documents and app data using write() + fsync(), etc.

Toward less-annoying background writeback -- LWN.net, 2016

When the memory-management code decides to write a range of dirty data, the result is an I/O request submitted to the block subsystem. That request may spend some time in the I/O scheduler, but it is eventually dispatched to the driver for the destination device.

The problem is that, if there is a lot of dirty data to write, there may end up being vast numbers (as in thousands) of requests queued for the device. Even a reasonably fast drive can take some time to work through that many requests. If some other activity (clicking a link in a web browser, say, or launching an application) generates I/O requests on the same block device, those requests go to the back of that long queue and may not be serviced for some time. If multiple, synchronous requests are generated — page faults from a newly launched application, for example — each of those requests may, in turn, have to pass through this long queue. That is the point where things appear to just stop.

[...]

Most block drivers also maintain queues of their own internally. Those lower-level queues can be especially problematic since, by the time a request gets there, it is no longer subject to the I/O scheduler's control (if there is an I/O scheduler at all).

Patches to improve this were merged in late 2016 (Linux 4.10). This code is referred to as "writeback throttling" or WBT. Searching the web for wbt_lat_usec also finds a few more stories about this. (The initial documentation refers to wb_lat_usec, but it is out of date.) Be aware that writeback throttling does not work with the CFQ or BFQ I/O schedulers. CFQ was popular as a default I/O scheduler, including in default kernel builds up until Linux v4.20; it was removed in kernel v5.0.
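If you want to check whether WBT is active on your own devices, the knob shows up in sysfs. A minimal sketch, assuming a kernel new enough to have WBT built in (the file is simply absent otherwise, for example under CFQ or BFQ):

    import glob

    for path in sorted(glob.glob("/sys/block/*/queue/wbt_lat_usec")):
        dev = path.split("/")[3]
        with open(path) as f:
            # 0 disables WBT; a positive value is the target latency in microseconds
            print(f"{dev}: wbt_lat_usec = {f.read().strip()}")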

There were tests to illustrate the problem (and prototype solution) on both an SSD (which looked like NVMe) and a "regular hard drive". The hard drive was "not quite as bad as deeper queue depth devices, where we have hugely bursty IO".

I'm not sure about the "thousands" of queued requests, but there are at least NVMe devices which can queue hundreds of requests. Most SATA hard drives allow 32 requests to be queued ("NCQ"). Of course the hard drive would take longer to complete each request.
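You can compare the block-layer queue size with the device's own queue depth via sysfs. A rough sketch (device/queue_depth exists for SCSI/SATA devices; NVMe reports its queues elsewhere):

    import glob

    def read_or_none(path):
        try:
            with open(path) as f:
                return f.read().strip()
        except OSError:
            return None

    for queue_dir in sorted(glob.glob("/sys/block/*/queue")):
        dev = queue_dir.split("/")[3]
        nr_requests = read_or_none(queue_dir + "/nr_requests")      # block-layer queue
        hw_depth = read_or_none(queue_dir[:-len("/queue")] + "/device/queue_depth")  # e.g. NCQ
        print(f"{dev}: nr_requests={nr_requests}, device queue_depth={hw_depth}")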

4. Limitations of "no-I/O dirty throttling"?

"No-I/O dirty throttling" is quite a complex engineered system. It has also been tweaked over time. I am sure there were, and still are, some limitations inside this code.

The LWN writeup, code/patch comments, and the slides from the detailed presentation show that a large number of scenarios have been considered. This includes the notorious "slow USB stick vs. fast main drive" case. The test cases include the phrase "1000 concurrent dd's" (i.e. sequential writers).

So far, I do not know how to demonstrate and reproduce any limitation inside the dirty throttling code.

I have seen several descriptions of problem fixes which were outside of the dirty throttling code. The most recent fix I found was in 2014 - see subsequent sections. In the thread that LWN is reporting on, we learn:

In last few releases problems like this were caused by problems in reclaim which got fed up by seeing lots of dirty / under writeback pages and ended up stuck waiting for IO to finish.

[...] The systemtap script caught those type of areas and I believe they are fixed up.

Mel Gorman also said there were some "outstanding issues".

There are still problems though. If all dirty pages were backed by a slow device then dirty limiting is still eventually going to cause stalls in dirty page balancing [...]

This passage was the only thing I could find in the reported discussion thread, that comes anywhere near backing up the LWN interpretation. I wish I understood what it was referring to :-(. Or how to demonstrate it, and why it did not seem to come up as a significant issue in the tests that Artem and Linus ran.

5. Genuine reports of "USB-stick stall" problems

Although neither Artem nor Linus reported a "USB-stick stall" that affected the whole system, we can find several reports of this elsewhere. This includes reports in recent years - well after the last known fix.

I do not know what the difference is. Maybe their test conditions were different in some way, or maybe there are some new problem(s) created in the kernel since 2013...

6. The dirty limit was calculated incorrectly [2014]

There was an interesting fix in January 2014 (applied in kernel v3.14). In the question, we said the default limit was set to 20% of memory. Actually, it is set to 20% of the memory which is available for dirty page cache. For example, the kernel buffers data sent over TCP/IP network sockets; those socket buffers cannot be dropped and replaced with dirty page cache :-).

The problem was that the kernel was counting swappable memory, as if it could swap data out in favour of dirty page cache. Although this is possible in theory, the kernel is strongly biased to avoid swapping, and prefer dropping page cache instead. This problem was illustrated by - guess what - a test involving writing to a slow USB stick, and noticing that it caused stalls across the entire system :-).

See Re: [patch 0/2] mm: reduce reclaim stalls with heavy anon and dirty cache

The fix is that dirty_ratio is now treated as a proportion of file cache only.
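A rough illustration of why this mattered, using invented numbers (the real calculation also accounts for highmem, reserved pages, and so on):

    # Made-up machine: lots of swappable anonymous memory, modest file cache.
    free_mb           = 500
    file_cache_mb     = 1000
    anon_swappable_mb = 6000     # large application heaps, with swap configured
    dirty_ratio       = 20       # percent

    old_basis = free_mb + file_cache_mb + anon_swappable_mb   # pre-3.14: anon counted
    new_basis = free_mb + file_cache_mb                       # post-fix: file cache only

    print(f"old dirty limit ~ {old_basis * dirty_ratio // 100} MB")   # ~1500 MB
    print(f"new dirty limit ~ {new_basis * dirty_ratio // 100} MB")   # ~300 MB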

According to the kernel developer who suffered the problem, "the trigger conditions seem quite plausible - high anon memory usage w/ heavy buffered IO and swap configured - and it's highly likely that this is happening in the wild." So this might account for some user reports around 2013 or earlier.

7. Huge page allocations blocking on IO [2011]

This was another issue: Huge pages, slow drives, and long delays (LWN.net, November 2011). This issue with huge pages should now be fixed.

Also, despite what the article says, I think most current Linux PCs do not really use huge pages. This might be changing starting with Debian 10. However, even as Debian 10 starts allocating huge pages where possible, it seems clear to me that it will not impose any delays, unless you change another setting called defrag to "always".

8. "Dirty pages reaching the end of the LRU" [pre-2013]

I have not looked into this, but I found it interesting:

mgorman 2011: This is a new type of USB-related stall because it is due to synchronous compaction writing where as in the past the big problem was dirty pages reaching the end of the LRU and being written by reclaim.

mgorman 2013: The work in that general area dealt with such problems as dirty pages reaching the end of the LRU (excessive CPU usage)

If these are two different "reaching the end of the LRU" problems, then the first one sounds like it could be very bad. It sounds as if, when a dirty page became the least recently used page, any attempt to allocate memory could be delayed until that dirty page finished being written.

Whatever it means, he says the problem is now fixed.


[1] One exception: for a while, the Debian package manager dpkg used sync() to improve performance. This was removed, because of the exact problem that sync() could take an extremely long time. They switched to an approach using sync_file_range() on Linux. See Ubuntu bug #624877, comment 62.


Part of a previous attempt at answering this question - this should mostly be redundant:

I think we can explain both of Artem's reports as being consistent with the "No-I/O dirty throttling" code.

The dirty throttling code aims to allow each backing device a fair share of the "total write-back cache", "that relates to its current average writeout speed in relation to the other devices". This phrasing is from the documentation of /sys/class/bdi/.[2]

In the simplest case, only one backing device is being written to. In that case, the device's fair share is 100%. write() calls are throttled to control the overall writeback cache, and keep it at a "setpoint".

Writes start being throttled half-way between dirty_background_ratio - the point that initiates background writeout - and dirty_ratio - the hard limit on the writeback cache. By default, these are 10% and 20% of available memory.
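As a small worked example with the default ratios (the machine size is made up):

    dirty_background_ratio = 10    # percent of "dirtyable" memory
    dirty_ratio            = 20
    available_gb           = 8     # example machine

    background_gb = available_gb * dirty_background_ratio / 100   # 0.8 GB: background writeout starts
    limit_gb      = available_gb * dirty_ratio / 100              # 1.6 GB: hard limit
    throttle_gb   = (background_gb + limit_gb) / 2                # 1.2 GB: write() throttling starts (the "15%")

    print(background_gb, throttle_gb, limit_gb)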

For example, you could still fill up to 15% writing to your main disk only. You could have gigabytes of cached writes, according to how much RAM you have. At that point, write() calls will start being throttled to match the writeback speed - but that's not a problem. I expect the hang problems are for read() and fsync() calls, which get stuck behind large amounts of unrelated IO. This is the specific problem addressed by the "writeback throttling" code. Some of the WBT patch submissions include problem descriptions, showing the horrific delays this causes.

Similarly, you could fill up the 15% entirely with writes to a USB stick. Further write()s to the USB will be throttled. But the main disk would not be using any of its fair share. If you start calling write() on your main filesystem then it will not be throttled, or will at least be delayed much less. And I think the USB write()s would be throttled even more, to bring the two writers into balance.

I expect the overall writeback cache could temporarily rise above the setpoint. In some more demanding cases, you can hit the hard limit on overall writeback cache. The hard limit defaults to 20% of available memory; the configuration option is dirty_ratio / dirty_bytes. Perhaps you can hit this because a device can slow down (perhaps because of a more random I/O pattern), and dirty throttling does not recognize the change in speed immediately.


[2] You might notice this document suggests you can manually limit the proportion of writeback cache that can be used for a specific partition/filesystem. The setting is called /sys/class/bdi/*/max_ratio. Be aware that "if the device you'd like to limit is the only one which is currently written to, the limiting doesn't have a big effect."
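For completeness, setting it looks roughly like this. The BDI name "8:16" is a hypothetical USB stick (BDIs appear under /sys/class/bdi/ by major:minor number), and writing to sysfs needs root:

    # Hypothetical example: cap one device at ~10% of the writeback cache.
    bdi = "8:16"    # made-up major:minor of the USB stick's backing device
    with open(f"/sys/class/bdi/{bdi}/max_ratio", "w") as f:
        f.write("10")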