System Lags During Large R/W Operations on External Disks

Tags: dd, io, performance, usb-drive

I am having issues with system-wide latency/lag when doing large disk imaging operations on an Ubuntu 18.04 system. Here are the system specs:

Processor: Intel Core i7 (never near capacity on any core)

Memory: 12GB (never near capacity)

System disk: SSD (never near capacity)

External disks: 5400 RPM and 7200 RPM spinning disks over USB 3.0

These large disk imaging operations are basically:

nice ionice dd if=/dev/usbdisk1 of=/dev/usbdisk2

Since none of my system files are on any USB disks, in theory, this shouldn't introduce much latency. But I find when I'm imaging more than one USB disk, the system just comes to a crawl. Why? My understanding is that each disk has its own IO queue, so what's going on here? How can I remedy it?

Also, FWIW, I don't care at all about the imaging speed of the USB disks, so solutions which slow these operations in favor of the system running smoothly are fine by me.

Best Answer

How can I remedy it?

When you write a disk image, use dd with oflag=direct. The resulting O_DIRECT writes bypass the page cache. Note that oflag=direct requires a larger block size to get good performance. Here is an example:

dd if=/dev/usbdisk1 of=/dev/usbdisk2 oflag=direct bs=32M status=progress

NOTE: Sometimes you might want to pipe a disk image from another program, such as gunzip. In this case, good performance also depends on iflag=fullblock and piping through another dd command. There is a full example in the answer here: Why does a gunzip to dd pipeline slow down at the end?
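As a runnable sketch of that kind of pipeline (temp files under /tmp stand in for the real devices; on a real target the final dd would also take oflag=direct bs=32M as above):

```shell
# Sketch: pipe a compressed image through dd. iflag=fullblock makes the
# middle dd keep reading until each 1 MiB block is full, so the writer
# downstream sees full-size blocks rather than short pipe reads.
dd if=/dev/urandom of=/tmp/disk.img bs=1M count=4 status=none
gzip -c /tmp/disk.img > /tmp/disk.img.gz
gunzip -c /tmp/disk.img.gz \
    | dd iflag=fullblock bs=1M status=none \
    | dd of=/tmp/out.img bs=1M status=none
cmp /tmp/disk.img /tmp/out.img && echo "images match"   # prints "images match"
```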

(An alternative solution is to use oflag=sync instead of oflag=direct. This works by never letting a large backlog of unwritten cache pages build up.)
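A minimal runnable sketch of the oflag=sync variant, with a temp file standing in for a real device (the path is just an example):

```shell
# oflag=sync opens the output with O_SYNC: each 1 MiB block is flushed
# to the output before dd moves on, so dirty pages never pile up.
dd if=/dev/zero of=/tmp/sync-test.img oflag=sync bs=1M count=4 status=none
stat -c %s /tmp/sync-test.img    # → 4194304
```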

My understanding is that each disk has its own IO queue, so what's going on here?

They do. However, the written data is first stored in the system page cache (in RAM) before IO is queued to the device.


EDIT:

Since this answer was accepted, I assume you re-tested with oflag=direct and that it fixed your problem where "the system just comes to a crawl". Great.

The safest option would be to add iflag=direct as well. Without this option, dd still reads the data through the system page cache. I assume you did not add this option without telling me. This is one hint towards your specific problem.

It should be clear that reading too much data through the page cache could affect system performance. The total amount of data you are pushing through the page cache is several times larger than your system RAM :-). Depending on the pattern of reads, the kernel could decide to start dropping (or swapping) other cached data to make space.

The kernel does not have infallible foresight. If you need to use the data that was dropped from the cache, it will have to be re-loaded from your disk/SSD. The evidence seems to tell us this is not your problem.

Dirty page cache limits

However, more likely your problem has to do with writing data through the page cache. The unwritten cache, aka "dirty" page cache, is limited. For example you can imagine the overall dirty page cache is limited to 20% of RAM. (This is a convenient lie to imagine. The truth is messily written here).
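To put that imagined 20% in concrete terms for this system's 12 GB of RAM (illustrative numbers only, not your actual tunables):

```shell
# Back-of-envelope: with 12 GiB of RAM and a 20% dirty_ratio, roughly
# 2.4 GiB of unwritten data can accumulate before writers start blocking.
ram_kib=$((12 * 1024 * 1024))   # 12 GiB expressed in KiB
dirty_ratio=20                  # illustrative, not necessarily your system's value
echo "$(( ram_kib * dirty_ratio / 100 / 1024 )) MiB of dirty cache allowed"
# → 2457 MiB of dirty cache allowed
```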

If your dd command(s) manage to fill the maximum dirty page cache, they will be forced to "block" (wait) until some of the data has been written out.

But at the same time, any other program which wants to write will also be blocked (unless it uses O_DIRECT). This can stall a lot of your desktop programs e.g. when they try to write log files. Even though they are writing to a different device.

The overall dirty limit is named dirty_ratio or dirty_bytes. But the full story is much more complicated. There is supposed to be some level of arbitration between the dirty cache for different devices. There are earlier thresholds that kick in, and try to limit the proportion of the maximum dirty cache used by any one device. It is hard to understand exactly how well it all works though.
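These knobs live under /proc/sys/vm on Linux. A quick way to inspect the current values (they vary per system, so no expected output is shown):

```shell
# Print the kernel's writeback (dirty cache) tunables.
# A *_bytes value of 0 means the corresponding *_ratio value is in effect.
for f in dirty_ratio dirty_background_ratio dirty_bytes dirty_background_bytes \
         dirty_expire_centisecs dirty_writeback_centisecs; do
    printf '%-28s %s\n' "$f" "$(cat /proc/sys/vm/$f)"
done
```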

You mention you have the problem when imaging "more than one USB disk". For example, maybe the per-device thresholds work well when you are writing one of your disks, but break down once you are writing more than one at the same time. But that's just a thought; I don't know exactly what's happening.

Related:

Some users have observed their whole system lags when they write to slow USB sticks, and found that lowering the overall dirty limit helped avoid the lag. I do not know a good explanation for this.
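For reference, lowering the overall limits looks roughly like this (hypothetical values, not a recommendation; requires root, applies until reboot, and setting the *_bytes knobs overrides the *_ratio knobs):

```shell
# Hypothetical tuning: shrink the global dirty limits so writeback starts
# sooner and far less unwritten data can accumulate in the page cache.
sudo sysctl vm.dirty_background_bytes=16777216   # begin background writeback at 16 MiB
sudo sysctl vm.dirty_bytes=50331648              # block writers beyond 48 MiB
```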

Why were "USB-stick stall" problems reported in 2013? Why wasn't this problem solved by the existing "No-I/O dirty throttling" code?

Is "writeback throttling" a solution to the "USB-stick stall problem"?
