Linux – Prevent large file write from freezing the system

linux

So on my Linux desktop, I'm writing some large file either to a local disk or an NFS mount.

There is some kind of system buffer that the to-be-written data is cached in. (Something in the range of 0.5-2GB on my system, I think?)

If the buffer is full, all file access blocks, effectively freezing the system until the write is done. (I'm pretty sure even read access is blocked.)

What do I need to configure to make sure that never happens?

What I want is:

If a process can't write data to disk (or network mount etc) fast enough, that process can block until the disk catches up, but other processes can still read/write data at a reasonable rate and latency without any interruption.

Ideally, I'd be able to set how much of the total read/write rate of the disk is available to a certain type of program (cp, git, mplayer, firefox, etc), like "all mplayer processes together get at least 10MB/s, no matter what the rest of the system is doing". But "all mplayer instances together get at least 50% of the total rate, no matter what" is fine too. (i.e., I don't care much whether I can set absolute rates or proportions of the total rate).

More importantly (because the most important reads/writes are small), I want a similar setup for latency. Again, I'd like a guarantee that a single process's reads/writes can't block the rest of the system for more than, say, 10 ms (or whatever), no matter what. Ideally, I'd have a guarantee like "mplayer never has to wait more than 10ms for a read/write to get handled, no matter what the system is doing".

This must work no matter how the offending process got started (including what user it's running under etc), so "wrap a big cp in ionice" or whatever is only barely useful. It would only prevent some tasks from predictably freezing everything if I remember to ionice them, but what about a cron job, an exec call from some running daemon, etc?

(I guess I could wrap the worst offenders with a shell script that always ionices them, but even then, looking through ionice's man page, it seems to be somewhat vague about what exact guarantees it gives me, so I'd prefer a more systematic and maintainable alternative.)

Best Answer

Typically, Linux uses a cache to write data to the disk asynchronously. However, the time span between the write request and the actual write, or the amount of unwritten (dirty) data, can become very large. In that situation a crash would result in a huge data loss, and for this reason Linux switches to synchronous writes if the dirty cache becomes too large or too old. Since the write order has to be respected as well, a small IO cannot simply bypass the queue unless it is guaranteed to be completely independent of all earlier queued writes. Thus, dependent writes may cause a huge delay. (Such dependencies can also arise at the file system level; see https://ext4.wiki.kernel.org/index.php/Ext3_Data%3DOrdered_vs_Data%3DWriteback_mode).

My guess is that you are experiencing some kind of buffer bloat in combination with dependent writes. If you write a large file and have a large disk cache, you end up in situations where a huge amount of data has to be written out before a synchronous write can complete. There is a good article on LWN describing the problem: https://lwn.net/Articles/682582/

Work on the schedulers is still ongoing and the situation may improve with newer kernel versions. Until then, there are a few switches that influence the caching behavior on Linux (there are more; see https://www.kernel.org/doc/Documentation/sysctl/vm.txt):

  • dirty_ratio: Contains, as a percentage of total available memory that contains free pages and reclaimable pages, the number of pages at which a process which is generating disk writes will itself start writing out dirty data. The total available memory is not equal to total system memory.
  • dirty_background_ratio: Contains, as a percentage of total available memory that contains free pages and reclaimable pages, the number of pages at which the background kernel flusher threads will start writing out dirty data.
  • dirty_writeback_centisecs: The kernel flusher threads will periodically wake up and write `old' data out to disk. This tunable expresses the interval between those wakeups, in 100'ths of a second. Setting this to zero disables periodic writeback altogether.
  • dirty_expire_centisecs: This tunable is used to define when dirty data is old enough to be eligible for writeout by the kernel flusher threads. It is expressed in 100'ths of a second. Data which has been dirty in-memory for longer than this interval will be written out next time a flusher thread wakes up.
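To see what your system currently uses, these tunables can be read directly from procfs (a quick sketch; the /proc/sys/vm paths are standard on Linux):

```shell
# Print the current writeback tunables from procfs
for f in dirty_ratio dirty_background_ratio \
         dirty_writeback_centisecs dirty_expire_centisecs; do
    printf '%s = %s\n' "$f" "$(cat /proc/sys/vm/$f)"
done
```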

The easiest way to reduce the maximum latency in such situations is to reduce the maximum amount of dirty disk cache and make the background flusher start writing earlier. Of course, this may degrade performance in situations where an otherwise large cache would prevent synchronous writes altogether. For example, you can configure the following in /etc/sysctl.conf:

vm.dirty_background_ratio = 1
vm.dirty_ratio = 5

Please note that the values suitable for your system depend on the amount of available RAM and the disk speed. In extreme conditions, the dirty ratios above might still be too large. E.g., if you have 100GiB of available RAM and your disk writes at about 100MiB/s, the above settings would allow up to 5GiB of dirty cache, which may take about 50 seconds to write out. With dirty_bytes and dirty_background_bytes you can instead set the cache limits in absolute terms.
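For example, to cap the dirty cache at roughly 5 seconds of writeback for a disk that writes about 100MiB/s, you could put something like this in /etc/sysctl.conf (the concrete numbers are assumptions; adapt them to your hardware):

```shell
# Absolute limits instead of percentages. Note that only one of each
# *_bytes / *_ratio pair is in effect at a time: setting one clears the other.
vm.dirty_background_bytes = 104857600   # 100 MiB: background flush starts early
vm.dirty_bytes = 524288000              # 500 MiB: hard limit before writers block
```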

Another thing you can try is switching the IO scheduler. In current kernel releases these are noop, deadline, and cfq. If you are using an older kernel, you might see better response times with the deadline scheduler than with cfq; however, you have to test it. Noop should be avoided in your situation. There is also the non-mainline BFQ scheduler, which claims to reduce latency compared to CFQ (http://algo.ing.unimo.it/people/paolo/disk_sched/); however, it is not included in all distributions. You can check and switch the scheduler at runtime with:

cat /sys/block/sdX/queue/scheduler 
echo <SCHEDULER_NAME> > /sys/block/sdX/queue/scheduler

The first command will also give you a summary of the available schedulers and their exact names (the active one is shown in brackets). Please note: the setting is lost after a reboot. To choose the scheduler permanently, you can add a kernel parameter:

elevator=<SCHEDULER_NAME>
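On a GRUB-based distribution, for example, this can be done by appending the parameter to the kernel command line in /etc/default/grub and regenerating the configuration (paths, commands, and the deadline choice here are assumptions; adapt them to your distribution):

```shell
# In /etc/default/grub, append the parameter to the existing line, e.g.:
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet splash elevator=deadline"
# Then regenerate the GRUB configuration:
sudo update-grub                              # Debian/Ubuntu
# sudo grub2-mkconfig -o /boot/grub2/grub.cfg # Fedora/RHEL and similar
```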

The situation for NFS is similar, but involves other problems as well. The following two bug reports may give some insight into how stat is handled on NFS and why a large file write can make stat very slow:

https://bugzilla.redhat.com/show_bug.cgi?id=688232
https://bugzilla.redhat.com/show_bug.cgi?id=469848

Update (14.08.2017): With kernel 4.10, a new kernel option CONFIG_BLK_WBT and its sub-options CONFIG_BLK_WBT_SQ and CONFIG_BLK_WBT_MQ were introduced. They prevent buffer bloat caused by hardware buffers whose sizes and prioritization cannot be controlled by the kernel:

Enabling this option enables the block layer to throttle buffered
background writeback from the VM, making it more smooth and having
less impact on foreground operations. The throttling is done
dynamically on an algorithm loosely based on CoDel, factoring in
the realtime performance of the disk
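On a kernel built with these options, the throttling latency target can be inspected per device via sysfs. The following is a sketch assuming the wbt_lat_usec attribute exists (it is only present on kernels with writeback throttling; the device name sda is an assumption):

```shell
dev=sda                                 # assumption: adjust to your block device
f=/sys/block/$dev/queue/wbt_lat_usec
if [ -r "$f" ]; then
    # Target latency in microseconds; writing 0 disables WBT,
    # writing -1 restores the kernel default.
    echo "writeback throttle latency target: $(cat "$f") usec"
else
    echo "wbt_lat_usec not available (kernel without CONFIG_BLK_WBT?)"
fi
```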

Furthermore, the BFQ scheduler was mainlined with kernel 4.12.
