Linux – Prevent large file write from freezing the system

linux

So on my Linux desktop, I'm writing some large file either to a local disk or an NFS mount.

There is some kind of system buffer that the to-be-written data is cached in. (Something in the range of 0.5-2GB on my system, I think?)

If the buffer is full, all file access blocks, effectively freezing the system until the write is done. (I'm pretty sure even read access is blocked.)

What do I need to configure to make sure that never happens?

What I want is:

If a process can't write data to disk (or network mount etc) fast enough, that process can block until the disk catches up, but other processes can still read/write data at a reasonable rate and latency without any interruption.

Ideally, I'd be able to set how much of the total read/write rate of the disk is available to a certain type of program (cp, git, mplayer, firefox, etc), like "all mplayer processes together get at least 10MB/s, no matter what the rest of the system is doing". But "all mplayer instances together get at least 50% of the total rate, no matter what" is fine too. (i.e., I don't care much whether I can set absolute rates or proportions of the total rate).

More importantly (because the most important reads/writes are small), I want a similar setup for latency. Again, I'd like a guarantee that a single process's reads/writes can't block the rest of the system for more than, say, 10 ms (or whatever), no matter what. Ideally, I'd have a guarantee like "mplayer never has to wait more than 10ms for a read/write to get handled, no matter what the system is doing".

This must work no matter how the offending process got started (including what user it's running under etc), so "wrap a big cp in ionice" or whatever is only barely useful. It would only prevent some tasks from predictably freezing everything if I remember to ionice them, but what about a cron job, an exec call from some running daemon, etc?

(I guess I could wrap the worst offenders with a shell script that always ionices them, but even then, looking through ionice's man page, it seems to be somewhat vague about what exact guarantees it gives me, so I'd prefer a more systematic and maintainable alternative.)

Best Answer

Typically, Linux uses a cache to write data to the disk asynchronously. However, the time span between the write request and the actual write, or the amount of unwritten (dirty) data, can become very large. In that situation a crash would result in a huge data loss, and for this reason Linux switches to synchronous writes if the dirty cache becomes too large or too old. Since the write order has to be respected as well, a small IO cannot simply bypass the queue unless it is guaranteed to be completely independent of all earlier queued writes. Thus, dependent writes may cause a huge delay. (Such dependencies can also arise at the file system level; see https://ext4.wiki.kernel.org/index.php/Ext3_Data%3DOrdered_vs_Data%3DWriteback_mode).

My guess is that you are experiencing some kind of buffer bloat in combination with dependent writes. If you write a large file and have a large disk cache, you end up in situations where a huge amount of data has to be written out before a synchronous write can complete. There is a good article on LWN describing the problem: https://lwn.net/Articles/682582/

Work on the schedulers is still ongoing and the situation may improve with newer kernel versions. Until then, there are a few switches that influence the caching behavior on Linux (there are more; see https://www.kernel.org/doc/Documentation/sysctl/vm.txt):

  • dirty_ratio: Contains, as a percentage of total available memory that contains free pages and reclaimable pages, the number of pages at which a process which is generating disk writes will itself start writing out dirty data. The total available memory is not equal to total system memory.
  • dirty_background_ratio: Contains, as a percentage of total available memory that contains free pages and reclaimable pages, the number of pages at which the background kernel flusher threads will start writing out dirty data.
  • dirty_writeback_centisecs: The kernel flusher threads will periodically wake up and write `old' data out to disk. This tunable expresses the interval between those wakeups, in 100'ths of a second. Setting this to zero disables periodic writeback altogether.
  • dirty_expire_centisecs: This tunable is used to define when dirty data is old enough to be eligible for writeout by the kernel flusher threads. It is expressed in 100'ths of a second. Data which has been dirty in-memory for longer than this interval will be written out next time a flusher thread wakes up.
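To see what your system currently uses, these tunables can be read directly from procfs (a quick sketch; the /proc/sys/vm paths are standard on Linux):

```shell
# Print the current writeback tunables from procfs
for f in dirty_ratio dirty_background_ratio \
         dirty_writeback_centisecs dirty_expire_centisecs; do
    printf '%s = %s\n' "$f" "$(cat /proc/sys/vm/$f)"
done
```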

The easiest way to reduce the maximum latency in such situations is to reduce the maximum amount of dirty disk cache and make the background flusher start writing earlier. Of course, this may degrade performance in situations where an otherwise large cache would prevent synchronous writes altogether. For example, you can configure the following in /etc/sysctl.conf:

vm.dirty_background_ratio = 1
vm.dirty_ratio = 5

Please note that the values suitable for your system depend on the amount of available RAM and the disk speed. In extreme conditions, the dirty ratios above might still be too large. E.g., if you have 100GiB of available RAM and your disk writes at about 100MiB/s, the above settings would allow up to 5GiB of dirty cache, which may take about 50 seconds to write out. With dirty_bytes and dirty_background_bytes you can instead set the cache limits in absolute terms.
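For example, to cap the dirty cache at roughly 5 seconds of writeback for a disk that writes about 100MiB/s, you could put something like this in /etc/sysctl.conf (the concrete numbers are assumptions; adapt them to your hardware):

```shell
# Absolute limits instead of percentages. Note that only one of each
# *_bytes / *_ratio pair is in effect at a time: setting one clears the other.
vm.dirty_background_bytes = 104857600   # 100 MiB: background flush starts early
vm.dirty_bytes = 524288000              # 500 MiB: hard limit before writers block
```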

Another thing you can try is switching the IO scheduler. In current kernel releases these are noop, deadline, and cfq. If you are using an older kernel, you might see better response times with the deadline scheduler than with cfq; however, you have to test it. Noop should be avoided in your situation. There is also the non-mainline BFQ scheduler, which claims to reduce latency compared to CFQ (http://algo.ing.unimo.it/people/paolo/disk_sched/); however, it is not included in all distributions. You can check and switch the scheduler at runtime with:

cat /sys/block/sdX/queue/scheduler 
echo <SCHEDULER_NAME> > /sys/block/sdX/queue/scheduler

The first command will also give you a summary of the available schedulers and their exact names (the active one is shown in brackets). Please note: the setting is lost after a reboot. To choose the scheduler permanently, you can add a kernel parameter:

elevator=<SCHEDULER_NAME>
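On a GRUB-based distribution, for example, this can be done by appending the parameter to the kernel command line in /etc/default/grub and regenerating the configuration (paths, commands, and the deadline choice here are assumptions; adapt them to your distribution):

```shell
# In /etc/default/grub, append the parameter to the existing line, e.g.:
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet splash elevator=deadline"
# Then regenerate the GRUB configuration:
sudo update-grub                              # Debian/Ubuntu
# sudo grub2-mkconfig -o /boot/grub2/grub.cfg # Fedora/RHEL and similar
```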

The situation for NFS is similar, but involves other problems as well. The following two bug reports may give some insight into how stat is handled on NFS and why a large file write can make stat very slow:

https://bugzilla.redhat.com/show_bug.cgi?id=688232
https://bugzilla.redhat.com/show_bug.cgi?id=469848

Update (14.08.2017): With kernel 4.10, a new kernel option CONFIG_BLK_WBT and its sub-options CONFIG_BLK_WBT_SQ and CONFIG_BLK_WBT_MQ were introduced. They prevent buffer bloat caused by hardware buffers whose sizes and prioritization cannot be controlled by the kernel:

Enabling this option enables the block layer to throttle buffered
background writeback from the VM, making it more smooth and having
less impact on foreground operations. The throttling is done
dynamically on an algorithm loosely based on CoDel, factoring in
the realtime performance of the disk
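On a kernel built with these options, the throttling latency target can be inspected per device via sysfs. The following is a sketch assuming the wbt_lat_usec attribute exists (it is only present on kernels with writeback throttling; the device name sda is an assumption):

```shell
dev=sda                                 # assumption: adjust to your block device
f=/sys/block/$dev/queue/wbt_lat_usec
if [ -r "$f" ]; then
    # Target latency in microseconds; writing 0 disables WBT,
    # writing -1 restores the kernel default.
    echo "writeback throttle latency target: $(cat "$f") usec"
else
    echo "wbt_lat_usec not available (kernel without CONFIG_BLK_WBT?)"
fi
```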

Furthermore, the BFQ scheduler was mainlined with kernel 4.12.
