I am neither concerned about RAM usage (as I've got enough) nor about losing data in case of an accidental shut-down (as my power is backed up, the system is reliable and the data are not critical). But I do a lot of file processing and could use a performance boost.
That's why I'd like to set the system up to use more RAM for file system read and write caching, to prefetch files aggressively (e.g. read ahead the whole file accessed by an application if the file is of a sane size, or at least read ahead a big chunk of it otherwise) and to flush write buffers less frequently. How can I achieve this (if it is possible at all)?
I use ext3 and ntfs (I use ntfs a lot!) file systems with Xubuntu 11.10 x86.
Best Answer
Improving disk cache performance in general is more than just increasing the file system cache size, unless your whole system fits in RAM, in which case you should use a RAM drive (tmpfs is good because it allows falling back to disk if you need the RAM in some cases) for runtime storage (and perhaps an initrd script to copy the system from storage to the RAM drive at startup).

You didn't say whether your storage device is an SSD or an HDD. Here's what I've found to work for me (in my case sda is an HDD mounted at /home and sdb is an SSD mounted at /).

First optimize the load-stuff-from-storage-to-cache part:
Here's my setup for HDD (make sure AHCI+NCQ is enabled in BIOS if you have toggles):
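Something along these lines (sda is the HDD from above; the exact numbers are illustrative starting points, tune them per the notes below):

```
echo cfq > /sys/block/sda/queue/scheduler
echo 10000 > /sys/block/sda/queue/iosched/fifo_expire_async  # allow async (write) requests to wait up to 10 s
echo 100 > /sys/block/sda/queue/iosched/slice_sync           # long sync slice for single-process throughput
echo 4 > /sys/block/sda/queue/iosched/slice_idle             # compromise value in the 3-20 range
echo 6 > /sys/block/sda/queue/iosched/quantum                # 3-8 works well with HDDs
echo 2 > /sys/block/sda/queue/iosched/slice_async_rq         # keep async slices short...
echo 10 > /sys/block/sda/queue/iosched/slice_async           # ...so writes cannot add much read latency
hdparm -q -M 254 /dev/sda                                    # AAM at maximum performance (disk may be louder)
```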
Worth noting for the HDD case are the high fifo_expire_async (usually writes) and the long slice_sync, which allow a single process to get high throughput (set slice_sync to a lower number if you hit situations where multiple processes are waiting for data from the disk in parallel). slice_idle is always a compromise for HDDs, but setting it somewhere in the range 3-20 should be okay depending on disk usage and disk firmware. I prefer to target low values, but setting it too low will destroy your throughput. The quantum setting seems to affect throughput a lot, but try to keep it as low as possible to keep latency at a sensible level. Setting quantum too low will destroy throughput. Values in the range 3-8 seem to work well with HDDs. The worst-case latency for a read is (quantum * slice_sync) + (slice_async_rq * slice_async) ms, if I've understood the kernel behavior correctly. The async slices are mostly used by writes, and since you're willing to delay writing to disk, set both slice_async_rq and slice_async to very low numbers. However, setting slice_async_rq too low may stall reads, because writes can no longer be delayed after reads. My config will try to write data to disk at most 10 seconds after the data has been passed to the kernel, but since you can tolerate loss of data on power loss, you could also set fifo_expire_async to 3600000 to say that a 1-hour delay to disk is okay. Just keep slice_async low, though, because otherwise you can get high read latency.

The hdparm command is required to prevent AAM from killing much of the performance that AHCI+NCQ allows. If your disk makes too much noise, then skip this.

Here's my setup for SSD (Intel 320 series):
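Again a sketch (sdb is the SSD from above; values are only illustrative):

```
echo cfq > /sys/block/sdb/queue/scheduler
echo 1 > /sys/block/sdb/queue/iosched/slice_idle             # 0 or 1; test both (see below)
echo 10 > /sys/block/sdb/queue/iosched/slice_sync            # short slices: the SSD has no seek penalty
echo 2 > /sys/block/sdb/queue/iosched/slice_async
echo 1 > /sys/block/sdb/queue/iosched/slice_async_rq
echo 10000 > /sys/block/sdb/queue/iosched/fifo_expire_async  # writes may still wait up to 10 s
```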
Here it's worth noting the low values for the different slice settings. The most important setting for an SSD is slice_idle, which must be set to 0-1. Setting it to zero moves all ordering decisions to native NCQ, while setting it to 1 allows the kernel to order requests (but if NCQ is active, the hardware may partially override the kernel ordering). Test both values to see if you can see a difference. For the Intel 320 series, it seems that setting slice_idle to 0 gives the best throughput but setting it to 1 gives the best (lowest) overall latency.

For more information about these tunables, see https://www.kernel.org/doc/Documentation/block/cfq-iosched.txt
Update in 2020, kernel version 5.3 (cfq is dead):
The setup is pretty similar, but I now use bfq instead of cfq because the latter is not available with modern kernels. I try to keep nr_requests as low as possible to allow bfq to control the scheduling more accurately. At least Samsung SSD drives seem to require a pretty deep queue to be able to run with high IOPS.

I'm using Ubuntu 18.04 with the kernel package linux-lowlatency-hwe-18.04-edge, which has bfq only as a module, so I need to load it before being able to switch to it.
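A sketch of what that switch can look like (the nr_requests value is only illustrative; some drives, as noted, want a deeper queue for high IOPS):

```
# bfq is only built as a module in this kernel package, so load it first
modprobe bfq
# switch all SATA disks to bfq and keep the queue shallow so that bfq,
# rather than the device, does most of the request ordering
for q in /sys/block/sd?/queue; do
    echo bfq > "$q/scheduler"
    echo 4 > "$q/nr_requests"
done
```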
I also nowadays use zram, but I only use 5% of RAM for it. This allows the Linux kernel to use its swapping-related logic without touching the disks. However, if you decide to go with zero disk swap, make sure your apps do not leak RAM, or you're wasting money.

Now that we have configured the kernel to load stuff from disk to cache with sensible performance, it's time to adjust the cache behavior:
According to benchmarks I've done, I wouldn't bother setting read-ahead via blockdev at all. The kernel default settings are fine.

Set the system to prefer swapping file data over application code (this does not matter if you have enough RAM to keep the whole filesystem, all the application code and all virtual memory allocated by applications in RAM). This reduces latency when swapping between different applications, at the cost of latency when accessing big files from a single application:
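A low swappiness value does this; the exact number here is illustrative:

```
# prefer dropping file cache over swapping out application memory
echo 15 > /proc/sys/vm/swappiness
```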
If you prefer to keep applications nearly always in RAM, you could set this to 1. If you set this to zero, the kernel will not swap at all unless absolutely necessary to avoid OOM. If you were memory-limited and working with big files (e.g. HD video editing), then it might make sense to set this close to 100.
I nowadays (2017) prefer to have no swap at all if you have enough RAM. Having no swap will usually lose 200-1000 MB of RAM on a long-running desktop machine. I'm willing to sacrifice that much to avoid worst-case-scenario latency (swapping application code in when RAM is full). In practice, this means that I prefer the OOM Killer to swapping. If you allow/need swapping, you might want to increase /proc/sys/vm/watermark_scale_factor, too, to avoid some latency. I would suggest values between 100 and 500. You can consider this setting as trading CPU usage for lower swap latency. The default is 10 and the maximum possible is 1000. A higher value should (according to the kernel documentation) result in higher CPU usage for kswapd processes and lower overall swapping latency.
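For example, a value in the middle of that range (illustrative, and only relevant if you keep swap enabled):

```
# make kswapd start background reclaim earlier (default 10, maximum 1000)
echo 250 > /proc/sys/vm/watermark_scale_factor
```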
Next, tell the kernel to prefer keeping the directory hierarchy in memory over file contents in case some RAM needs to be freed (again, if everything fits in RAM, this setting does nothing):
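A low value does this; 10 is an illustrative choice, see the discussion below:

```
# prefer keeping dentry/inode (directory) caches over plain file data
echo 10 > /proc/sys/vm/vfs_cache_pressure
```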
Setting vfs_cache_pressure to a low value makes sense because in most cases the kernel needs to know the directory structure before it can use file contents from the cache, and flushing the directory cache too soon will make the file cache next to worthless. Consider going all the way down to 1 with this setting if you have lots of small files (my system has around 150K 10-megapixel photos and counts as a "lots of small files" system). Never set it to zero, or the directory structure is always kept in memory even if the system is running out of memory. Setting this to a big value is sensible only if you have just a few big files that are constantly being re-read (again, HD video editing without enough RAM would be an example case). The official kernel documentation says that "increasing vfs_cache_pressure significantly beyond 100 may have negative performance impact".

Exception: if you have a truly massive amount of files and directories and you rarely touch/read/list all of them, setting vfs_cache_pressure higher than 100 may be wise. This only applies if you do not have enough RAM to keep the whole directory structure in RAM while still having enough RAM for the normal file cache and processes (e.g. a company-wide file server with lots of archival content). If you feel that you need to increase vfs_cache_pressure above 100, you're running without enough RAM. Increasing vfs_cache_pressure may help, but the only real fix is to get more RAM. Having vfs_cache_pressure set to a high number sacrifices average performance for more stable performance overall (that is, you can avoid really bad worst-case behavior but have to deal with worse overall performance).

Finally, tell the kernel to use up to 99% of the RAM as cache for writes and instruct the kernel to use up to 50% of RAM before slowing down the process that's writing (the default for dirty_background_ratio is 10). Warning: I personally would not do this, but you claimed to have enough RAM and are willing to lose the data. And tell it that a 1-hour write delay is okay to even start writing stuff to the disk (again, I would not do this):
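A sketch matching the description above (the numbers are illustrative and, as said, risky if you cannot tolerate losing dirty data):

```
echo 99 > /proc/sys/vm/dirty_ratio                    # a writing process is throttled only once 99% of RAM is dirty
echo 50 > /proc/sys/vm/dirty_background_ratio         # background writeback starts at 50% dirty (default 10)
echo 360000 > /proc/sys/vm/dirty_expire_centisecs     # dirty data may sit unwritten for up to 1 hour
echo 360000 > /proc/sys/vm/dirty_writeback_centisecs  # flusher threads wake up only about once per hour
```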
For more information about these tunables, see https://www.kernel.org/doc/Documentation/sysctl/vm.txt
If you put all of those in /etc/rc.local and include the following at the end, everything will be in the cache as soon as possible after boot (only do this if your filesystem really fits in RAM):
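A sketch of such a cache warm-up; it simply reads every regular file once, in the background at idle I/O priority, so it ends up in the page cache:

```
# read every regular file once so it ends up in the page cache
(nice find / -type f -not -path '/sys/*' -not -path '/proc/*' -print0 2>/dev/null \
  | nice ionice -c 3 wc -l --files0-from=- > /dev/null) &
```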
Or a bit simpler alternative, which might work better (cache only /home and /usr; only do this if your /home and /usr really fit in RAM):
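Same idea, restricted to those two trees:

```
# warm the cache only for /home and /usr
(nice find /home /usr -type f -print0 2>/dev/null \
  | nice ionice -c 3 wc -l --files0-from=- > /dev/null) &
```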