Linux – Why does the Linux system stutter unless I continuously drop caches

Tags: cache, linux, memory, performance

Over the past few months, I've had an extremely irritating problem with my Linux system: it stutters on Firefox audio playback, mouse movement, etc., with a tiny sub-second (but still noticeable) stutter every few seconds. The issue worsens as the memory cache fills, or when I have highly disk/memory intensive programs running (e.g. backup software restic). However, when the cache isn't full (e.g. under very light load), everything runs very smoothly.

Looking through perf top output, I see that list_lru_count_one has high overhead (~20%) during these periods of lag. htop also shows kswapd0 using 50-90% CPU (though it feels like the impact is much greater than that). During times of extreme lag, the htop CPU meter is often dominated by kernel CPU usage.

The only workaround I've found is either to force the kernel to keep free memory (sysctl -w vm.min_free_kbytes=1024000) or to continuously drop memory caches through echo 3 > /proc/sys/vm/drop_caches. Neither is ideal, of course, and neither completely solves the stuttering; it only becomes less frequent.
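For concreteness, the "continuously drop caches" workaround amounts to something like the loop below (run as root; the 30-second interval is just a value I picked, nothing principled):

  while true; do sync; echo 3 > /proc/sys/vm/drop_caches; sleep 30; done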

Does anyone have any ideas on why this might be occurring?

System Info

  • i7-4820k with 20 GB of (mismatched) DDR3 RAM
  • Reproduced on Linux 4.14-4.18 on NixOS unstable
  • Runs Docker containers and Kubernetes in the background (which I feel like shouldn't create microstuttering?)

What I've already tried

  • Changing I/O schedulers (bfq), using multiqueue I/O schedulers (example commands after this list)
  • Using the -ck patchset by Con Kolivas (didn't help)
  • Disabling swap, changing swappiness, using zram
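For reference, the scheduler experiments boil down to something like the following (sda is just an example device; adjust for your disks):

  # see which schedulers are available and which one is active
  cat /sys/block/sda/queue/scheduler

  # switch the active scheduler at runtime
  echo bfq > /sys/block/sda/queue/scheduler

On 4.14-4.18, bfq is only offered when blk-mq is in use, which on SCSI/SATA disks may require booting with scsi_mod.use_blk_mq=1.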

EDIT: For clarity, here's a picture of htop and perf during such a lag spike. Note the high list_lru_count_one CPU load and the kswapd0 + high kernel CPU usage.

htop and perf output

Best Answer

It sounds like you've already tried many of the things I would have suggested at first (tweaking swap configuration, changing I/O schedulers, etc).

Aside from what you've already tried changing, I would suggest looking into changing the somewhat brain-dead defaults for the VM writeback behavior. This is managed by the following six sysctl values:

  • vm.dirty_ratio: Controls how much dirty data may be pending before writeback is forced on the processes doing the writing. Handles foreground (per-process) throttling, and is expressed as an integer percentage of RAM. Defaults to 20% of RAM.
  • vm.dirty_background_ratio: Controls how much dirty data may be pending before the kernel's flusher threads start writing it back. Handles background (system-wide) writeback, and is expressed as an integer percentage of RAM. Defaults to 10% of RAM.
  • vm.dirty_bytes: Same as vm.dirty_ratio, except expressed as a total number of bytes. Either this or vm.dirty_ratio will be used, whichever was written to last.
  • vm.dirty_background_bytes: Same as vm.dirty_background_ratio, except expressed as a total number of bytes. Either this or vm.dirty_background_ratio will be used, whichever was written to last.
  • vm.dirty_expire_centisecs: How many hundredths of a second data may remain dirty before the flusher threads write it out regardless of the thresholds above. Defaults to 3000 (30 seconds).
  • vm.dirty_writeback_centisecs: How often (in hundredths of a second) the kernel wakes the flusher threads to check for dirty pages to write back. Defaults to 500 (five seconds).

So, with the default values, the kernel will do the following:

  • Every five seconds, write out any modified pages to persistent storage if they have been dirty for more than 30 seconds.
  • Force a process to write out its modified pages itself once the amount of modified memory that hasn't been written out exceeds 20% of RAM.
  • Start writing out modified pages in the background once the total amount of modified memory in the system that hasn't been written out exceeds 10% of RAM.

So, it should be pretty easy to see why the default values may be causing issues for you: on a 20 GB system, the kernel will happily let up to 4 gigabytes of dirty data accumulate before it starts throttling writers, and then flush it to persistent storage in large bursts.

The general consensus these days is to set vm.dirty_background_ratio to 1% of RAM and vm.dirty_ratio to 2%, which for systems with less than about 64GB of RAM results in behavior closer to what was originally intended when the defaults were chosen.
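As a rough sketch (not a drop-in prescription; the percentages above are starting points to tune, and the file name is arbitrary), applying that would look like:

  # try the lower thresholds at runtime first
  sysctl -w vm.dirty_background_ratio=1
  sysctl -w vm.dirty_ratio=2

  # if it helps, persist the settings in /etc/sysctl.d/
  # (or, since you're on NixOS, via the boot.kernel.sysctl option)
  printf 'vm.dirty_background_ratio = 1\nvm.dirty_ratio = 2\n' > /etc/sysctl.d/99-writeback.conf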

Some other things to look into (a combined sysctl sketch follows this list):

  • Try increasing the vm.vfs_cache_pressure sysctl a bit. This controls how aggressively the kernel reclaims memory used for caching directory and inode objects relative to the page cache. The default is 100; don't lower it below 50 (you will get really bad behavior, including OOM conditions), and don't raise it much above about 200 (beyond that, the kernel will waste time trying to reclaim memory it really can't). I've found that bumping it up to 150 visibly improves responsiveness if you have reasonably fast storage.
  • Try changing the memory overcommit mode. This can be done by altering the value of the vm.overcommit_memory sysctl. By default, the kernel will use a heuristic approach to try and predict how much RAM it can actually afford to commit. Setting this to 1 disables the heuristic and tells the kernel to act like it has infinite memory. Setting this to 2 tells the kernel to not commit to more memory than the total amount of swap space on the system plus a percentage of actual RAM (controlled by vm.overcommit_ratio).
  • Try tweaking the vm.page-cluster sysctl. This controls how many pages get swapped in or out at a time (it's a base-2 logarithmic value, so the default of 3 translates to 8 pages). If you're actually swapping, this can help improve the performance of swapping pages in and out.
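Here is what experimenting with those three knobs might look like; the specific values are illustrative starting points, not recommendations:

  # reclaim the dentry/inode caches somewhat more eagerly than the default of 100
  sysctl -w vm.vfs_cache_pressure=150

  # overcommit mode: 0 = heuristic (default), 1 = always overcommit,
  # 2 = limit commitments to swap plus vm.overcommit_ratio percent of RAM
  sysctl -w vm.overcommit_memory=2
  sysctl -w vm.overcommit_ratio=80

  # swap in/out smaller batches: 2^2 = 4 pages instead of the default 2^3 = 8
  sysctl -w vm.page-cluster=2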