Linux – Make or force tmpfs to swap before the file cache

Tags: cache, linux, swap, tmpfs, zram

Consider the following scenario. You have a slow, read-only medium (e.g. a write-protected thumb drive, a CD/DVD, whatever) that you installed Linux on (not a Live CD per se, but a normal build), and you use it on a computer with literally no other form of storage. It's slow because it is USB 2. The root filesystem is mounted as overlayfs so that it is "writeable" for logs and other temporary work, but all writes go to RAM (a tmpfs upperdir). Pretty typical setup for a Live-distro situation.

Since there are no other forms of storage, swap is mounted on zram. So when Linux decides to swap, it compresses those pages and stores them, still in RAM, but at least they're compressed. This is actually decent, since application memory is usually easily compressible (it tends to be quite redundant, because it is laid out for speed rather than density). This works well for application memory, but not for tmpfs.
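
For reference, a minimal zram swap setup looks roughly like this; the compression algorithm, size and priority below are assumptions, adjust them for your hardware:

    # minimal sketch of a zram swap device; algorithm, size and priority are assumptions
    modprobe zram
    echo zstd > /sys/block/zram0/comp_algorithm   # must be set before disksize; needs zstd support in the kernel
    echo 4G > /sys/block/zram0/disksize           # uncompressed capacity of the device
    mkswap /dev/zram0
    swapon -p 100 /dev/zram0                      # high priority, so it is preferred over any other swap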

Here's the thing: zram is fast, incredibly so. The thumb drive, on the other hand, is slow. Let's say it does 20 MiB/s, which is really slow in comparison. You can already see the problem, and why the kernel will not do the right thing here on its own.

Note that this question is not a duplicate of How to make files inside TMPFS more likely to swap. The question is pretty much the same, but I'm not satisfied with the answer to that question whatsoever, sorry. The kernel definitely does not do the "right thing" by itself, regardless of how smart its designers are. I dislike it when people don't understand the situation and assume the kernel knows better: it caters to the average case. That's exactly why Linux is so tweakable; no matter how smart it is, it can't predict what it will be used for.

For example, I can (and did) set vm.swappiness (/proc/sys/vm/swappiness) to 100, which tells the kernel to swap application memory aggressively and keep the file cache. This option is nice, but it's not everything, unfortunately.
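
Setting it at runtime and persistently looks like this (the sysctl.d file name is just an example):

    # runtime change (does not survive a reboot)
    sysctl -w vm.swappiness=100
    # equivalent:
    echo 100 > /proc/sys/vm/swappiness
    # persistent, e.g. in /etc/sysctl.d/99-swappiness.conf (file name is arbitrary):
    #   vm.swappiness = 100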

I want it to prioritize keeping the file cache over any other use of RAM when it has to reclaim memory. Dropping the file cache means reading back from the slow 20 MiB/s drive, which is much, much slower than swapping to zram. For application memory, vm.swappiness does the job, but not for tmpfs.

tmpfs lives entirely in the page cache, so it has the same priority as the file cache. If you read a file from tmpfs, the kernel will keep it in preference to an older (less recently used) file-cache entry. But that's bad; the kernel clearly does not do the right thing here. It should recognize that swapping tmpfs to zram is much cheaper, even if those pages were used more recently than the file cache, because re-reading from the drive is very slow.
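
You can see how much of the "cache" is actually tmpfs: tmpfs pages are reported as Shmem in /proc/meminfo and are also counted inside Cached:

    # Shmem (tmpfs and shared memory) is included in the Cached figure
    grep -E 'MemFree|^Cached|Shmem|SwapTotal|SwapFree' /proc/meminfo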

So I need a way to explicitly tell it to swap tmpfs out more readily than it drops the file cache: to preserve the file cache in preference to tmpfs. There are so many options under /proc/sys/vm, but nothing for this that I could find. Disappointing, really.

Failing that, is there a way to tell the kernel that some devices/drives are simply that much slower than others, so it should prefer to keep their cache over others'? tmpfs and zram are fast; the thumb drive is not. Can I give the kernel this information?

It can't do "the right thing" by itself if it treats all drives the same. It's much faster to swap tmpfs out to a fast device like zram than to drop cache belonging to a slow drive, even if the tmpfs pages were used more recently.

When it runs out of free memory, it will start to either swap application memory (good, thanks to swappiness) or drop old file-cache pages (bad). If I end up re-reading those files, it will be very slow, much slower than if it had swapped out some tmpfs, even recently used tmpfs, and later read it back, because zram is an order of magnitude faster.
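
One way to watch which of the two it is doing: pswpin/pswpout count pages swapped in and out, and the workingset_refault counters count pages that were evicted from the cache and then needed again (on recent kernels the refault counter is split per type):

    grep -E 'pswpin|pswpout|workingset_refault' /proc/vmstat
    swapon --show    # confirm zram is the active swap device
    free -h          # rough split between free memory, cache and swap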

Best Answer

Increasing the swappiness value makes the kernel more willing to swap tmpfs pages, and less willing to evict cached pages from the other filesystems which are not backed by swap.

Since zram swap is faster than your thumb drive, you ideally want to increase swappiness above 100. That requires kernel 5.8 or later, which allows swappiness to be set as high as 200.

For in-memory swap, like zram or zswap, [...] values beyond 100 can be considered. For example, if the random IO against the swap device is on average 2x faster than IO from the filesystem, swappiness should be 133 (x + 2x = 200, 2x = 133.33).

-- Documentation/admin-guide/sysctl/vm.rst
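
Applying that formula to your scenario: if swap is r times as fast as the filesystem, swappiness works out to 200 * r / (r + 1). A sketch with assumed numbers (the zram figure is a guess; measure your own):

    # derive swappiness from the formula above; both speeds are assumptions
    swap_speed=400   # MiB/s, assumed zram throughput on this machine
    fs_speed=20      # MiB/s, the thumb drive
    echo $(( 200 * swap_speed / (swap_speed + fs_speed) )) > /proc/sys/vm/swappiness   # -> 190

With zram an order of magnitude or more faster than the drive, the result lands close to the maximum of 200 anyway.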


Further reading

tmpfs is treated the same as any other swappable memory

See the kernel commit "vmscan: split LRU lists into anon & file sets" -

Split the LRU lists in two, one set for pages that are backed by real file systems ("file") and one for pages that are backed by memory and swap ("anon"). The latter includes tmpfs.

- and the code at linux-4.16/mm/vmscan.c:2108 -

/*
 * Determine how aggressively the anon and file LRU lists should be
 * scanned.  The relative value of each set of LRU lists is determined
 * by looking at the fraction of the pages scanned we did rotate back
 * onto the active list instead of evict.
 *
 * nr[0] = anon inactive pages to scan; nr[1] = anon active pages to scan
 * nr[2] = file inactive pages to scan; nr[3] = file active pages to scan
 */
static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
               struct scan_control *sc, unsigned long *nr,
               unsigned long *lru_pages)
{
    int swappiness = mem_cgroup_swappiness(memcg);

...

    /*
     * With swappiness at 100, anonymous and file have the same priority.
     * This scanning priority is essentially the inverse of IO cost.
     */
    anon_prio = swappiness;
    file_prio = 200 - anon_prio;
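
To put numbers on that: at the default swappiness of 60, anon_prio is 60 and file_prio is 140, so the file LRU lists are scanned more than twice as aggressively as the anon lists (which include tmpfs), and file cache gets evicted in preference to swapping; at 100 the two sides are scanned with equal priority.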

Linux 5.8 allows swappiness values up to 200

mm: allow swappiness that prefers reclaiming anon over the file workingset

With the advent of fast random IO devices (SSDs, PMEM) and in-memory swap devices such as zswap, it's possible for swap to be much faster than filesystems, and for swapping to be preferable over thrashing filesystem caches.

Allow setting swappiness - which defines the rough relative IO cost of cache misses between page cache and swap-backed pages - to reflect such situations by making the swap-preferred range configurable.

This was part of a series of patches in Linux 5.8. In previous versions, Linux "mostly goes for page cache and defers swapping until the VM is under significant memory pressure". This is because "the high seek cost of rotational drives under which the algorithm evolved also meant that mistakes could quickly result in lockups from too aggressive swapping (which is predominantly random IO)."

This series sets out to address this. Since commit ("a528910e12ec mm: thrash detection-based file cache sizing") we have exact tracking of refault IO - the ultimate cost of reclaiming the wrong pages. This allows us to use an IO cost based balancing model that is more aggressive about scanning anonymous memory when the cache is thrashing, while being able to avoid unnecessary swap storms.

These patches base the LRU balance on the rate of refaults on each list, times the relative IO cost between swap device and filesystem (swappiness), in order to optimize reclaim for least IO cost incurred.

-- [PATCH 00/14] mm: balance LRU lists based on relative thrashing v2