Linux – Prevent zram LRU inversion with zswap and max_pool_percent = 100

linux swap zram zswap

The major disadvantage of using zram is LRU inversion:

older pages get into the higher-priority zram and quickly fill it, while newer pages are swapped in and out of the slower […] swap

The zswap documentation says that zswap does not suffer from this:

Zswap receives pages for compression through the Frontswap API and is able to
evict pages from its own compressed pool on an LRU basis and write them back to
the backing swap device in the case that the compressed pool is full.

Could I have all the benefits of zram and a completely compressed RAM by setting max_pool_percent to 100?

Zswap seeks to be simple in its policies.  Sysfs attributes allow for one user
controlled policy:
* max_pool_percent - The maximum percentage of memory that the compressed
    pool can occupy.

No default max_pool_percent is specified here, but the Arch Wiki page says that it is 20.

Apart from the performance implications of decompressing, is there any danger / downside in setting max_pool_percent to 100?

Would it operate like an improved, swap-backed zram?
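For concreteness, what I have in mind is something like the following (sysfs paths as given in the zswap documentation; run as root):

```shell
# Enable zswap and allow the compressed pool to grow to 100% of RAM.
# These are the standard zswap module parameters under sysfs.
echo 1   > /sys/module/zswap/parameters/enabled
echo 100 > /sys/module/zswap/parameters/max_pool_percent
grep -H . /sys/module/zswap/parameters/*   # verify the current settings
```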

Best Answer

To answer your question, I first ran a series of experiments. The final conclusions are at the end.

Experiments performed:

1) swap file, zswap disabled
2) swap file, zswap enabled, max_pool_percent = 20
3) swap file, zswap enabled, max_pool_percent = 70
4) swap file, zswap enabled, max_pool_percent = 100
5) zram swap, zswap disabled
6) zram swap, zswap enabled, max_pool_percent = 20
7) no swap
8) swap file, zswap enabled, max_pool_percent = 1
9) swap file (300 M), zswap enabled, max_pool_percent = 100

Setup before the experiment:

  • VirtualBox 5.1.30
  • Fedora 27, xfce spin
  • 512 MB RAM, 16 MB video RAM, 2 CPUs
  • linux kernel 4.13.13-300.fc27.x86_64
  • default swappiness value (60)
  • created an empty 512 MB swap file (300 MB in experiment 9) for possible use during some of the experiments (using dd) but didn't swapon yet
  • disabled all dnf* systemd services and ran watch "killall -9 dnf" to make reasonably sure that dnf would not try to auto-update during an experiment and throw the results off
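In shell terms, the per-experiment setup was roughly the following (a sketch, not the exact commands; file names are illustrative):

```shell
# Create the swap file used in experiments 1-4, 8 and 9, but don't
# activate it yet (swapon is run only when the experiment begins).
dd if=/dev/zero of=/swapfile bs=1M count=512   # count=300 in experiment 9
chmod 600 /swapfile
mkswap /swapfile

# Toggle zswap and set the pool limit for the given experiment, e.g.:
echo 1  > /sys/module/zswap/parameters/enabled         # 0 to disable
echo 20 > /sys/module/zswap/parameters/max_pool_percent
```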

State before the experiment:

[root@user-vm user]# free -m ; vmstat ; vmstat -d 
              total        used        free      shared  buff/cache   available
Mem:            485         280          72           8         132         153
Swap:           511           0         511
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0  74624   8648 127180    0    0  1377   526  275  428  3  2 94  1  0
disk- ------------reads------------ ------------writes----------- -----IO------
       total merged sectors      ms  total merged sectors      ms    cur    sec
sda   102430    688 3593850   67603   3351   8000 1373336   17275      0     26
sr0        0      0       0       0      0      0       0       0      0      0

The subsequent swapon operations and other steps used to establish the different settings for each experiment changed these values by no more than about 2%.

Experiment operation consisted of:

  • Run Firefox for the first time
  • Wait about 40 seconds or until network and disk activity ceases (whichever is longer)
  • Record the following state after the experiment (firefox left running, except for experiments 7 and 9 where firefox crashed)

State after the experiment:

1) swap file, zswap disabled

[root@user-vm user]# free -m ; vmstat ; vmstat -d 
              total        used        free      shared  buff/cache   available
Mem:            485         287           5          63         192          97
Swap:           511         249         262
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0 255488   5904   1892 195428   63  237  1729   743  335  492  3  2 93  2  0
disk- ------------reads------------ ------------writes----------- -----IO------
       total merged sectors      ms  total merged sectors      ms    cur    sec
sda   134680  10706 4848594   95687   5127  91447 2084176   26205      0     38
sr0        0      0       0       0      0      0       0       0      0      0

2) swap file, zswap enabled, max_pool_percent = 20

[root@user-vm user]# free -m ; vmstat ; vmstat -d 
              total        used        free      shared  buff/cache   available
Mem:            485         330           6          33         148          73
Swap:           511         317         194
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0 325376   7436    756 151144    3  110  1793   609  344  477  3  2 93  2  0
disk- ------------reads------------ ------------writes----------- -----IO------
       total merged sectors      ms  total merged sectors      ms    cur    sec
sda   136046   1320 5150874  117469  10024  41988 1749440   53395      0     40
sr0        0      0       0       0      0      0       0       0      0      0

3) swap file, zswap enabled, max_pool_percent = 70

[root@user-vm user]# free -m ; vmstat ; vmstat -d 
              total        used        free      shared  buff/cache   available
Mem:            485         342           8          32         134          58
Swap:           511         393         118
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0 403208   8116   1088 137180    4    8  3538   474  467  538  3  3 91  3  0
disk- ------------reads------------ ------------writes----------- -----IO------
       total merged sectors      ms  total merged sectors      ms    cur    sec
sda   224321   1414 10910442  220138   7535   9571 1461088   42931      0     60
sr0        0      0       0       0      0      0       0       0      0      0

4) swap file, zswap enabled, max_pool_percent = 100

[root@user-vm user]# free -m ; vmstat ; vmstat -d 
              total        used        free      shared  buff/cache   available
Mem:            485         345          10          32         129          56
Swap:           511         410         101
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0 420712  10916   2316 130520    1   11  3660   492  478  549  3  4 91  2  0
disk- ------------reads------------ ------------writes----------- -----IO------
       total merged sectors      ms  total merged sectors      ms    cur    sec
sda   221920   1214 10922082  169369   8445   9570 1468552   28488      0     56
sr0        0      0       0       0      0      0       0       0      0      0

5) zram swap, zswap disabled

[root@user-vm user]# free -m ; vmstat ; vmstat -d 
              total        used        free      shared  buff/cache   available
Mem:            485         333           4          34         147          72
Swap:           499         314         185
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 5  0 324128   7256   1192 149444  153  365  1658   471  326  457  3  2 93  2  0
disk- ------------reads------------ ------------writes----------- -----IO------
       total merged sectors      ms  total merged sectors      ms    cur    sec
sda   130703    884 5047298  112889   4197   9517 1433832   21037      0     37
sr0        0      0       0       0      0      0       0       0      0      0
zram0  58673      0  469384     271 138745      0 1109960     927      0      1

6) zram swap, zswap enabled, max_pool_percent = 20

[root@user-vm user]# free -m ; vmstat ; vmstat -d 
              total        used        free      shared  buff/cache   available
Mem:            485         338           5          32         141          65
Swap:           499         355         144
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0 364984   7584    904 143572   33  166  2052   437  354  457  3  3 93  2  0
disk- ------------reads------------ ------------writes----------- -----IO------
       total merged sectors      ms  total merged sectors      ms    cur    sec
sda   166168    998 6751610  120911   4383   9543 1436080   18916      0     42
sr0        0      0       0       0      0      0       0       0      0      0
zram0  13819      0  110552      78  68164      0  545312     398      0      0

7) no swap

Note that firefox is not running in this experiment at the time of recording these stats.

[root@user-vm user]# free -m ; vmstat ; vmstat -d 
              total        used        free      shared  buff/cache   available
Mem:            485         289          68           8         127         143
Swap:             0           0           0
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0      0  70108  10660 119976    0    0 13503   286  607  618  2  5 88  5  0
disk- ------------reads------------ ------------writes----------- -----IO------
       total merged sectors      ms  total merged sectors      ms    cur    sec
sda   748978   3511 66775042  595064   4263   9334 1413728   23421      0    164
sr0        0      0       0       0      0      0       0       0      0      0

8) swap file, zswap enabled, max_pool_percent = 1

[root@user-vm user]# free -m ; vmstat ; vmstat -d 
              total        used        free      shared  buff/cache   available
Mem:            485         292           7          63         186          90
Swap:           511         249         262
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0 255488   7088   2156 188688   43  182  1417   606  298  432  3  2 94  2  0
disk- ------------reads------------ ------------writes----------- -----IO------
       total merged sectors      ms  total merged sectors      ms    cur    sec
sda   132222   9573 4796802  114450  10171  77607 2050032  137961      0     41
sr0        0      0       0       0      0      0       0       0      0      0

9) swap file (300 M), zswap enabled, max_pool_percent = 100

Firefox got stuck and the system kept reading from disk furiously. The baseline for this experiment is different because a new swap file had been written:

              total        used        free      shared  buff/cache   available
Mem:            485         280           8           8         196         153
Swap:           299           0         299
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0   8948   3400 198064    0    0  1186   653  249  388  2  2 95  1  0
disk- ------------reads------------ ------------writes----------- -----IO------
       total merged sectors      ms  total merged sectors      ms    cur    sec
sda   103099    688 3610794   68253   3837   8084 1988936   20306      0     27
sr0        0      0       0       0      0      0       0       0      0      0

Specifically, an extra 649384 sectors were written as a result of this change.

State after the experiment:

[root@user-vm user]# free -m ; vmstat ; vmstat -d 
              total        used        free      shared  buff/cache   available
Mem:            485         335          32          47         118          53
Swap:           299         277          22
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 7  1 283540  22912   2712 129132    0    0 83166   414 2387 1951  2 23 62 13  0
disk- ------------reads------------ ------------writes----------- -----IO------
       total merged sectors      ms  total merged sectors      ms    cur    sec
sda   3416602  26605 406297938 4710584   4670   9025 2022272   33805      0    521
sr0        0      0       0       0      0      0       0       0      0      0

Subtracting the extra 649384 written sectors from 2022272 gives 1372888. This is less than 1433000 (see below), probably because firefox did not load fully.

I also ran a few experiments with low swappiness values (10 and 1) and they all got stuck in a frozen state with excessive disk reads, preventing me from recording the final memory stats.

Observations:

  • Subjectively, high max_pool_percent values resulted in sluggishness.
  • Subjectively, the system in experiment 9 was so slow as to be unusable.
  • High max_pool_percent values result in the fewest writes, whereas very low max_pool_percent values result in the most.
  • Experiments 5 and 6 (zram swap) suggest that firefox itself wrote data amounting to about 62000 sectors on disk; anything written above about 1433000 sectors is due to swapping. See the following table.
  • If we take the lowest number of sectors read among the experiments as the baseline, we can compare the experiments by how many extra sectors each one read because of swapping.

Written sectors as a direct consequence of swapping (approx.):

650000   1) swap file, zswap disabled
320000   2) swap file, zswap enabled, max_pool_percent = 20
 30000   3) swap file, zswap enabled, max_pool_percent = 70
 40000   4) swap file, zswap enabled, max_pool_percent = 100
 0       5) zram swap, zswap disabled
 0       6) zram swap, zswap enabled, max_pool_percent = 20
-20000   7) no swap (firefox crashed)
620000   8) swap file, zswap enabled, max_pool_percent = 1
-60000   9) swap file (300 M), zswap enabled, max_pool_percent = 100 (firefox crashed)

Extra read sectors as a direct consequence of swapping (approx.):

    51792             1) swap file, zswap disabled
   354072             2) swap file, zswap enabled, max_pool_percent = 20
  6113640             3) swap file, zswap enabled, max_pool_percent = 70
  6125280             4) swap file, zswap enabled, max_pool_percent = 100
   250496             5) zram swap, zswap disabled
  1954808             6) zram swap, zswap enabled, max_pool_percent = 20
 61978240             7) no swap
        0 (baseline)  8) swap file, zswap enabled, max_pool_percent = 1
401501136             9) swap file (300 M), zswap enabled, max_pool_percent = 100
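Both tables come from simple subtraction against the vmstat -d sda totals; for instance, the read table subtracts the experiment 8 baseline (4796802 sectors) from each experiment's total. A throwaway sketch of that arithmetic:

```shell
#!/bin/sh
# Recompute the "extra read sectors" column from the sda read totals
# reported by vmstat -d in each experiment (baseline: experiment 8).
baseline=4796802
for pair in 1:4848594 2:5150874 3:10910442 4:10922082 \
            5:5047298 6:6751610 7:66775042 8:4796802 9:406297938; do
    exp=${pair%%:*}      # experiment number before the colon
    total=${pair##*:}    # total sectors read after the colon
    printf 'experiment %s: %d\n' "$exp" $((total - baseline))
done
```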

Interpretation of results:

  • This is subjective and also specific to the usecase at hand; behavior will vary in other usecases.
  • Zswap's page pool takes away RAM that could otherwise be used by the system's page cache (for file-backed pages), which means the system repeatedly throws away file-backed pages and re-reads them when needed, resulting in excessive reads.
  • The high number of reads in experiment 7 is caused by the same problem - the system's anonymous pages took most of the RAM and file-backed pages had to be repeatedly read from disk.
  • Under certain circumstances it might be possible to reduce the amount of data written to the swap disk to near zero using zswap, but zswap is evidently not suited for this task.
  • It is not possible to have "completely compressed RAM" as the system needs a certain amount of non-swap pages to reside in RAM for operation.

Personal opinions and anecdotes:

  • The main improvement of zswap in terms of disk writes is not the fact that it compresses the pages but the fact that it has its own buffering & caching system that reduces the page cache and effectively keeps more anonymous pages (in compressed form) in RAM. (However, based on my subjective experience of daily Linux use, a system with swap and zswap at the default swappiness and max_pool_percent values always behaves better than one with any swappiness value and no zswap, or with zswap and high max_pool_percent values.)
  • Low swappiness values seem to make the system behave better until the amount of page cache left is so small that the system becomes unusable due to excessive disk reads. The same goes for too-high max_pool_percent values.
  • Either use solely zram swap and limit the amount of anonymous pages you need to hold in memory, or use disk-backed swap with zswap with approximately default values for swappiness and max_pool_percent.
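As a rough illustration of the first option, a zram-only swap setup could look like this (sizes and the compression algorithm are illustrative; most distributions ship a zram service that handles this more robustly):

```shell
# Sketch of a zram-only swap setup via the zram sysfs interface.
modprobe zram num_devices=1
echo lz4  > /sys/block/zram0/comp_algorithm   # if the kernel supports it
echo 512M > /sys/block/zram0/disksize
mkswap /dev/zram0
swapon -p 100 /dev/zram0   # higher priority than any disk-backed swap
```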

EDIT: Possible future work to answer the finer points of your question would be to find out, for your particular usecase, how the zsmalloc allocator used in zram compares compression-wise with the zbud allocator used in zswap. I'm not going to do that, though, just pointing out things to search for in docs/on the internet.

EDIT 2: echo "zsmalloc" > /sys/module/zswap/parameters/zpool switches zswap's allocator from zbud to zsmalloc. Continuing with my test fixture from the experiments above and comparing zram against zswap+zsmalloc, it seems that as long as the swap memory needed is the same size as either the zram swap or zswap's max_pool_percent, the amount of disk reads and writes is very similar between the two. My personal opinion based on these facts: as long as the amount of zram swap I need is smaller than the amount I can afford to actually keep in RAM, it is best to use zram alone; once I need more swap than I can actually keep in memory, it is best either to change my workload to avoid that, or to disable zram swap and use zswap with zsmalloc, setting max_pool_percent to the equivalent of what zram previously occupied in memory (size of zram * compression ratio). I currently don't have the time for a proper writeup of these additional tests, though.
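To compare the allocators on a live system, one can switch the pool and watch the runtime counters zswap exposes; a sketch (counter names from the kernel's zswap documentation; debugfs must be mounted, run as root):

```shell
# Switch zswap's pool allocator to zsmalloc at runtime, then dump the
# statistics zswap exposes via debugfs (stored_pages, pool_total_size,
# written_back_pages, ...).
echo zsmalloc > /sys/module/zswap/parameters/zpool
grep -H . /sys/module/zswap/parameters/*   # confirm zpool = zsmalloc
grep -H . /sys/kernel/debug/zswap/*        # compressed-pool statistics
```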
