Linux – Is “Cached” memory de-facto free?

Tags: cache, linux, meminfo, memory

When running cat /proc/meminfo, you get these 3 values at the top:

MemTotal:        6291456 kB
MemFree:         4038976 kB
Cached:          1477948 kB

As far as I know, the "Cached" value is disk caches made by the Linux system that will be freed immediately if any application needs more RAM, thus Linux will never run out of memory until both MemFree and Cached are at zero.

Unfortunately, "MemAvailable" is not reported by /proc/meminfo, probably because the system is running on a virtual server. (Kernel version is 4.4.)

Thus for all practical purposes, the RAM available for applications is MemFree + Cached.

Is that view correct?

Best Answer

That view can be very misleading in a number of real-world cases.

The kernel now provides an estimate for available memory, in the MemAvailable field. This value is significantly different from MemFree + Cached.
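
You can compare the two numbers on any system that reports MemAvailable; this is just a quick awk sketch over /proc/meminfo:

awk '{ v[$1] = $2 }
     END {
         printf "MemFree + Cached: %d kB\n", v["MemFree:"] + v["Cached:"]
         printf "MemAvailable:     %d kB\n", v["MemAvailable:"]
     }' /proc/meminfo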

/proc/meminfo: provide estimated available memory [kernel change description, 2014]

Many load balancing and workload placing programs check /proc/meminfo to estimate how much free memory is available. They generally do this by adding up "free" and "cached", which was fine ten years ago, but is pretty much guaranteed to be wrong today.

It is wrong because Cached includes memory that is not freeable as page cache, for example shared memory segments, tmpfs, and ramfs, and it does not include reclaimable slab memory, which can take up a large fraction of system memory on mostly idle systems with lots of files.

Currently, the amount of memory that is available for a new workload, without pushing the system into swap, can be estimated from MemFree, Active(file), Inactive(file), and SReclaimable, as well as the "low" watermarks from /proc/zoneinfo. However, this may change in the future, and user space really should not be expected to know kernel internals to come up with an estimate for the amount of free memory. It is more convenient to provide such an estimate in /proc/meminfo. If things change in the future, we only have to change it in one place.
...

Documentation/filesystems/proc.txt:
...
MemAvailable: An estimate of how much memory is available for starting new applications, without swapping. Calculated from MemFree, SReclaimable, the size of the file LRU lists, and the low watermarks in each zone. The estimate takes into account that the system needs some page cache to function well, and that not all reclaimable slab will be reclaimable, due to items being in use. The impact of those factors will vary from system to system.

1. MemAvailable details

As it says above, tmpfs and other Shmem memory cannot be freed, only moved to swap. Cached in /proc/meminfo can be very misleading, due to including this swappable Shmem memory. If you have too many files in a tmpfs, it could be occupying a lot of your memory :-). Shmem can also include some graphics memory allocations, which could be very large.
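
To see how much of this you have right now, Shmem is reported directly in /proc/meminfo, and df can list the tmpfs mounts (the exact mount points, e.g. /dev/shm, /run and /tmp, vary between distributions):

grep Shmem: /proc/meminfo    # shared memory, counted inside "Cached"
df -h -t tmpfs               # per-mount tmpfs usage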

MemAvailable deliberately does not include swappable memory. Swapping too much can cause long delays. You might even have chosen to run without swap space, or allowed only a relatively limited amount.

I had to double-check how MemAvailable works. At first glance, the code (si_mem_available() in mm/page_alloc.c) did not seem to mention this distinction:

/*
 * Not all the page cache can be freed, otherwise the system will
 * start swapping. Assume at least half of the page cache, or the
 * low watermark worth of cache, needs to stay.
 */
pagecache = pages[LRU_ACTIVE_FILE] + pages[LRU_INACTIVE_FILE];
pagecache -= min(pagecache / 2, wmark_low);
available += pagecache;

However, I found it correctly treats Shmem as "used" memory. I created several 1GB files in a tmpfs. Each 1GB increase in Shmem reduced MemAvailable by 1GB. So the size of the "file LRU lists" does not include shared memory or any other swappable memory. (I noticed these same page counts are also used in the code that calculates the "dirty limit").
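
If you want to reproduce that experiment, something like this works (it assumes /dev/shm is a tmpfs with at least 1GB free; the file name is just an example, and you should remove it afterwards):

grep -E 'MemAvailable:|Cached:|Shmem:' /proc/meminfo     # before
dd if=/dev/zero of=/dev/shm/testfile bs=1M count=1024    # put 1GB into tmpfs
grep -E 'MemAvailable:|Cached:|Shmem:' /proc/meminfo     # Shmem and Cached rise by ~1GB,
                                                         # MemAvailable drops by ~1GB
rm /dev/shm/testfile                                     # clean up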

This MemAvailable calculation also assumes that you want to keep at least enough file cache to equal the kernel's "low watermark". Or half of the current cache - whichever is smaller. (It makes the same assumption for reclaimable slabs as well). The kernel's "low watermark" can be tuned, but it is usually around 2% of system RAM. So if you only want a rough estimate, you can ignore this part :-).
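
If you want to redo this arithmetic from user space, the sketch below follows the same shape as the kernel code quoted above; treat it as a rough approximation only, since the exact formula is a kernel internal and has changed over time:

# Sum the per-zone "low" watermarks from /proc/zoneinfo (assumes 4 kB pages)
low_kb=$(awk '$1 == "low" { low += $2 } END { print low * 4 }' /proc/zoneinfo)

awk -v wmark_low="$low_kb" '
    function min(a, b) { return a < b ? a : b }
    { v[$1] = $2 }
    END {
        pagecache = v["Active(file):"] + v["Inactive(file):"]
        pagecache -= min(pagecache / 2, wmark_low)
        slab = v["SReclaimable:"] - min(v["SReclaimable:"] / 2, wmark_low)
        estimate = v["MemFree:"] - wmark_low + pagecache + slab
        printf "rough estimate: %d kB\n", estimate
        printf "MemAvailable:   %d kB\n", v["MemAvailable:"]
    }' /proc/meminfo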

When you are running firefox with around 100MB of program code mapped in the page cache, you generally want to keep that 100MB in RAM :-). Otherwise, at best you will suffer delays, at worst the system will spend all its time thrashing between different applications. So MemAvailable is allowing a small percentage of RAM for this. It might not allow enough, or it might be over-generous. "The impact of those factors will vary from system to system".

For many PC workloads, the point about "lots of files" might not be relevant. Even so, I currently have 500MB reclaimable slab memory on my laptop (out of 8GB of RAM). This is due to ext4_inode_cache (over 300K objects). It happened because I recently had to scan the whole filesystem, to find what was using my disk space :-). I used the command du -x / | sort -n, but e.g. Gnome Disk Usage Analyzer would do the same thing.
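
You can check this on your own system: SReclaimable is in /proc/meminfo, and slabtop shows which caches are responsible (ext4_inode_cache and dentry are the usual suspects after a big filesystem scan):

grep SReclaimable: /proc/meminfo
sudo slabtop -o -s c | head -15    # one-shot listing, sorted by cache size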

2. [edit] Memory in control groups

So-called "Linux containers" are built up from namespaces, cgroups, and various other features according to taste :-). They may provide a convincing enough environment to run something almost like a full Linux system. Hosting services can build containers like this and sell them as "virtual servers" :-).

Hosting services may also build "virtual servers" using features which are not in mainline Linux. OpenVZ containers pre-date mainline cgroups by two years, and may use "beancounters" to limit memory. So you cannot understand exactly how those memory limits work if you only read documents or ask questions about the mainline Linux kernel. cat /proc/user_beancounters shows current usage and limits. vzubc presents it in a slightly more friendly format. The main page on beancounters documents the row names.

Control groups include the ability to set memory limits on the processes inside them. If you run your application inside such a cgroup, then not all of the system memory will be available to the application :-). So, how can we see the available memory in this case?

The interface for this differs in a number of ways, depending on whether you use cgroup-v1 or cgroup-v2.

My laptop install uses cgroup-v1. I can run cat /sys/fs/cgroup/memory/memory.stat. The file shows various fields including total_rss, total_cache, total_shmem. shmem, including tmpfs, counts towards the memory limits. I guess you can look at total_rss as an inverse equivalent of MemFree. And there is also the file memory.kmem.usage_in_bytes, representing kernel memory including slabs. (I assume memory.kmem. also includes memory.kmem.tcp. and any future extensions, although this is not documented explicitly). There are not separate counters to view reclaimable slab memory. The document for cgroup-v1 says hitting the memory limits does not trigger reclaim of any slab memory. (The document also has a disclaimer that it is "hopelessly outdated", and that you should check the current source code).
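
So inside a v1 memory cgroup, the nearest thing to "free memory" is the limit minus the current usage. A minimal sketch, assuming the v1 memory controller is mounted at /sys/fs/cgroup/memory (at the root cgroup the limit is effectively "unlimited", i.e. a huge number):

cg=/sys/fs/cgroup/memory             # or the sub-cgroup your application runs in
limit=$(cat "$cg/memory.limit_in_bytes")
usage=$(cat "$cg/memory.usage_in_bytes")
echo "headroom: $(( (limit - usage) / 1024 )) kB"    # no correction for reclaimable memory
grep -E '^total_(rss|cache|shmem) ' "$cg/memory.stat"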

cgroup-v2 is different. I think the root (top-level) cgroup doesn't support memory accounting. cgroup-v2 still has a memory.stat file. All the fields sum over child cgroups, so you don't need to look for total_... fields. There is a file field, which means the same thing that cache did. Annoyingly, I don't see an overall field like rss inside memory.stat; I guess you would have to add up individual fields. There are separate stats for reclaimable and unreclaimable slab memory; I think a v2 cgroup is designed to reclaim slabs when it starts to run low on memory.
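
For comparison, here is a similar look inside a v2 cgroup (a sketch: it finds the cgroup the shell is running in, and the memory.* files only exist if the memory controller is enabled for that cgroup):

# Locate this shell's cgroup on the unified (v2) hierarchy
cg=/sys/fs/cgroup$(awk -F: '$1 == "0" { print $3 }' /proc/self/cgroup)
cat "$cg/memory.current"    # total usage, in bytes
cat "$cg/memory.max"        # the limit, or the string "max" if unlimited
grep -E '^(anon|file|slab_reclaimable|slab_unreclaimable) ' "$cg/memory.stat"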

Linux cgroups do not automatically virtualize /proc/meminfo (or any other file in /proc), so that would show the values for the entire machine. This would confuse VPS customers. However it is possible to use namespaces to replace /proc/meminfo with a file faked up by the specific container software. How useful the fake values are, would depend on what that specific software does.

systemd believes cgroup-v1 cannot be securely delegated e.g. to containers. I looked inside a systemd-nspawn container on my cgroup-v1 system. I can see the cgroup it has been placed inside, and the memory accounting on that. On the other hand the contained systemd does not set up the usual per-service cgroups for resource accounting. If memory accounting was not enabled inside this cgroup, I assume the container would not be able to enable it.

I assume if you're inside a cgroup-v2 container, it will look different to the root of a real cgroup-v2 system, and you will be able to see memory accounting for its top-level cgroup. Or if the cgroup you can see does not have memory accounting enabled, hopefully you will be delegated permission so you can enable memory accounting in systemd (or equivalent).
