Linux – Is “Cached” memory de-facto free?

Tags: cache, linux, meminfo, memory

When running cat /proc/meminfo, you get these 3 values at the top:

MemTotal:        6291456 kB
MemFree:         4038976 kB
Cached:          1477948 kB

As far as I know, the "Cached" value is disk caches made by the Linux system that will be freed immediately if any application needs more RAM, thus Linux will never run out of memory until both MemFree and Cached are at zero.

Unfortunately, "MemAvailable" is not reported by /proc/meminfo, probably because the system is running on a virtual server. (Kernel version is 4.4.)

Thus for all practical purposes, the RAM available for applications is MemFree + Cached.

Is that view correct?

Best Answer

That view can be very misleading in a number of real-world cases.

The kernel now provides an estimate for available memory, in the MemAvailable field. This value is significantly different from MemFree + Cached.
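
You can compare the two numbers on any system that reports MemAvailable; this is just a quick awk sketch over /proc/meminfo:

awk '{ v[$1] = $2 }
     END {
         printf "MemFree + Cached: %d kB\n", v["MemFree:"] + v["Cached:"]
         printf "MemAvailable:     %d kB\n", v["MemAvailable:"]
     }' /proc/meminfo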

/proc/meminfo: provide estimated available memory [kernel change description, 2014]

Many load balancing and workload placing programs check /proc/meminfo to estimate how much free memory is available. They generally do this by adding up "free" and "cached", which was fine ten years ago, but is pretty much guaranteed to be wrong today.

It is wrong because Cached includes memory that is not freeable as page cache, for example shared memory segments, tmpfs, and ramfs, and it does not include reclaimable slab memory, which can take up a large fraction of system memory on mostly idle systems with lots of files.

Currently, the amount of memory that is available for a new workload, without pushing the system into swap, can be estimated from MemFree, Active(file), Inactive(file), and SReclaimable, as well as the "low" watermarks from /proc/zoneinfo. However, this may change in the future, and user space really should not be expected to know kernel internals to come up with an estimate for the amount of free memory. It is more convenient to provide such an estimate in /proc/meminfo. If things change in the future, we only have to change it in one place.
...

Documentation/filesystems/proc.txt:
...
MemAvailable: An estimate of how much memory is available for starting new applications, without swapping. Calculated from MemFree, SReclaimable, the size of the file LRU lists, and the low watermarks in each zone. The estimate takes into account that the system needs some page cache to function well, and that not all reclaimable slab will be reclaimable, due to items being in use. The impact of those factors will vary from system to system.

1. MemAvailable details

As it says above, tmpfs and other Shmem memory cannot be freed, only moved to swap. Cached in /proc/meminfo can be very misleading, due to including this swappable Shmem memory. If you have too many files in a tmpfs, it could be occupying a lot of your memory :-). Shmem can also include some graphics memory allocations, which could be very large.
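
To see how much of this you have right now, Shmem is reported directly in /proc/meminfo, and df can list the tmpfs mounts (the exact mount points, e.g. /dev/shm, /run and /tmp, vary between distributions):

grep Shmem: /proc/meminfo    # shared memory, counted inside "Cached"
df -h -t tmpfs               # per-mount tmpfs usage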

MemAvailable deliberately does not include swappable memory. Swapping too much can cause long delays. You might even have chosen to run without swap space, or allowed only a relatively limited amount.

I had to double-check how MemAvailable works. At first glance, the code (si_mem_available() in mm/page_alloc.c) did not seem to mention this distinction:

/*
 * Not all the page cache can be freed, otherwise the system will
 * start swapping. Assume at least half of the page cache, or the
 * low watermark worth of cache, needs to stay.
 */
pagecache = pages[LRU_ACTIVE_FILE] + pages[LRU_INACTIVE_FILE];
pagecache -= min(pagecache / 2, wmark_low);
available += pagecache;

However, I found it correctly treats Shmem as "used" memory. I created several 1GB files in a tmpfs. Each 1GB increase in Shmem reduced MemAvailable by 1GB. So the size of the "file LRU lists" does not include shared memory or any other swappable memory. (I noticed these same page counts are also used in the code that calculates the "dirty limit").
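
If you want to reproduce that experiment, something like this works (it assumes /dev/shm is a tmpfs with at least 1GB free; the file name is just an example, and you should remove it afterwards):

grep -E 'MemAvailable:|Cached:|Shmem:' /proc/meminfo     # before
dd if=/dev/zero of=/dev/shm/testfile bs=1M count=1024    # put 1GB into tmpfs
grep -E 'MemAvailable:|Cached:|Shmem:' /proc/meminfo     # Shmem and Cached rise by ~1GB,
                                                         # MemAvailable drops by ~1GB
rm /dev/shm/testfile                                     # clean up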

This MemAvailable calculation also assumes that you want to keep at least enough file cache to equal the kernel's "low watermark". Or half of the current cache - whichever is smaller. (It makes the same assumption for reclaimable slabs as well). The kernel's "low watermark" can be tuned, but it is usually around 2% of system RAM. So if you only want a rough estimate, you can ignore this part :-).
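
If you want to redo this arithmetic from user space, the sketch below follows the same shape as the kernel code quoted above; treat it as a rough approximation only, since the exact formula is a kernel internal and has changed over time:

# Sum the per-zone "low" watermarks from /proc/zoneinfo (assumes 4 kB pages)
low_kb=$(awk '$1 == "low" { low += $2 } END { print low * 4 }' /proc/zoneinfo)

awk -v wmark_low="$low_kb" '
    function min(a, b) { return a < b ? a : b }
    { v[$1] = $2 }
    END {
        pagecache = v["Active(file):"] + v["Inactive(file):"]
        pagecache -= min(pagecache / 2, wmark_low)
        slab = v["SReclaimable:"] - min(v["SReclaimable:"] / 2, wmark_low)
        estimate = v["MemFree:"] - wmark_low + pagecache + slab
        printf "rough estimate: %d kB\n", estimate
        printf "MemAvailable:   %d kB\n", v["MemAvailable:"]
    }' /proc/meminfo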

When you are running firefox with around 100MB of program code mapped in the page cache, you generally want to keep that 100MB in RAM :-). Otherwise, at best you will suffer delays, at worst the system will spend all its time thrashing between different applications. So MemAvailable is allowing a small percentage of RAM for this. It might not allow enough, or it might be over-generous. "The impact of those factors will vary from system to system".

For many PC workloads, the point about "lots of files" might not be relevant. Even so, I currently have 500MB reclaimable slab memory on my laptop (out of 8GB of RAM). This is due to ext4_inode_cache (over 300K objects). It happened because I recently had to scan the whole filesystem, to find what was using my disk space :-). I used the command du -x / | sort -n, but e.g. Gnome Disk Usage Analyzer would do the same thing.
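
You can check this on your own system: SReclaimable is in /proc/meminfo, and slabtop shows which caches are responsible (ext4_inode_cache and dentry are the usual suspects after a big filesystem scan):

grep SReclaimable: /proc/meminfo
sudo slabtop -o -s c | head -15    # one-shot listing, sorted by cache size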

2. [edit] Memory in control groups

So-called "Linux containers" are built up from namespaces, cgroups, and various other features according to taste :-). They may provide a convincing enough environment to run something almost like a full Linux system. Hosting services can build containers like this and sell them as "virtual servers" :-).

Hosting services may also build "virtual servers" using features which are not in mainline Linux. OpenVZ containers pre-date mainline cgroups by two years, and may use "beancounters" to limit memory. So you cannot understand exactly how those memory limits work if you only read documents or ask questions about the mainline Linux kernel. cat /proc/user_beancounters shows current usage and limits. vzubc presents it in a slightly more friendly format. The main page on beancounters documents the row names.

Control groups include the ability to set memory limits on the processes inside them. If you run your application inside such a cgroup, then not all of the system memory will be available to the application :-). So, how can we see the available memory in this case?

The interface for this differs in a number of ways, depending on whether you use cgroup-v1 or cgroup-v2.

My laptop install uses cgroup-v1. I can run cat /sys/fs/cgroup/memory/memory.stat. The file shows various fields including total_rss, total_cache, total_shmem. shmem, including tmpfs, counts towards the memory limits. I guess you can look at total_rss as an inverse equivalent of MemFree. And there is also the file memory.kmem.usage_in_bytes, representing kernel memory including slabs. (I assume memory.kmem. also includes memory.kmem.tcp. and any future extensions, although this is not documented explicitly). There are not separate counters to view reclaimable slab memory. The document for cgroup-v1 says hitting the memory limits does not trigger reclaim of any slab memory. (The document also has a disclaimer that it is "hopelessly outdated", and that you should check the current source code).
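
So inside a v1 memory cgroup, the nearest thing to "free memory" is the limit minus the current usage. A minimal sketch, assuming the v1 memory controller is mounted at /sys/fs/cgroup/memory (at the root cgroup the limit is effectively "unlimited", i.e. a huge number):

cg=/sys/fs/cgroup/memory             # or the sub-cgroup your application runs in
limit=$(cat "$cg/memory.limit_in_bytes")
usage=$(cat "$cg/memory.usage_in_bytes")
echo "headroom: $(( (limit - usage) / 1024 )) kB"    # no correction for reclaimable memory
grep -E '^total_(rss|cache|shmem) ' "$cg/memory.stat"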

cgroup-v2 is different. I think the root (top-level) cgroup doesn't support memory accounting. cgroup-v2 still has a memory.stat file. All the fields sum over child cgroups, so you don't need to look for total_... fields. There is a file field, which means the same thing that cache did. Annoyingly, I don't see an overall field like rss inside memory.stat; I guess you would have to add up individual fields. There are separate stats for reclaimable and unreclaimable slab memory; I think a v2 cgroup is designed to reclaim slabs when it starts to run low on memory.
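
For comparison, here is a similar look inside a v2 cgroup (a sketch: it finds the cgroup the shell is running in, and the memory.* files only exist if the memory controller is enabled for that cgroup):

# Locate this shell's cgroup on the unified (v2) hierarchy
cg=/sys/fs/cgroup$(awk -F: '$1 == "0" { print $3 }' /proc/self/cgroup)
cat "$cg/memory.current"    # total usage, in bytes
cat "$cg/memory.max"        # the limit, or the string "max" if unlimited
grep -E '^(anon|file|slab_reclaimable|slab_unreclaimable) ' "$cg/memory.stat"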

Linux cgroups do not automatically virtualize /proc/meminfo (or any other file in /proc), so that would show the values for the entire machine. This would confuse VPS customers. However it is possible to use namespaces to replace /proc/meminfo with a file faked up by the specific container software. How useful the fake values are, would depend on what that specific software does.

systemd believes cgroup-v1 cannot be securely delegated e.g. to containers. I looked inside a systemd-nspawn container on my cgroup-v1 system. I can see the cgroup it has been placed inside, and the memory accounting on that. On the other hand the contained systemd does not set up the usual per-service cgroups for resource accounting. If memory accounting was not enabled inside this cgroup, I assume the container would not be able to enable it.

I assume if you're inside a cgroup-v2 container, it will look different to the root of a real cgroup-v2 system, and you will be able to see memory accounting for its top-level cgroup. Or if the cgroup you can see does not have memory accounting enabled, hopefully you will be delegated permission so you can enable memory accounting in systemd (or equivalent).
