Linux – prevent system freeze/unresponsiveness due to swapping runaway memory usage

Tags: linux, memory, oom, swap

If a process demands a lot of memory, the system moves all other processes to the swap file, including, it seems, necessary processes like the X11 server or the terminal.

So if a process keeps allocating without limit, everything becomes unresponsive until that process is killed by the OOM killer. My laptop seems to be especially sensitive and reacts extremely badly. I just spent an ENTIRE HOUR waiting for the process termination, during which not even the mouse cursor could be moved.

How can this be avoided?

1) Disable the swap => I often start a lot of processes that then become inactive. The inactive ones should be moved to the swap.

2) Get an SSD => too expensive

3) set a maximum memory ulimit => but then it fails in cases where a program needs a reasonably large amount of memory. The problem is not that it uses too much, but that it suppresses the other processes

4) keep important programs (X11, bash, kill, top, …) in memory and never swap those => can this be done? how? perhaps only swap large programs?

5) ?

Best Answer

TL;DR

Short-term / temporary answer

  • Easiest: Have a smaller swap partition, so the kernel doesn't try to live up to the lie that there is no memory limit by running processes from slow storage (a sketch of shrinking swap follows this list).
    • With a big swap, the OOM (out-of-memory) killer doesn't take action soon enough. Typically, it accounts according to virtual memory and, in my past experience, didn't kill things until the entire swap got filled up, hence the thrashing and crawling system...
  • Need a big swap for hibernate?
    • Attempted/problematic: Set some ulimits (e.g. check ulimit -v, and maybe set a hard or soft limit using the as option in limits.conf). This used to work well enough, but thanks to WebKit introducing gigacage, many gnome apps now expect unlimited address spaces and fail to run!
    • Attempted/problematic: The overcommit policy and ratio are another way to try to manage and mitigate this (e.g. sysctl vm.overcommit_memory, sysctl vm.overcommit_ratio), but this approach didn't work out for me.
    • Difficult/complicated: Try applying a cgroup priority to the most important processes (e.g. ssh), but this currently seems cumbersome for cgroup v1 (hopefully v2 will make it easier)...
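For the "smaller swap" route, a minimal sketch, assuming you currently swap to a large partition and are willing to switch to a deliberately small swap file instead (the size and path are only examples):

    # check how much swap exists and how much is in use
    swapon --show
    free -h

    # stop using the existing (large) swap for this session
    sudo swapoff -a

    # create and enable a small swap file instead, e.g. 2 GiB
    # (on some filesystems, e.g. btrfs, dd may be needed instead of fallocate)
    sudo fallocate -l 2G /swapfile
    sudo chmod 600 /swapfile
    sudo mkswap /swapfile
    sudo swapon /swapfile

    # make it permanent by replacing the old swap entry in /etc/fstab with:
    # /swapfile  none  swap  sw  0  0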

I also found:

  • another stack exchange entry that corroborates the above advice for smaller swap spaces.
  • you could try something like thrash-protect as a work-around for the current situation.

Longer term solution

Wait and hope for some upstream patches to get into stable distro kernels. Also hope that distro vendors better tune kernel defaults and better leverage systemd cgroups to prioritise GUI responsiveness in desktop editions.

Some patches of interest (both discussed further below): the OOM detection/handling tweaks landing in more recent kernels, and the "Make background writeback not suck" buffered writeback work.

So it's not just bad user space code and distro config/defaults that's at fault - the kernel could handle this better.

Comments on options already considered

1) Disable the swap

Providing at least a small swap partition is recommended (Do we really need swap on modern systems?). Disabling swap not only prevents swapping out unused pages, but it might also affect the kernel's default heuristic overcommit strategy for allocating memory (What does heuristics in Overcommit_memory =0 mean?), as that heuristic does count on swap pages. Without swap, overcommit can still probably work in the heuristic (0) or always (1) modes, but the combination of no swap and the never (2) overcommit strategy is likely a terrible idea. So in most cases, no swap will likely hurt performance.

E.g., think about a long-running process that initially touches memory for once-off work, but then fails to release that memory and keeps running in the background. The kernel will have to use RAM for that until the process ends. Without any swap, the kernel can't page it out for something else that actually wants to actively use RAM. Also think about how many devs are lazy and don't explicitly free up memory after use.

3) set a maximum memory ulimit

It only applies per process, and it's probably a reasonable assumption that a process shouldn't request more memory than a system physically has! So it's probably useful to stop a lone crazy process from triggering thrashing while still being generously set.

4) keep important programs (X11, bash, kill, top, ...) in memory and never swap those

Nice idea, but then those programs will hog memory they're not actively using. It may be acceptable if the program only requests a modest amount of memory.

The systemd 232 release has just added some options that make this possible: I think one could use 'MemorySwapMax=0' to prevent a unit (service) like ssh from having any of its memory swapped out.
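A minimal sketch of how that might look as a drop-in override, assuming systemd ≥ 232 with the cgroup v2 memory controller and that ssh.service is the unit you care about (the unit name and path are only examples):

    # create with: sudo systemctl edit ssh.service
    # which writes /etc/systemd/system/ssh.service.d/override.conf
    [Service]
    MemorySwapMax=0

    # then apply it:
    # sudo systemctl daemon-reload && sudo systemctl restart ssh.service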

Nonetheless, being able to prioritise memory access would be better.

Long explanation

The Linux kernel is more tuned for server workloads, so GUI responsiveness has sadly been a secondary concern... The kernel memory management settings on the Desktop edition of Ubuntu 16.04 LTS didn't seem to differ from the server editions. They even match the defaults in RHEL/CentOS 7.2, typically used as a server.

OOM, ulimit and trading off integrity for responsiveness

Swap thrashing (when the working set of memory, i.e. the pages being read from and written to in a given short time-frame, exceeds the physical RAM) will always lock up storage I/O - no kernel wizardry can save a system from this without killing a process or two...

I'm hoping the Linux OOM tweaks coming along in more recent kernels recognise this "working set exceeds physical memory" situation and kill a process. When they don't, the thrashing problem happens. The problem is, with a big swap partition, it can look as if the system still has headroom while the kernel merrily overcommits and still serves up memory requests, but the working set could spill over into swap, effectively trying to treat storage as if it's RAM.

On servers, the performance penalty of thrashing is accepted as a determined, slow, don't-lose-data trade-off. On desktops, the trade-off is different, and users would prefer a bit of data loss (process sacrifice) to keep things responsive.

This was a nice comical analogy about OOM: oom_pardon, aka don't kill my xlock

Incidentally, OOMScoreAdjust is another systemd option that helps weight processes so the OOM killer avoids those considered more important.
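A hedged sketch of a drop-in that biases the OOM killer away from sshd (the unit name and the value -500 are only illustrative; the valid range is -1000 to 1000):

    # /etc/systemd/system/ssh.service.d/oom.conf
    [Service]
    OOMScoreAdjust=-500

OOMScoreAdjust= maps onto /proc/<pid>/oom_score_adj, so a negative value makes the OOM killer pick other processes first.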

buffered writeback

I think "Make background writeback not suck" will help avoid some issues where a process hogging RAM causes another swap out (write to disk) and the bulk write to disk stalls anything else wanting IO. It's not the cause thrashing problem itself, but it does add to the overall degradation in responsiveness.

ulimits limitation

One problem with ulimits is that the accounting and limit apply to the virtual memory address space (which implies combining both physical and swap space). As per man limits.conf:

       rss
          maximum resident set size (KB) (Ignored in Linux 2.4.30 and
          higher)

So setting a ulimit to apply just to physical RAM usage doesn't look usable anymore. Hence

      as
          address space limit (KB)

seems to be the only respected tunable.

Unfortunately, as detailed further in the WebKit/Gnome example below, some applications can't run if virtual address space allocation is limited.

cgroups should help in future?

Currently, it seems cumbersome, but possible, to enable some kernel cgroup flags cgroup_enable=memory swapaccount=1 (e.g. in the grub config) and then try to use the cgroup memory controller to limit memory use.
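For reference, a sketch of setting those flags on a GRUB-based distro, assuming /etc/default/grub and update-grub as on Ubuntu (adjust for your boot loader; the rest of the command line is just an example):

    # /etc/default/grub
    GRUB_CMDLINE_LINUX_DEFAULT="quiet splash cgroup_enable=memory swapaccount=1"

    # then regenerate the grub config and reboot:
    # sudo update-grub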

cgroups have more advanced memory limit features than the 'ulimit' options. CGroup v2 notes hint at attempts to improve on how ulimits worked.

The combined memory+swap accounting and limiting is replaced by real control over swap space.

CGroup options can be set via systemd resource control options. E.g.:

  • MemoryHigh
  • MemoryMax

Other useful options might be (a combined sketch follows this list):

  • IOWeight
  • CPUShares
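
A hedged sketch of how these could be combined in a drop-in for a hypothetical memory-hungry service (the unit name and values are illustrative only; MemoryHigh=/MemoryMax= and IOWeight= need the cgroup v2 controllers, and option availability varies with the systemd version):

    # /etc/systemd/system/hungry-app.service.d/limits.conf   (hypothetical unit)
    [Service]
    MemoryHigh=4G     # throttle and reclaim aggressively above this
    MemoryMax=6G      # hard cap; beyond this the OOM killer steps in
    IOWeight=100      # relative IO priority (default 100, range 1-10000)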

These have some drawbacks:

  1. Overhead. Current docker documentation briefly mentions 1% extra memory use and 10% performance degradation (probably with regard to memory allocation operations - it doesn't really specify).
  2. Cgroup/systemd stuff has been heavily re-worked recently, so the flux upstream implies Linux distro vendors might be waiting for it to settle first.

In CGroup v2, they suggest that memory.high should be a good option to throttle and manage memory use by a process group. However, this quote suggests that monitoring memory pressure situations needed more work (as of 2015).

A measure of memory pressure - how much the workload is being impacted due to lack of memory - is necessary to determine whether a workload needs more memory; unfortunately, memory pressure monitoring mechanism isn't implemented yet.

Given systemd and cgroup user space tools are complex, I haven't found a simple way to set something appropriate and leverage this further. The cgroup and systemd documentation for Ubuntu isn't great. Future work should be for distros with desktop editions to leverage cgroups and systemd so that under high memory pressure, ssh and the X-Server/window manager components get higher priority access to CPU, physical RAM and storage IO, to avoid competing with the processes busy swapping. The kernel's CPU and I/O priority features have been around for a while. It seems to be priority access to physical RAM that's lacking.

However, not even CPU and IO priorities are appropriately set!? When I checked the systemd cgroup limits, CPU shares, etc. that were applied, as far as I could tell, Ubuntu hadn't baked in any pre-defined prioritisations. E.g. I ran:

systemctl show dev-mapper-Ubuntu\x2dswap.swap

I compared that to the same output for ssh, samba, gdm and nginx. Important things like the GUI and remote admin console have to fight equally with all other processes when thrashing happens.
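A hedged way to make that comparison in one go (unit and property names vary between distros and systemd versions, so some fields may come back empty or unset):

    for u in ssh.service smbd.service gdm.service nginx.service; do
        echo "== $u"
        systemctl show "$u" -p CPUShares -p BlockIOWeight -p MemoryLimit
    done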

Example memory limits I have on a 16GB RAM system

I wanted to enable hibernate, so I needed a big swap partition. Hence attempting to mitigate with ulimits, etc.

ulimit

I put * hard as 16777216 in /etc/security/limits.d/mem.conf such that no single process would be allowed to request more memory than is physically possible. It won't prevent thrashing altogether, but without it, just a single process with greedy memory use, or a memory leak, can cause thrashing. E.g. I've seen gnome-contacts suck up 8GB+ of memory when doing mundane things like updating the global address list from an exchange server...
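Concretely, a sketch of that limits file (the 'as' value is in KiB, so 16777216 ≈ 16 GiB; adjust to your own RAM size):

    # /etc/security/limits.d/mem.conf
    # hard-cap each process's address space ('as') at 16 GiB
    *   hard   as   16777216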

[Screenshot: Gnome contacts chewing up RAM]

As seen with ulimit -S -v, many distros have this hard and soft limit set as 'unlimited' given, in theory, a process could end up requesting lots of memory but only actively using a subset, and run happily thinking it's been given say 24GB of RAM while the system only has 16GB. The above hard limit will cause processes that might have been able to run fine to abort when the kernel denies their greedy speculative memory requests.

However, it also catches insane things like gnome contacts, and instead of losing my desktop responsiveness, I get a "not enough free memory" error:

[Screenshot: the "not enough free memory" error dialog]

Complications setting ulimit for address space (virtual memory)

Unfortunately, some developers like to pretend virtual memory is an infinite resource, and setting a ulimit on virtual memory can break some apps. E.g. WebKit (which some gnome apps depend on) added a gigacage security feature which tries to allocate insane amounts of virtual memory, and FATAL: Could not allocate gigacage memory errors with a cheeky hint Make sure you have not set a virtual memory limit happen. The work-around, GIGACAGE_ENABLED=no, forgoes the security benefits, but likewise, not being allowed to limit virtual memory allocation also forgoes a security feature (e.g. resource control that can prevent denial of service). Ironically, between gigacage and gnome devs, they seem to forget that limiting memory allocation is itself a security control. And sadly, I noticed the gnome apps that rely on gigacage don't bother to explicitly request a higher limit, so even a soft limit breaks things in this case.
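If you do hit that and accept the trade-off, a hedged one-off workaround is to launch just the affected app with gigacage disabled (epiphany here is only an example of a WebKitGTK-based app):

    # disables the gigacage mitigation for this one invocation
    GIGACAGE_ENABLED=no epiphany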

To be fair, if the kernel did a better job of being able to deny memory allocation based on resident memory use instead of virtual memory, then pretending virtual memory is unlimited would be less dangerous.

overcommit

If you prefer applications to be denied memory allocations and want to stop overcommitting, use the commands below to test how your system behaves when under high memory pressure.

In my case, the default commit ratio was:

$ sysctl vm.overcommit_ratio
vm.overcommit_ratio = 50

But it only comes into full effect when changing the policy to disable overcommitting and apply the ratio:

sudo sysctl -w vm.overcommit_memory=2

The ratio implied only 24GB of memory could be allocated overall (16GB RAM*0.5 + 16GB SWAP). So I'd probably never see OOM show up, and effectively be less likely to have processes constantly access memory in swap. But I'll also likely sacrifice overall system efficiency.
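You can sanity-check that arithmetic against what the kernel reports: CommitLimit is the cap derived from swap plus the ratio, and Committed_AS is how much has currently been promised. The values below are only illustrative for a 16GB RAM + 16GB swap machine:

    $ grep -i commit /proc/meminfo
    CommitLimit:    25165824 kB
    Committed_AS:    9181724 kB

Here 25165824 kB is roughly 24 GiB (16 GiB of swap plus 50% of 16 GiB RAM); the Committed_AS figure is just an example snapshot and will differ on your system.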

This will cause many applications to crash, given it's common for devs to not gracefully handle the OS declining a memory allocation request. It trades off the occasional risk of a drawn-out lockup due to thrashing (lose all your work after a hard reset) for a more frequent risk of various apps crashing. In my testing, it didn't help much because the desktop itself crashed when the system was under memory pressure and it couldn't allocate memory. However, at least consoles and SSH still worked.

How does VM overcommit memory work has more info.

I chose to revert to default for this, sudo sysctl -w vm.overcommit_memory=0, given the whole desktop graphical stack and the applications in it crash nonetheless.
