Linux – Can Linux “run out of RAM”?

kernel, linux, ram, swap, virtual-memory

I saw several posts around the web of people apparently complaining about a hosted VPS unexpectedly killing processes because they used too much RAM.

How is this possible? I thought all modern OSes provide "infinite RAM" by just using disk swap for whatever goes over the physical RAM. Is this correct?

What might be happening if a process is "killed due to low RAM"?

Best Answer

What might be happening if a process is "killed due to low RAM"?

It's sometimes said that Linux by default never denies requests for more memory from application code -- e.g. malloc().1 This is not in fact true; the default uses a heuristic whereby

Obvious overcommits of address space are refused. Used for a typical system. It ensures a seriously wild allocation fails while allowing overcommit to reduce swap usage.

From [linux_src]/Documentation/vm/overcommit-accounting (all quotes are from the 3.11 tree). Exactly what counts as a "seriously wild allocation" isn't made explicit, so we would have to go through the source to determine the details. We could also use the experimental method in footnote 2 (below) to try and get some reflection of the heuristic -- based on that, my initial empirical observation is that under ideal circumstances (== the system is idle), if you don't have any swap, you'll be allowed to allocate about half your RAM, and if you do have swap, you'll get about half your RAM plus all of your swap. That is more or less per process (but note this limit is dynamic and subject to change because of state, see some observations in footnote 5).
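You can check which mode a system is currently in without changing anything; 0 is the heuristic default just described:

> sysctl vm.overcommit_memory
vm.overcommit_memory = 0
> cat /proc/sys/vm/overcommit_memory
0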

Half your RAM plus swap is explicitly the default for the "CommitLimit" field in /proc/meminfo. Here's what it means -- and note it actually has nothing to do with the limit just discussed (from [src]/Documentation/filesystems/proc.txt):

CommitLimit: Based on the overcommit ratio ('vm.overcommit_ratio'), this is the total amount of memory currently available to be allocated on the system. This limit is only adhered to if strict overcommit accounting is enabled (mode 2 in 'vm.overcommit_memory'). The CommitLimit is calculated with the following formula:

CommitLimit = ('vm.overcommit_ratio' * Physical RAM) + Swap

For example, on a system with 1G of physical RAM and 7G of swap with a 'vm.overcommit_ratio' of 30 it would yield a CommitLimit of 7.3G.

The previously quoted overcommit-accounting doc states that the default vm.overcommit_ratio is 50. So if you sysctl vm.overcommit_memory=2, you can then adjust vm.overcommit_ratio (with sysctl) and see the consequences.3 The default mode, when CommitLimit is not enforced and only "obvious overcommits of address space are refused", is when vm.overcommit_memory=0.
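To make that concrete, here is what such a session might look like on a hypothetical box with 4 GB of RAM and 6 GB of swap (the numbers are illustrative; yours will differ). With the ratio at its default of 50, CommitLimit is half the RAM plus the swap; raising the ratio to 100 adds the other half of the RAM:

> sysctl vm.overcommit_memory=2
vm.overcommit_memory = 2
> sysctl vm.overcommit_ratio
vm.overcommit_ratio = 50
> grep CommitLimit /proc/meminfo
CommitLimit:     8388608 kB
> sysctl vm.overcommit_ratio=100
vm.overcommit_ratio = 100
> grep CommitLimit /proc/meminfo
CommitLimit:    10485760 kB

Setting vm.overcommit_memory back to 0 (and the ratio back to 50) afterwards restores the default behaviour.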

While the default strategy does have a heuristic per-process limit preventing the "seriously wild allocation", it does leave the system as a whole free to get seriously wild, allocation-wise.4 This means at some point it can run out of memory and have to declare bankruptcy to some process(es) via the OOM killer.

What does the OOM killer kill? Not necessarily the process that asked for memory when there was none, since that's not necessarily the truly guilty process, and more importantly, not necessarily the one that will most quickly get the system out of the problem it is in.

This is cited from here which probably cites a 2.6.x source:

/*
 * oom_badness - calculate a numeric value for how bad this task has been
 *
 * The formula used is relatively simple and documented inline in the
 * function. The main rationale is that we want to select a good task
 * to kill when we run out of memory.
 *
 * Good in this context means that:
 * 1) we lose the minimum amount of work done
 * 2) we recover a large amount of memory
 * 3) we don't kill anything innocent of eating tons of memory
 * 4) we want to kill the minimum amount of processes (one)
 * 5) we try to kill the process the user expects us to kill, this
 *    algorithm has been meticulously tuned to meet the principle
 *    of least surprise ... (be careful when you change it)
 */

Which seems like a decent rationale. However, without getting forensic, #5 (which is redundant of #1) seems like a tough sell implementation-wise, and #3 is redundant of #2. So it might make sense to consider this pared down to #2/3 and #4.

I grepped through a recent source (3.11) and noticed that this comment has changed in the interim:

/**
 * oom_badness - heuristic function to determine which candidate task to kill
 *
 * The heuristic for determining which task to kill is made to be as simple and
 * predictable as possible.  The goal is to return the highest value for the
 * task consuming the most memory to avoid subsequent oom failures.
 */

This is a little more explicit about #2: "The goal is to [kill] the task consuming the most memory to avoid subsequent oom failures," and, by implication, #4 ("we want to kill the minimum amount of processes (one)").
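Incidentally, the kernel exposes the current result of this calculation for every live process as /proc/[pid]/oom_score, along with a user-adjustable bias, /proc/[pid]/oom_score_adj (range -1000 to 1000, where -1000 effectively exempts the process from OOM killing). So you can peek at who the current favourite victim would be, e.g. (your numbers may differ):

> cat /proc/self/oom_score
0
> cat /proc/self/oom_score_adj
0

A short-lived cat is not a tempting target; check the pid of a browser (or the fork bomb from footnote 5) and the score will be considerably higher.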

If you want to see the OOM killer in action, see footnote 5.


1 A delusion Gilles thankfully rid me of, see comments.


2 Here's a straightforward bit of C which asks for increasingly large chunks of memory to determine when a request for more will fail:

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
#include <stdlib.h>

#define MB (1 << 20)

/* Ask malloc() for ever larger blocks until it refuses; the memory is
   never written, so no RAM is actually consumed. */

int main (void) {
    uint64_t bytes = MB;
    void *p = malloc(bytes);
    while (p) {
        fprintf (stderr,
            "%" PRIu64 " kB allocated.\n",
            bytes / 1024
        );
        free(p);
        bytes += MB;
        p = malloc(bytes);
    }
    fprintf (stderr,
        "Failed at %" PRIu64 " kB.\n",
        bytes / 1024
    );
    return 0;
}

If you don't know C, you can compile this with gcc virtlimitcheck.c -o virtlimitcheck, then run ./virtlimitcheck. It is completely harmless, as the process doesn't use any of the space it asks for -- i.e., it never really uses any RAM.

On a 3.11 x86_64 system with 4 GB of RAM and 6 GB of swap, it failed at ~7400000 kB; the number fluctuates, so perhaps state is a factor. This is coincidentally close to the CommitLimit in /proc/meminfo, but modifying this via vm.overcommit_ratio does not make any difference. On a 3.6.11 32-bit ARM system with 448 MB of RAM and 64 MB of swap, however, it fails at ~230 MB. This is interesting since in the first case the amount is almost double the amount of RAM, whereas in the second it is about 1/4 of it -- strongly implying the amount of swap is a factor. This was confirmed by turning swap off on the first system: the failure threshold went down to ~1.95 GB, a very similar ratio to the little ARM box.
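If you want to reproduce the swap observation yourself, the whole thing is a one-liner (as root, since swapoff needs it; the last && turns swap back on when the check finishes):

> swapoff -a && ./virtlimitcheck 2>&1 | tail -n 1 && swapon -a

The single line of output is the "Failed at ... kB" figure with no swap available.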

But is this really per process? It appears to be. The short program below asks for a user defined chunk of memory, and if it succeeds, waits for you to hit return -- this way you can try multiple simultaneous instances:

#include <stdio.h>
#include <stdlib.h>

#define MB (1 << 20)

int main (int argc, const char *argv[]) {
    if (argc < 2) {
        fprintf(stderr, "Usage: %s <megabytes>\n", argv[0]);
        return 1;
    }
    unsigned long int megabytes = strtoul(argv[1], NULL, 10);
    /* Reserve the requested amount of address space, then wait so that
       other instances can be started while this one holds on to it. */
    void *p = malloc(megabytes * MB);
    fprintf(stderr, "Allocating %lu MB...", megabytes);
    if (!p) fprintf(stderr, "fail.\n");
    else {
        fprintf(stderr, "success.\n");
        getchar();
        free(p);
    }
    return 0;
}
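For example (compiling it as, say, chunkcheck -- the name is arbitrary), on the 4 GB RAM / 6 GB swap system from footnote 2 with vm.overcommit_memory=0, you would expect something like this in one terminal:

> gcc chunkcheck.c -o chunkcheck
> ./chunkcheck 6000
Allocating 6000 MB...success.

and then, while that instance is still waiting at getchar(), the same again in a second terminal:

> ./chunkcheck 6000
Allocating 6000 MB...success.

Together that is ~12 GB of reserved address space, well beyond the ~7.4 GB ceiling a single process hit above -- which is what "per process" means here.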

Beware, however, that it is not strictly about the amount of RAM and swap regardless of use -- see footnote 5 for observations about the effects of system state.


3 CommitLimit refers to the amount of address space allowed for the system when vm.overcommit_memory = 2. Presumably then, the amount you can allocate should be that minus what's already committed, which is apparently the Committed_AS field.

A potentially interesting experiment demonstrating this is to add #include <unistd.h> to the top of virtlimitcheck.c (see footnote 2), and a fork() right before the while() loop. That is not guaranteed to work as described here without some tedious synchronization, but there is a decent chance it will, YMMV:
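For reference, here is the whole thing with those two additions marked; it is otherwise identical to footnote 2:

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
#include <stdlib.h>
#include <unistd.h>   /* added for fork() */

#define MB (1 << 20)

int main (void) {
    uint64_t bytes = MB;
    void *p = malloc(bytes);
    fork();   /* added: parent and child now race each other for commit space */
    while (p) {
        fprintf (stderr,
            "%" PRIu64 " kB allocated.\n",
            bytes / 1024
        );
        free(p);
        bytes += MB;
        p = malloc(bytes);
    }
    fprintf (stderr,
        "Failed at %" PRIu64 " kB.\n",
        bytes / 1024
    );
    return 0;
}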

> sysctl vm.overcommit_memory=2
vm.overcommit_memory = 2
> cat /proc/meminfo | grep Commit
CommitLimit:     9231660 kB
Committed_AS:    3141440 kB
> ./virtlimitcheck 2&> tmp.txt
> cat tmp.txt | grep Failed
Failed at 3051520 kB.
Failed at 6099968 kB.

This makes sense -- looking at tmp.txt in detail you can see the processes alternate their bigger and bigger allocations (this is easier if you throw the pid into the output) until one, evidently, has claimed enough that the other one fails. The winner is then free to grab everything up to CommitLimit minus Committed_AS.


4 It's worth mentioning, at this point, if you do not already understand virtual addressing and demand paging, that what makes overcommitment possible in the first place is that what the kernel allocates to userland processes isn't physical memory at all -- it's virtual address space. For example, if a process reserves 10 MB for something, that's laid out as a sequence of (virtual) addresses, but those addresses do not yet correspond to physical memory. When such an address is accessed, this results in a page fault, and the kernel then maps it onto real memory so that a real value can be stored there. Processes usually reserve much more virtual space than they actually access, which allows the kernel to make the most efficient use of RAM. However, physical memory is still a finite resource, and when all of it has been mapped to virtual address space, some of it has to be paged out (or otherwise reclaimed) to free up RAM.
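If you want to see that distinction first hand, here is a small sketch (Linux-specific, since it reads /proc/self/status; note that unlike virtlimitcheck it really does touch 1 GB of RAM at the end, so don't run it on a machine that can't spare that). VmSize jumps by a gigabyte as soon as malloc() returns, but VmRSS only catches up once the memory is actually written:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK ((size_t) 1 << 30)   /* 1 GB */

/* Print this process's virtual size (VmSize) and resident size (VmRSS). */
static void report (const char *when) {
    char line[256];
    FILE *f = fopen("/proc/self/status", "r");
    if (!f) return;
    while (fgets(line, sizeof(line), f)) {
        if (!strncmp(line, "VmSize:", 7) || !strncmp(line, "VmRSS:", 6))
            fprintf(stderr, "%s %s", when, line);
    }
    fclose(f);
}

int main (void) {
    report("before malloc:");
    unsigned char *p = malloc(CHUNK);
    if (!p) {
        fprintf(stderr, "malloc failed\n");
        return 1;
    }
    report("after malloc: ");   /* VmSize is now ~1 GB larger, VmRSS is not */
    memset(p, 1, CHUNK);        /* touching the pages forces them into RAM  */
    report("after memset: ");   /* now VmRSS has grown too                  */
    free(p);
    return 0;
}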


5 First a warning: If you try this with vm.overcommit_memory=0, make sure you save your work first and close any critical applications, because the system will be frozen for ~90 seconds and some process will die!

The idea is to run a fork bomb that times out after 90 seconds, with the forks allocating space and some of them writing large amounts of data to RAM, all the while reporting to stderr.

#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/time.h>
#include <errno.h>
#include <string.h>

/* 90 second "Verbose hungry fork bomb".
Verbose -> It jabbers.
Hungry -> It grabs address space, and it tries to eat memory.

BEWARE: ON A SYSTEM WITH 'vm.overcommit_memory=0', THIS WILL FREEZE EVERYTHING
FOR THE DURATION AND CAUSE THE OOM KILLER TO BE INVOKED.  CLOSE THINGS YOU CARE
ABOUT BEFORE RUNNING THIS. */

#define STEP (1 << 30) // 1 GB
#define DURATION 90

time_t now () {
    struct timeval t;
    if (gettimeofday(&t, NULL) == -1) {
        fprintf(stderr,"gettimeofday() fail: %s\n", strerror(errno));
        return 0;
    }
    return t.tv_sec;
}

int main (void) {
    int forks = 0;
    int i;
    unsigned char *p;
    pid_t pid, self;
    time_t check;
    const time_t start = now();
    if (!start) return 1;

    while (1) {
    // Get our pid and check the elapsed time.
        self = getpid();
        check = now();
        if (!check || check - start > DURATION) return 0;
        fprintf(stderr,"%d says %d forks\n", self, forks++);
    // Fork; the child should get its correct pid.
        pid = fork();
        if (!pid) self = getpid();
    // Allocate a big chunk of space.
        p = malloc(STEP);
        if (!p) {
            fprintf(stderr, "%d Allocation failed!\n", self);
            return 0;
        }
        fprintf(stderr,"%d Allocation succeeded.\n", self);
    // The child will attempt to use the allocated space.  Using only
    // the child allows the fork bomb to proceed properly.
        if (!pid) {
            for (i = 0; i < STEP; i++) p[i] = i % 256;
            fprintf(stderr,"%d WROTE 1 GB\n", self);
        }
    }
}                        

Compile this with gcc forkbomb.c -o forkbomb. First, try it with sysctl vm.overcommit_memory=2 -- you'll probably get something like:

6520 says 0 forks
6520 Allocation succeeded.
6520 says 1 forks
6520 Allocation succeeded.
6520 says 2 forks
6521 Allocation succeeded.
6520 Allocation succeeded.
6520 says 3 forks
6520 Allocation failed!
6522 Allocation succeeded.

In this environment, this kind of fork bomb doesn't get very far. Note that the number in "says N forks" is not the total number of processes, it is the number of processes in the chain/branch leading up to that one.

Now try it with vm.overcommit_memory=0. If you redirect stderr to a file, you can do some crude analysis afterward, e.g.:

> cat tmp.txt | grep failed
4641 Allocation failed!
4646 Allocation failed!
4642 Allocation failed!
4647 Allocation failed!
4649 Allocation failed!
4644 Allocation failed!
4643 Allocation failed!
4648 Allocation failed!
4669 Allocation failed!
4696 Allocation failed!
4695 Allocation failed!
4716 Allocation failed!
4721 Allocation failed!

Only 15 processes failed to allocate 1 GB -- demonstrating that the heuristic for overcommit_memory = 0 is affected by state. How many processes were there? Looking at the end of tmp.txt, probably > 100,000. Now how many actually got to use the 1 GB?

> cat tmp.txt | grep WROTE
4646 WROTE 1 GB
4648 WROTE 1 GB
4671 WROTE 1 GB
4687 WROTE 1 GB
4694 WROTE 1 GB
4696 WROTE 1 GB
4716 WROTE 1 GB
4721 WROTE 1 GB

Eight -- which again makes sense, since at the time I had ~3 GB RAM free and 6 GB of swap.

Have a look at your system logs after you do this. You should see the OOM killer reporting scores (amongst other things); presumably this relates to oom_badness.
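The kernel writes those reports via printk, so the quickest place to look is the kernel ring buffer (or your syslog -- /var/log/syslog or /var/log/messages, depending on the distribution), e.g.:

> dmesg | grep -i "killed process"

The lines around each match contain the chosen victim's score and a table of the candidate tasks, which is the reporting referred to above.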
