Linux Memory Management – Will Linux Start Killing Processes If Memory Gets Short?

Tags: kill, linux, memory, segmentation-fault

I was running a shell script that ran several memory-intensive programs (2-5 GB) back-to-back. When I went back to check on the progress of my script, I was surprised to discover that some of my processes had been Killed, as my terminal reported to me. Several programs had already successfully completed before the ones that were later Killed had started, and all the programs after that failed with a segmentation fault (which may or may not have been due to a bug in my code; keep reading).

I looked at the usage history of the particular cluster I was using and saw that someone had started running several memory-intensive processes at the same time, exhausting the real memory (and possibly even the swap space) available to the cluster. As best I can figure, these memory-intensive processes started running around the same time I started having problems with my programs.

Is it possible that Linux killed my programs once it started running out of memory? And is it possible that the segmentation faults I got later on were due to the lack of memory available to run my programs (instead of a bug in my code)?

Best Answer

It can.

There are two different out-of-memory conditions you can encounter in Linux. Which one you encounter depends on the value of the sysctl vm.overcommit_memory (/proc/sys/vm/overcommit_memory).

Introduction:
The kernel can perform what is called 'memory overcommit'. This is when the kernel grants programs more memory than is really present in the system. It is done in the hope that the programs won't actually use all of the memory they asked for, which is quite a common occurrence.
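If you want to check which mode a given machine is in, you can read the sysctl file directly (or just cat it). Here is a minimal C sketch, assuming nothing beyond the /proc/sys/vm/overcommit_memory file mentioned above:

#include <stdio.h>

int main(void)
{
    /* vm.overcommit_memory is exposed as a plain text file under /proc */
    FILE *f = fopen("/proc/sys/vm/overcommit_memory", "r");
    if (f == NULL) {
        perror("fopen");
        return 1;
    }

    int mode;
    if (fscanf(f, "%d", &mode) == 1)
        printf("vm.overcommit_memory = %d\n", mode);

    fclose(f);
    return 0;
}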

overcommit_memory = 2

When overcommit_memory is set to 2, the kernel does not perform any overcommit at all. Instead, when a program is allocated memory, it is guaranteed to have access to that memory. If the system does not have enough free memory to satisfy an allocation request, the kernel simply returns a failure for the request. It is then up to the program to handle the situation gracefully. If the program does not check that the allocation succeeded and uses the result anyway, it will often encounter a segfault.
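As a rough sketch of what graceful handling looks like in C (the 2 GB figure is just an arbitrary example, not anything from the original setup):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    size_t size = 2UL * 1024 * 1024 * 1024;   /* 2 GB, arbitrary example */

    char *buf = malloc(size);
    if (buf == NULL) {
        /* With overcommit_memory = 2 the kernel may refuse the request;
         * malloc then returns NULL and we can fail cleanly instead of
         * segfaulting later. */
        fprintf(stderr, "allocation of %zu bytes failed\n", size);
        return 1;
    }

    memset(buf, 0, size);   /* safe: the memory is guaranteed to be there */
    free(buf);
    return 0;
}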

In the case of the segfault, you should find a line such as this in the output of dmesg:

[1962.987529] myapp[3303]: segfault at 0 ip 00400559 sp 5bc7b1b0 error 6 in myapp[400000+1000]

The at 0 means that the application tried to access address 0, i.e. it dereferenced a NULL pointer. That is a common result of a failed memory allocation call (malloc and friends return NULL on failure), but it is not the only possible cause.
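For comparison, the buggy pattern that produces a "segfault at 0" entry like the one above is simply using the result of malloc without checking it. A deliberately broken sketch (again, the size is arbitrary):

#include <stdlib.h>

int main(void)
{
    /* A large request; with overcommit_memory = 2 on a busy machine the
     * kernel may refuse it and malloc returns NULL. */
    char *buf = malloc(2UL * 1024 * 1024 * 1024);

    /* No NULL check: writing through the NULL pointer faults at address 0,
     * which is exactly the "segfault at 0" pattern shown above. */
    buf[0] = 'x';

    free(buf);
    return 0;
}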

overcommit_memory = 0 and 1

When overcommit_memory is set to 0 or 1, overcommit is enabled, and programs are allowed to allocate more memory than is really available.
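You can observe the overcommit itself with a small experiment. Assuming a 64-bit machine with far less than 64 GB of RAM (the figure is arbitrary), a sketch like this usually reports success even though the memory obviously isn't there:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* 64 GB, assumed to be well beyond the machine's RAM; requires 64-bit */
    size_t size = 64UL * 1024 * 1024 * 1024;

    char *buf = malloc(size);
    if (buf == NULL) {
        printf("allocation refused\n");
        return 1;
    }

    /* With overcommit_memory = 1 this succeeds regardless of how much RAM
     * is installed; with the heuristic mode 0 a single request this far
     * beyond RAM plus swap may still be refused. */
    printf("allocation of %zu bytes succeeded\n", size);

    free(buf);
    return 0;
}

Nothing bad happens at allocation time because the kernel only backs the pages with real memory once they are actually written to.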

However, when a program wants to use the memory it was allocated, but the kernel finds that it doesn't actually have enough memory to satisfy it, it needs to get some memory back. It first tries to perform various memory cleanup tasks, such as flushing caches, but if this is not enough it will then terminate a process. This termination is performed by the OOM-Killer. The OOM-Killer looks at the system to see what programs are using what memory, how long they've been running, who's running them, and a number of other factors to determine which one gets killed.

After the process has been killed, the memory it was using is freed up, and the program which just caused the out-of-memory condition now has the memory it needs.
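The score the OOM-Killer would currently assign to a process is exposed in /proc/<pid>/oom_score, and you can bias it through /proc/<pid>/oom_score_adj. A quick sketch that prints the score of the calling process:

#include <stdio.h>

int main(void)
{
    /* /proc/self/oom_score holds the badness score the OOM-Killer would
     * assign to this process right now; higher means more likely to be killed. */
    FILE *f = fopen("/proc/self/oom_score", "r");
    if (f == NULL) {
        perror("fopen");
        return 1;
    }

    int score;
    if (fscanf(f, "%d", &score) == 1)
        printf("current oom_score: %d\n", score);

    fclose(f);
    return 0;
}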

However, even with overcommit enabled, programs can still be denied allocation requests. When overcommit_memory is 0, the kernel uses a heuristic and refuses requests that are obviously far larger than the system could ever provide (roughly speaking, wildly in excess of RAM plus swap). When it is set to 1, the kernel always overcommits and will essentially never refuse a request on memory grounds, although allocations can still fail for other reasons, such as exhausting the process's address space.

You can see if the OOM-Killer was involved by looking at the output of dmesg and finding messages such as:

[11686.043641] Out of memory: Kill process 2603 (flasherav) score 761 or sacrifice child
[11686.043647] Killed process 2603 (flasherav) total-vm:1498536kB, anon-rss:721784kB, file-rss:4228kB