I've got a long-running program (it becomes a daemon with a daemon(3) call) that exits on signal 11 (segmentation violation) every so often. I can't tell why. So I wrote a SIGSEGV handler, set using the sigaction() system call. I set the handler function so that it has this prototype: void (*sa_sigaction)(int, siginfo_t *, void *), which means it gets a pointer to a siginfo_t structure as a formal argument.
On the occasion of a mysterious SIGSEGV, the si_code element of the siginfo_t has a value of 0x80, which means, according to the sigaction man page, "The kernel" sent the signal. This is on a Red Hat RHEL system: Linux blahblah 2.6.18-308.20.1.el5 #1 SMP Tue Nov 6 04:38:29 EST 2012 x86_64 x86_64 x86_64 GNU/Linux
Why does the kernel send a SIGSEGV? Is this from the famed OOM-Killer, or does some other reason exist for getting a SIGSEGV? As a mere user on this system, I can't see /var/log/messages, and the sysadmins are more than a bit aloof, probably because they come from a Windows background.
A SIGSEGV generated on purpose (by dereferencing a NULL pointer) does not get an si_code value of 0x80. It gets 0x1, which means "address not mapped to object".
Best Answer
The undocumented semantics of si_code = SI_KERNEL with si_errno = 0 cover a small set of kernel-detected conditions, such as an architecture-specific trap or an access to the kernel segment. All other SIGSEGVs should have si_errno set to a non-zero value. Read on for the details.

When the kernel sets up a userspace process, it defines a table of virtual memory pages for the process. When the kernel scheduler runs the process, it reconfigures the CPU's memory management unit (MMU) according to the page table for the process.
When a userspace process attempts to access memory that is outside of its page table, the CPU MMU detects this violation and generates an exception. Note that this happens at the hardware level. The kernel is not involved yet.
The kernel is set up to handle MMU exceptions. It catches the exception caused by the running process's attempt to access memory outside of its page table. The kernel then calls do_page_fault(), which sends the SIGSEGV signal to the process. This is why the signal comes from the kernel and not from the process itself or from another process.

This is a highly simplified explanation, of course. The best simple explanation that I have seen of this is the "Page Faults" section of William Gatliff's beautiful article The Linux Kernel’s Memory Management Unit API.
Note that on CPUs without an MMU, such as the Blackfin MPUs, Linux userspace processes can generally access any memory. That is, there is no SIGSEGV signal for memory violations (only for traps such as stack overflow), and debugging memory access problems can be tricky.
I second jordanm's comment regarding setting the ulimit and inspecting the core file with gdb. You can do ulimit -c unlimited from the command line if you run the process from a shell, or use the libc setrlimit system call wrapper (man setrlimit) in your program. You can set the name of the core file and its location in the file /proc/sys/kernel/core_pattern. See A.P. Lawrence's excellent gloss on this at Controlling core files (Linux). To use gdb on the core file, see this little tutorial on Steve.org.

A segmentation violation with
si_code SEGV_MAPERR (0x1) is likely a null pointer dereference, an access of non-existent memory such as 0xfffffc0000004000, or a malloc or free problem. With malloc, the usual causes are heap corruption or the process exceeding its resource limits (man getrlimit); with free, they are a double free or a free of a non-allocated address. Look at the si_errno element for more clues.

A userspace process accessing virtual memory above the
TASK_SIZE limit will cause a segmentation violation with an si_code of SI_KERNEL. In other words, the TASK_SIZE limit is the highest virtual address that any process is allowed to access. This is normally 3GB on 32-bit x86 unless the kernel is configured for high memory support. The area above the TASK_SIZE limit is referred to as the "kernel segment". See linux-2.6/arch/x86/mm/fault.c:__bad_area_nosemaphore(...) where it calls force_sig_info_fault(...).

For each architecture there are also a number of specific traps that cause a SIGSEGV with
SI_KERNEL. For x86 these are defined by the DO_ERROR macros in linux-2.6/arch/x86/kernel/traps.c.

The OOM handler sends SIGKILL, not SIGSEGV, as can be seen in function
linux-2.6/mm/oom_kill.c:oom_kill_process(...) at about line 498 for related processes and line 503 for the process that was the proximal cause of the OOM.
You can get more information by looking at the wait status of the process that was killed from its parent process, and possibly by looking at dmesg or, better, by configuring the kernel log and looking at it.