Bash – How to find out why bash exits with signal 11 (segmentation fault)

On my production server running Red Hat Linux (v6), I frequently get core dumps from bash. This happens anywhere from a couple of times a day to dozens of times a day.

TL;DR

Resolution: install the bash-debuginfo package to get more detail from the core dump and locate the statement that causes the crash.

Cause: in this case it was a bug that was not fixed in my old version of bash (lists.gnu.org/archive/html/bug-bash/2010-04/msg00038.html), reported in April 2010 against 4.1 and fixed in 4.2 (released in early 2011).
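
As a rough sketch of the resolution step: on RHEL/CentOS 6 the debug symbols can be installed with debuginfo-install (part of yum-utils), after which gdb can print a backtrace that includes arguments and local variables. The helper name bash_core_backtrace and the core file path are placeholders, not something taken from this server.

```bash
#!/usr/bin/env bash
# One-time setup on RHEL/CentOS 6, as root (debuginfo-install ships with yum-utils):
#   yum install -y yum-utils gdb
#   debuginfo-install -y bash

# Print a full backtrace (arguments and locals included) from a bash core file.
# bash_core_backtrace is a hypothetical helper name; the core path depends on your system.
bash_core_backtrace() {
  local core=$1 binary=${2:-/bin/bash}
  if [[ ! -r $core ]]; then
    echo "core file not readable: $core" >&2
    return 1
  fi
  # -batch runs the given commands and exits; "bt full" also prints local variables.
  gdb -batch -ex 'bt full' "$binary" "$core"
}

# Usage: bash_core_backtrace /path/to/core
```

With the debuginfo package installed, the ?? frames in the backtrace should resolve to function names and the string arguments become readable.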

Details
This server runs a single web application (Apache + cgi-bin) and many batch jobs.
The webapp CGI (a C program) executes system calls very often.

There is not much interactive shell use, so the core dump is probably caused by some service or by the webapp, and I need to find out what is causing this error.

The core dump backtrace is rather terse (see below).

How can I get more details about the error? I would like to know the full chain of parent processes, the current variables and the environment, and which script and/or command was being executed.

I have the audit system enabled, but the audit records for these crashes are rather terse too. Here is one example:

type=ANOM_ABEND msg=audit(1516626710.805:413350): auid=1313 uid=1313 gid=22107 ses=64579 pid=8655 comm="bash" sig=11
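
To correlate such a record with cron logs or web server logs, it can help to pull out the timestamp, pid and signal. The following is only a sketch: parse_abend is a made-up helper name, and the pattern assumes the field order shown in the record above.

```bash
#!/usr/bin/env bash
# Extract a few fields from an ANOM_ABEND audit record and print them on one line.
# Assumes the layout of the sample record: audit(epoch.ms:event) ... uid= ... pid= comm= sig=
parse_abend() {
  local line=$1
  local re='audit\(([0-9]+)\.[0-9]+:([0-9]+)\).* uid=([0-9]+).* pid=([0-9]+) comm="([^"]+)" sig=([0-9]+)'
  [[ $line =~ $re ]] || return 1
  # BASH_REMATCH[1..6] hold the six captured groups, in order.
  printf 'epoch=%s event=%s uid=%s pid=%s comm=%s sig=%s\n' "${BASH_REMATCH[@]:1}"
}

parse_abend 'type=ANOM_ABEND msg=audit(1516626710.805:413350): auid=1313 uid=1313 gid=22107 ses=64579 pid=8655 comm="bash" sig=11'
```

The epoch value can then be turned into a readable time with GNU date (date -d @1516626710) and compared against the start times of scheduled jobs.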

And this is the core backtrace:

Core was generated by `bash'.
Program terminated with signal 11, Segmentation fault.
#0  0x000000370487b8ec in free () from /lib64/libc.so.6
#0  0x000000370487b8ec in free () from /lib64/libc.so.6
#1  0x000000000044f0b0 in hash_flush ()
#2  0x0000000000458870 in assoc_dispose ()
#3  0x0000000000434f55 in dispose_variable ()
#4  0x000000000044f0a7 in hash_flush ()
#5  0x0000000000433ef3 in pop_var_context ()
#6  0x0000000000434375 in pop_context ()
#7  0x0000000000451fb1 in ?? ()
#8  0x0000000000451c84 in run_unwind_frame ()
#9  0x000000000043200f in ?? ()
#10 0x000000000042fa18 in ?? ()
#11 0x0000000000430463 in execute_command_internal ()
#12 0x000000000046b86b in parse_and_execute ()
#13 0x0000000000444a01 in command_substitute ()
#14 0x000000000044e38e in ?? ()
#15 0x0000000000448d4e in ?? ()
#16 0x000000000044a1b7 in ?? ()
#17 0x0000000000457ac8 in expand_compound_array_assignment ()
#18 0x0000000000445e79 in ?? ()
#19 0x000000000044a264 in ?? ()
#20 0x000000000042ee9f in ?? ()
#21 0x0000000000430463 in execute_command_internal ()
#22 0x000000000043110e in execute_command ()
#23 0x000000000043357e in ?? ()
#24 0x00000000004303bd in execute_command_internal ()
#25 0x0000000000430362 in execute_command_internal ()
#26 0x0000000000432169 in ?? ()
#27 0x000000000042fa18 in ?? ()
#28 0x0000000000430463 in execute_command_internal ()
#29 0x000000000043110e in execute_command ()
#30 0x000000000041d6d6 in reader_loop ()
#31 0x000000000041cebc in main ()

Update:
The system is running in a virtual machine managed by VMware.

What version of bash?
GNU bash, version 4.1.2(1)-release (x86_64-redhat-linux-gnu)
What version of libc and other libs linked to bash?

ldd (GNU libc) 2.12

(What are the other libs linked to bash? Is there a command that lists them all in one go?)
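
One way to list the linked libraries in a single command is ldd on the bash binary. This is a small sketch, assuming a dynamically linked bash and that ldd is available on the machine.

```bash
#!/usr/bin/env bash
# List the shared libraries that the bash binary on the PATH is linked against.
bash_path=$(command -v bash)
ldd "$bash_path"

# On an RPM-based system, the package versions can be queried as well:
#   rpm -q bash glibc
```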

Does this happen while running a script, an interactive shell, or both? If a script, does it happen with only one script, with several, or with any? In general terms, what kind of task is your bash script doing? Do you get segfaults from other processes? Have you run a memory test on your server? Does it have ECC RAM?

As stated in my question, I don't know, but it is probably caused by some scheduled scripts or by some system call from inside the interactive webapp.
It could also be a 'script in a script', as in this kind of construct:

myVar=$(some command here $(and here too))

However, I doubt this is a physical RAM problem: there are no other random crashes, just this one, and we see it on 2 separate VMs running on 2 separate physical machines.

Update 2:

From the stack, I suspect the issue may be related to associative arrays:

#1  0x000000000044f0b0 in hash_flush ()
#2  0x0000000000458870 in assoc_dispose ()
#3  0x0000000000434f55 in dispose_variable ()
#4  0x000000000044f0a7 in hash_flush ()

Variables of this kind are used in almost all of our custom scripts: there is one main script, used as a library, that contains the common variables and functions for our system.

This script is sourced in almost every one of our scripts.

Best Answer

I installed the debuginfo packages as suggested by gdb, and the backtrace then showed the expression responsible for the crash:

#20 0x0000000000457ac8 in expand_compound_array_assignment (
    var=<value optimized out>, 
    value=0x150c660 "$(logPath \"$@\")", flags=<value optimized out>
)

So now I know what the issue is and where it is. In my case it was in a function sourced in the .bashrc, and the root cause was this incorrect redefinition of a map variable in Bash:

declare -A myMap
local myMap=""

...
for key in "${!myMap[@]}"; do 
  echo ${myMap[$key]}
done
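
For comparison, here is a possible corrected form, as a minimal sketch. It assumes the intent was a single function-local associative array, so the map is declared once with local -A instead of being declared with declare -A and then redeclared as a plain string. The function name print_map and the keys are made up for illustration, and this has not been checked against bash 4.1.2 specifically.

```bash
#!/usr/bin/env bash
# print_map and its keys are hypothetical; only the declaration pattern matters here.
print_map() {
  # Declare the map once, as a function-local associative array.
  # (Inside a function, declare -A is already local; local -A makes that explicit.)
  local -A myMap=()
  myMap[alpha]=1
  myMap[beta]=2

  local key
  for key in "${!myMap[@]}"; do
    printf '%s=%s\n' "$key" "${myMap[$key]}"
  done | sort   # associative arrays have no defined order, so sort for stable output
}

print_map
```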

The function containing this code was called inside a subshell, which hid the 'Segmentation fault' error message.
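
Even when the message is hidden, the exit status of the subshell still shows the crash: by shell convention, a status above 128 means the command was killed by signal (status - 128), so 139 corresponds to signal 11. The sketch below uses a made-up wrapper name, run_checked, and simulates the crash with a child bash that sends itself SIGSEGV.

```bash
#!/usr/bin/env bash
# Run a command inside a command substitution and report when it died from a signal.
# run_checked is a hypothetical helper name.
run_checked() {
  local output status
  output=$("$@")   # plain assignment, so $? is the status of the substituted command
  status=$?
  if (( status > 128 )); then
    echo "'$*' was killed by signal $(( status - 128 ))" >&2
  fi
  printf '%s' "$output"
  return "$status"
}

# Simulated crash: the child disables core dumps, then sends itself SIGSEGV.
run_checked bash -c 'ulimit -c 0; kill -SEGV $$' || echo "exit status: $?"
```

Checking the status right after each command substitution in the shared library script would have pointed at the failing function without needing the core dump.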
