Linux – Process which locks up, ignores SIGKILL, is runnable (not a zombie or in uninterruptible sleep). What state is it in?

linux-kernel, process, redis

I have a process which has stopped responding several times now and appears to lock up completely. It doesn't respond to any attempt to attach strace or to peek with gdb (gdb just hangs on a wait4() syscall). The process is runnable, and is neither waiting on a syscall (/proc/X/syscall: running) nor in uninterruptible sleep (/proc/X/status: State: R (running)).

What state is this process in exactly? Is this possibly a kernel bug of some type?

The process is redis, and this has happened a few times now. The only thing that can kill the process, it seems, is a reboot. The OS is CentOS 7.

Edit: Kernel version is 3.10.0-123.13.2.el7.x86_64. Trying an update to 3.10.0-229.11.1.el7 to see if that makes any difference.

Best Answer

wait4 is the syscall a process uses to wait for one of its children to terminate or otherwise change state. This may point to an issue with signal handling.

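It's a bit brutal, but you can try signalling the app's whole process group: kill -15 -$YourRedisPID. The - before the number makes kill target the process group with that ID, that is, the process and any children still in its group, provided the process is the group leader. As it seems to be waiting for a child to terminate, this may unblock it.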

If that doesn't work, let's dig deeper: check the process's signal status with grep ^Sig /proc/$YourRedisPID/status

You'll see something like:

SigQ:   8/62777
SigPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000000080
SigCgt: 0000000180004023

As defined in fs/proc/array.c in the kernel source, SigQ is the number of signals currently queued for the process's real user ID, followed by the limit on queued signals (RLIMIT_SIGPENDING).

If the number of queued signals is close to the limit, it may indicate that your SIGKILL is not being delivered at all. I'm still reading kernel/signal.c to understand how these special signals are managed.
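
One way to check this directly is to look for SIGKILL (signal 9) in the pending masks. Bit n-1 of each hexadecimal mask stands for signal n, and signals sent to the process as a whole are listed under ShdPnd, which the grep ^Sig filter does not show. A small sketch; mask_has_signal is a hypothetical helper name, and it assumes a shell with 64-bit arithmetic:

```shell
# Succeed if the given signal number is set in a hex mask taken from
# /proc/<pid>/status.  $1 = hex mask (no 0x prefix), $2 = signal number.
mask_has_signal() {
    [ $(( (0x$1 >> ($2 - 1)) & 1 )) -eq 1 ]
}

# Example (not run here): is SIGKILL pending for the process as a whole?
# shdpnd=$(awk '/^ShdPnd:/{print $2}' "/proc/$YourRedisPID/status")
# mask_has_signal "$shdpnd" 9 && echo "SIGKILL is pending"
```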

To see the four masks (SigPnd, SigBlk, SigIgn, SigCgt) in binary, try this one-liner: awk 'BEGIN{print "ibase=16;obase=2;"} /^Sig...:/{ print toupper($2)}' /proc/$YourRedisPID/status | BC_LINE_LENGTH=0 bc

For me, this outputs:

0
0
10000000
110000000000000000100000000100011
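
Reading signal numbers off a long binary string is error-prone, so here is a sketch that lists them instead. decode_sigmask is a hypothetical helper name; it assumes a shell with 64-bit arithmetic and the usual Linux layout in which bit n-1 stands for signal n:

```shell
# Print the signal numbers set in a hex mask from /proc/<pid>/status.
# $1 = hex mask without the 0x prefix, e.g. 0000000180004023
decode_sigmask() {
    mask=$((0x$1))
    sig=1
    out=""
    while [ "$sig" -le 64 ]; do
        # Bit (sig - 1) set means signal number sig is in the mask.
        if [ $(( (mask >> (sig - 1)) & 1 )) -eq 1 ]; then
            out="$out $sig"
        fi
        sig=$((sig + 1))
    done
    echo "${out# }"
}

decode_sigmask 0000000000000080   # SigIgn above: prints 8
decode_sigmask 0000000180004023   # SigCgt above: prints 1 2 6 15 32 33
```

On x86-64 Linux, signals 1, 2, 6 and 15 are SIGHUP, SIGINT, SIGABRT and SIGTERM; kill -l lists the full mapping.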

Please start by posting the output of the awk one-liner for your process. I'll update the answer as needed.
