Why a process/program becomes a zombie


If a script runs fine from the command line, why does the same script end up in the zombie state when it is run through cron, and how would you troubleshoot this?

Here is a real example:

[root@abc ~]# ps ax | grep Z
23880 ?        Zs     0:00 [checkloadadv.sh] <defunct>
23926 pts/0    S+     0:00 grep Z
[root@abc ~]# strace -p 23880
attach: ptrace(PTRACE_ATTACH, ...): Operation not permitted
[root@abc ~]# pstree | grep  checkload
init-+-crond---crond-+-checkloadadv.sh
[root@abc ~]# bash /usr/bin/checkloadadv.sh
System Load is OK : 0.05

Best Answer


Like actual zombies, a zombie process cannot be killed, because it is already dead.

How it happens

When a process on Linux/Unix ends, everything belonging to it is removed from system memory; only the process descriptor stays. The process enters state Z (zombie). Its parent process gets a signal from the kernel, SIGCHLD, which means that one of its child processes has exited, been stopped, or resumed after being stopped (in our case it has simply exited).

The parent process now needs to execute the wait() syscall to read the exit status and other information from its child process. Then the descriptor is removed from memory and the process is no longer a zombie.

If the parent process never calls the wait() syscall, the zombie's process descriptor stays in memory and eats brains. Normally you don't see zombie processes, because the procedure above takes very little time.
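To watch this life cycle happen, here is a minimal sketch in Python (my choice of language, not something from the question). It assumes Linux with /proc mounted and the default SIGCHLD disposition; the helper names proc_state and make_and_reap_zombie are made up for this illustration. The parent forks a child that exits at once, sees it in state Z, and only then calls wait():

```python
import os
import time


def proc_state(pid):
    """Return the one-letter state of a process from /proc/<pid>/stat (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        stat = f.read()
    # The command name sits in parentheses and may itself contain spaces
    # or ')', so split on the LAST ')' before reading the state field.
    return stat.rsplit(")", 1)[1].split()[0]


def make_and_reap_zombie(exit_code=7):
    """Fork a child that exits at once, observe it as a zombie, then reap it."""
    pid = os.fork()
    if pid == 0:
        os._exit(exit_code)  # child: terminate immediately

    # Parent: poll until the kernel reports the child as 'Z' (give up after 5 s).
    deadline = time.monotonic() + 5
    state = proc_state(pid)
    while state != "Z" and time.monotonic() < deadline:
        time.sleep(0.01)
        state = proc_state(pid)

    # waitpid() reads the exit status; only now is the descriptor released.
    _, status = os.waitpid(pid, 0)
    still_listed = os.path.exists(f"/proc/{pid}")
    return state, os.WEXITSTATUS(status), still_listed


if __name__ == "__main__":
    state, code, still_listed = make_and_reap_zombie()
    print(f"state before wait(): {state}")
    print(f"exit status read by wait(): {code}")
    print(f"still in /proc after wait(): {still_listed}")
```

Until waitpid() runs, the child would show up as <defunct> in ps, just like the script in the question.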

The dawn of the dead

Each process descriptor needs only a very small amount of memory, so a few zombies are not very dangerous (like in real life). One problem is that each zombie process keeps its process ID, and a Linux/Unix operating system has a limited number of PIDs. If badly written software generates a lot of zombie processes, new processes may fail to start because no more process IDs are available.

So in huge groups they are very dangerous (as many movies demonstrate very well).

How can we defend ourselves against a horde of zombies?

A shot in the head would work, but I don't know the command for that (SIGKILL won't work because the process is already dead).

Well, you can send SIGCHLD via kill to the parent process, but what if it does not react to the signal by calling wait()? Then your only option is to kill the parent process, so that the init process "adopts" the zombie. Init periodically calls the wait() syscall to clean up its zombie children.
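Before signalling or killing anything you need to know who the parent is. As a rough sketch (Linux only, and the function names parse_stat and list_zombies are invented for this example), you can scan /proc for processes in state Z and read the parent PID from the same stat file:

```python
import os


def parse_stat(stat_text):
    """Extract (name, state, ppid) from the contents of /proc/<pid>/stat."""
    # The name is enclosed in parentheses and may contain ')' itself,
    # so use the first '(' and the last ')' as delimiters.
    name = stat_text[stat_text.index("(") + 1 : stat_text.rindex(")")]
    fields = stat_text[stat_text.rindex(")") + 1 :].split()
    return name, fields[0], int(fields[1])


def list_zombies(proc_root="/proc"):
    """Return (pid, ppid, name) for every process currently in state Z."""
    zombies = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                name, state, ppid = parse_stat(f.read())
        except OSError:
            continue  # the process went away while we were scanning
        if state == "Z":
            zombies.append((int(entry), ppid, name))
    return zombies


if __name__ == "__main__":
    for pid, ppid, name in list_zombies():
        print(f"zombie {pid} ({name}) is waiting for parent {ppid}")
```

The ppid column is the process that has to call wait(), or the one to restart if it never does.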

In your case

In your case, you have to send SIGCHLD to the crond process:

root@host:~# strace -p $(pgrep cron)
Process 1180 attached - interrupt to quit

Then from another terminal:

root@host:~$ kill -17 $(pgrep cron)

The output is:

restart_syscall(<... resuming interrupted call ...>) = ? ERESTART_RESTARTBLOCK (To be restarted)
--- SIGCHLD (Child exited) @ 0 (0) ---
wait4(-1, 0x7fff51be39dc, WNOHANG, NULL) = -1 ECHILD (No child processes) <-- Here it happens
rt_sigreturn(0xffffffffffffffff)        = -1 EINTR (Interrupted system call)
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=1892, ...}) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {0x403170, [CHLD], SA_RESTORER|SA_RESTART, 0x7fd6a7e9d4a0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({42, 0}, ^C <unfinished ...>
Process 1180 detached

You can see that the wait4() syscall returns -1 ECHILD, which means that there is no child process left to wait for. So the conclusion is: cron reacts to the SIGCHLD signal and should not cause the apocalypse.
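The wait4(-1, ..., WNOHANG) call in the trace is the usual pattern for a parent that cleans up after its children. The following is only a sketch of that pattern in Python, not cron's actual code; spawn_and_reap and on_sigchld are hypothetical names, and it has to run in the main thread because that is where Python allows signal handlers to be installed:

```python
import os
import signal
import time


def spawn_and_reap(n=3, timeout=5.0):
    """Fork n short-lived children and reap them from a SIGCHLD handler."""
    reaped = {}  # pid -> exit code, filled in by the handler

    def on_sigchld(signum, frame):
        # One SIGCHLD can stand for several exited children, so loop.
        while True:
            try:
                pid, status = os.waitpid(-1, os.WNOHANG)
            except ChildProcessError:
                return  # ECHILD: no children at all, as in the strace output
            if pid == 0:
                return  # children exist, but none has exited yet
            if os.WIFEXITED(status):
                reaped[pid] = os.WEXITSTATUS(status)
            else:
                reaped[pid] = -os.WTERMSIG(status)  # killed by a signal

    previous = signal.signal(signal.SIGCHLD, on_sigchld)
    try:
        pids = []
        for i in range(n):
            pid = os.fork()
            if pid == 0:
                os._exit(i)  # child: exit with a recognisable status
            pids.append(pid)
        # Parent: wait until the handler has collected every child.
        deadline = time.monotonic() + timeout
        while len(reaped) < n and time.monotonic() < deadline:
            time.sleep(0.01)
    finally:
        # Put back whatever handler was installed before.
        signal.signal(signal.SIGCHLD,
                      previous if previous is not None else signal.SIG_DFL)
    return pids, reaped


if __name__ == "__main__":
    pids, reaped = spawn_and_reap()
    for pid in pids:
        print(f"child {pid} reaped with exit code {reaped.get(pid)}")
```

WNOHANG keeps the handler from blocking when some children are still running, and the loop matters because several exits can be reported by a single signal.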

Related Solutions

Removing zombie process from the process table

Manipulating the process table and the memory mappings is always the kernel's job. The kernel acts when some process makes a system call. When a process exits, all of the resources that it uses, including memory, are released, except for its entry in the process table; that is what the _exit system call does. Then, when the parent process calls wait or waitpid, part of that system call's job is to clean up the process table entry. The parent process may decide to call wait whenever it wants (if the parent is init, it calls wait pretty much all the time).

Way to identify which process turns into Zombie process

The audit subsystem of the Linux kernel can be very useful to figure out what processes are becoming zombie processes. I just had the following situation:

server ~ # ps -ef --forest
[...]
root     16385     1  0 17:04 ?        00:00:00 /usr/sbin/apache2 -k start
root     16388 16385  0 17:04 ?        00:00:00  \_ /usr/bin/perl -T -CSDAL /usr/lib/iserv/apache_user
root     16389 16385  0 17:04 ?        00:00:00  \_ /usr/bin/perl -T -CSDAL /usr/lib/iserv/apache_user
www-data 16415 16385  0 17:04 ?        00:00:00  \_ /usr/sbin/apache2 -k start
www-data 18254 16415  0 17:23 ?        00:00:00  |   \_ [sh] <defunct>
www-data 18347 16415  0 17:23 ?        00:00:00  |   \_ [sh] <defunct>
www-data 22966 16415  0 18:18 ?        00:00:00  |   \_ [sh] <defunct>
www-data 16583 16385  0 17:05 ?        00:00:01  \_ /usr/sbin/apache2 -k start
www-data 18306 16583  0 17:23 ?        00:00:00  |   \_ [sh] <defunct>
www-data 18344 16583  0 17:23 ?        00:00:00  |   \_ [sh] <defunct>
www-data 17561 16385  0 17:12 ?        00:00:00  \_ /usr/sbin/apache2 -k start
www-data 22983 17561  0 18:18 ?        00:00:00  |   \_ [sh] <defunct>
www-data 18318 16385  0 17:23 ?        00:00:00  \_ /usr/sbin/apache2 -k start
www-data 19725 16385  0 17:43 ?        00:00:01  \_ /usr/sbin/apache2 -k start
www-data 22638 16385  0 18:13 ?        00:00:00  \_ /usr/sbin/apache2 -k start
www-data 22659 16385  0 18:14 ?        00:00:00  \_ /usr/sbin/apache2 -k start
www-data 25102 16385  0 18:41 ?        00:00:00  \_ /usr/sbin/apache2 -k start
www-data 25175 16385  0 18:42 ?        00:00:00  \_ /usr/sbin/apache2 -k start
www-data 25272 16385  0 18:44 ?        00:00:00  \_ /usr/sbin/apache2 -k start

The cause for these zombie processes is most probably a PHP script, but as these Apache child processes are processing lots of HTTP requests and lots of different PHP scripts, it's very hard to figure out which one could be responsible. Linux has also already deallocated important information of these zombie processes, so we don't even have /proc/<pid>/cmdline to figure out which script or -c command /bin/sh may have been running:

server ~ # cat /proc/18254/cmdline 
server ~ #

To figure it out, I've installed auditd: https://linux-audit.com/configuring-and-auditing-linux-systems-with-audit-daemon/

I set up the following audit rules:

auditctl -a always,exit -F arch=b32 -S execve -F path=/bin/dash
auditctl -a always,exit -F arch=b64 -S execve -F path=/bin/dash

These rules audit all process creations of the /bin/dash binary. /bin/sh doesn't work here, because it's a symlink and audit apparently only sees the target file name:

server ~ # ls -l /bin/sh
lrwxrwxrwx 1 root root 4 Nov  8  2014 /bin/sh -> dash*

A simple test should now produce audit logs in /var/log/audit/audit.log (I've taken the liberty of adding a lot of line breaks to improve readability):

server ~ # sh -c 'echo test'
test

server ~ # tail -f /var/log/audit/audit.log
[...]
type=SYSCALL msg=audit(1488219335.976:43871): arch=40000003 syscall=11 \
  success=yes exit=0 a0=ffdca3ec a1=f7760e58 a2=ffdd399c a3=ffdca068 items=2 \
  ppid=27771 pid=27800 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 \
  fsgid=0 tty=pts7 ses=7532 comm="sh" exe="/bin/dash" key=(null)
type=EXECVE msg=audit(1488219335.976:43871): argc=3 a0="sh" a1="-c" \
  a2=6563686F2074657374
type=CWD msg=audit(1488219335.976:43871):  \
  cwd="/var/lib/iserv/remote-support/iserv-martin.von.wittich"
type=PATH msg=audit(1488219335.976:43871): item=0 name="/bin/sh" inode=10403900 \
  dev=08:01 mode=0100755 ouid=0 ogid=0 rdev=00:00 nametype=NORMAL
type=PATH msg=audit(1488219335.976:43871): item=1 name=(null) inode=5345368 \
  dev=08:01 mode=0100755 ouid=0 ogid=0 rdev=00:00 nametype=NORMAL
type=PROCTITLE msg=audit(1488219335.976:43871): \
  proctitle=7368002D63006563686F2074657374

Lots of the information is encoded, but ausearch can translate it with -i:

server ~ # ausearch -i -x /bin/dash | tail                                      
[...]
----
type=PROCTITLE msg=audit(27.02.2017 19:15:35.976:43871) : proctitle=sh 
type=PATH msg=audit(27.02.2017 19:15:35.976:43871) : item=1 name=(null) \
  inode=5345368 dev=08:01 mode=file,755 ouid=root ogid=root rdev=00:00 \
  nametype=NORMAL 
type=PATH msg=audit(27.02.2017 19:15:35.976:43871) : item=0 name=/bin/sh \
  inode=10403900 dev=08:01 mode=file,755 ouid=root ogid=root rdev=00:00 \
  nametype=NORMAL 
type=CWD msg=audit(27.02.2017 19:15:35.976:43871) :  \
  cwd=/var/lib/iserv/remote-support/iserv-martin.von.wittich 
type=EXECVE msg=audit(27.02.2017 19:15:35.976:43871) : argc=3 a0=sh a1=-c \
  a2=echo test 
type=SYSCALL msg=audit(27.02.2017 19:15:35.976:43871) : arch=i386 \
  syscall=execve success=yes exit=0 a0=0xffdca3ec a1=0xf7760e58 a2=0xffdd399c \
  a3=0xffdca068 items=2 ppid=27771 pid=27800 auid=root uid=root gid=root \
  euid=root suid=root fsuid=root egid=root sgid=root fsgid=root tty=pts7 \
  ses=7532 comm=sh exe=/bin/dash key=(null) 
----

If you don't want to restrict the ausearch filtering to /bin/dash, you can also use ausearch -i -m ALL to translate the complete log. Another good filter would be ausearch -i -p <PID of a zombie process>, in this case ausearch -i -p 27800.
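If ausearch is not at hand, the hex-encoded fields from the raw log (the a2= and proctitle= values) can also be decoded manually. A small Python sketch, where decode_audit_hex is just a name made up for this example; it assumes the field is plain hex and that NUL bytes separate the arguments, as they do in the proctitle value above:

```python
def decode_audit_hex(value):
    """Decode a hex-encoded audit field into its NUL-separated parts."""
    raw = bytes.fromhex(value)
    return [part.decode("utf-8", errors="replace") for part in raw.split(b"\0")]


if __name__ == "__main__":
    # Values copied from the raw audit.log excerpt above.
    print(decode_audit_hex("6563686F2074657374"))
    print(decode_audit_hex("7368002D63006563686F2074657374"))
```

For the two values from the log this gives ['echo test'] and ['sh', '-c', 'echo test'], so the raw proctitle field carries the whole command line.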

Just leave these rules in place until new zombie processes show up, and then search for the process creation of a zombie PID:

ausearch -i -p <PID>

This should be very helpful to identify the root cause of the zombie processes. In my case it was a PHP script that used proc_open to spawn a Perl script without closing the handle with proc_close.