There are 5 processes which can't be killed by kill -9 $PID, and executing cat /proc/$PID/cmdline will hang the current session. Maybe they're zombie processes.
Executing ps -ef or htop will also hang the current session, but top and ps -e are working fine.
So it seems that there are two problems: the processes that can't be killed, and the filesystem not responding.
This is a production machine running virtual machines, so rebooting isn't an option.
The following process IDs are affected:
16181 16765 5985 7427 7547
The parent of these processes is init:
├─collectd(16765)─┬─{collectd}(16776)
│                 ├─{collectd}(16777)
│                 ├─{collectd}(16778)
│                 ├─{collectd}(16779)
│                 ├─{collectd}(16780)
│                 └─{collectd}(16781)
├─collectd(28642)───{collectd}(28650)
├─collectd(29868)─┬─{collectd}(29873)
│                 ├─{collectd}(29874)
│                 ├─{collectd}(29875)
│                 └─{collectd}(29876)
And one of the qemu processes is not working:
|-qemu-system-x86(16181)-+-{qemu-system-x86}(16232)
|                        |-{qemu-system-x86}(16238)
|                        |-{qemu-system-x86}(16803)
|                        |-{qemu-system-x86}(17990)
|                        |-{qemu-system-x86}(17991)
|                        |-{qemu-system-x86}(17992)
|                        |-{qemu-system-x86}(18062)
|                        |-{qemu-system-x86}(18066)
|                        |-{qemu-system-x86}(18072)
|                        |-{qemu-system-x86}(18073)
|                        |-{qemu-system-x86}(18074)
|                        |-{qemu-system-x86}(18078)
|                        |-{qemu-system-x86}(18079)
|                        |-{qemu-system-x86}(18086)
|                        |-{qemu-system-x86}(18088)
|                        |-{qemu-system-x86}(18092)
|                        |-{qemu-system-x86}(18107)
|                        |-{qemu-system-x86}(18108)
|                        |-{qemu-system-x86}(18111)
|                        |-{qemu-system-x86}(18113)
|                        |-{qemu-system-x86}(18114)
|                        |-{qemu-system-x86}(18119)
|                        |-{qemu-system-x86}(23147)
|                        `-{qemu-system-x86}(27051)
Best Answer
You don't have zombies.
cat /proc/$PID/cmdline wouldn't have any problem with a zombie. If kill -9 doesn't kill the program, it means the program is doing some uninterruptible I/O operation. That usually indicates one of three things: a network filesystem that isn't responding, a kernel bug, or a hardware bug.
Utilities such as ps may hang if they try to read some information, such as the process executable path, that the kernel isn't providing for one of the reasons above.
Try cat /proc/16181/syscall to see what process 16181 is doing. This may or may not work depending on how far gone your system is.
If the problem is a network filesystem, you may be able to force-unmount it, or to bring it back online. If the problem is a kernel or hardware bug, what you can do will depend on the nature of the bug. Rebooting (and upgrading to a fixed kernel, or replacing the broken hardware) is strongly recommended.
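As a rough illustration of the diagnosis described above, the following sketch lists processes in uninterruptible sleep (state D) and the network filesystems currently mounted. The function names and the list of filesystem types are hypothetical choices made for this example, not part of the original answer. It assumes a procps-style ps and a Linux /proc, and that ps needs only the stat and wchan data for these columns rather than cmdline, which is what seemed to hang here. On a badly wedged system any of these commands may still block.

```shell
#!/bin/sh
# Sketch only: helper names are made up for this example.

# Keep only lines whose second field (the process state) starts with D,
# i.e. uninterruptible sleep. Expected input: "PID STAT WCHAN COMM".
filter_d_state() {
    awk '$2 ~ /^D/'
}

# Print D-state processes with the kernel function they are waiting in.
# Assumption: these columns do not require reading /proc/PID/cmdline.
list_d_state() {
    ps -eo pid=,stat=,wchan=,comm= | filter_d_state
}

# Keep only network filesystems from input in /proc/mounts format
# ("device mountpoint fstype options ..."). The type list is illustrative,
# not exhaustive.
filter_network_mounts() {
    awk '$3 ~ /^(nfs|nfs4|cifs|ceph|9p|fuse\.sshfs|fuse\.glusterfs)$/ { print $2, $3 }'
}

# Read the kernel's mount table; this does not access the mounts themselves.
list_network_mounts() {
    filter_network_mounts < /proc/mounts
}

# Example calls (left commented out so sourcing the file has no side effects):
# list_d_state
# list_network_mounts
```

Calling list_d_state prints one line per stuck process, and the wait-channel column often hints at which subsystem is involved. For a mount point reported by list_network_mounts, umount -f (force) and umount -l (lazy) are the usual things to try; whether either succeeds depends on the state of the mount.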