Linux – Find the cause of a permanently-blocked I/O (process in uninterruptible sleep)

Tags: debugging, io, linux, process

Under Linux, I have a process that is blocked in uninterruptible sleep (state D). How can I investigate what's causing this?

I am running an “ordinary” kernel (a Debian build), without any special debugging features.

There is no relevant log entry — in fact nothing got logged between the time the process started and the time I noticed it.

strace can't even attach to the process since it's in uninterruptible sleep. And even if I knew what system call was called, that wouldn't necessarily help me. I need to know what's going on inside the kernel.

Specifically, the sync command goes into uninterruptible sleep 🙁 So I must have an I/O problem somewhere but all my filesystems appear to work normally. There may well be an old log entry about an I/O error but I can't find it (this machine hasn't rebooted in a long time, that's a lot of log entries). Can I at least know which subsystem is blocking sync? For example, get a kernel backtrace for the kernel thread corresponding to a particular PID/TID?

(I'm sure that rebooting would either fix this or reveal the error but I'm asking how to investigate this, not how to blindly press a button.)

Best Answer

It's a bit late, but it could be helpful for others.

What I did:

  1. Run cat /proc/PID/stack to see where in the kernel the task is blocked. In my case the trace pointed at the filesystem code: the task was waiting on a page while evicting an inode:
[<ffffffff83bbd6f1>] wait_on_page_bit+0x81/0xa0            
[<ffffffff83bced9b>] truncate_inode_pages_range+0x42b/0x750
[<ffffffff83bcf12f>] truncate_inode_pages_final+0x4f/0x60  
[<ffffffff83c6b78c>] evict+0x16c/0x180                     
[<ffffffff83c6bafc>] iput+0xfc/0x190                       
[<ffffffff83c66498>] __dentry_kill+0x158/0x1d0             
[<ffffffff83c66b35>] dput+0xb5/0x1a0                       
[<ffffffff83c4f53d>] __fput+0x18d/0x230                    
[<ffffffff83c4f6ce>] ____fput+0xe/0x10                     
[<ffffffff83ac31cb>] task_work_run+0xbb/0xe0               
[<ffffffff83a2cc65>] do_notify_resume+0xa5/0xc0            
[<ffffffff8419322f>] int_signal+0x12/0x17                  
[<ffffffffffffffff>] 0xffffffffffffffff                    
  2. Run cat /proc/PID/syscall to get the current system call:
3 0x6 0x1ae4bc6d 0x1 0x559320c 0x801df5 0x60161c4e 0x7ffccee38ae0 0x7fcf1a1547bd

The first field, 3, is the syscall number, which is close on x86_64. The second field, 0x6, is the first argument of the syscall, here the file descriptor. So the process was blocked in close(6).
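
The decoding above can be sketched in a few lines of Python. This is only an illustration: the function name parse_syscall_line and the four-entry syscall table are made up for this example, and the table assumes an x86_64 machine (other architectures number syscalls differently).

```python
# Field layout of /proc/PID/syscall for a task blocked inside a system call:
#   <number> <arg0> ... <arg5> <stack pointer> <program counter>
# Other possible contents are "running" and "-1 <sp> <pc>"
# (blocked, but not inside a system call).

# Deliberately tiny excerpt of the x86_64 syscall table (assumption: x86_64).
X86_64_SYSCALLS = {0: "read", 1: "write", 2: "open", 3: "close"}


def parse_syscall_line(line):
    """Decode one line read from /proc/PID/syscall into a dict."""
    fields = line.split()
    if not fields or fields[0] == "running":
        return {"state": "running"}
    number = int(fields[0])
    if number == -1:
        return {
            "state": "blocked-outside-syscall",
            "sp": int(fields[1], 16),
            "pc": int(fields[2], 16),
        }
    # Everything after the syscall number is printed in hexadecimal.
    values = [int(field, 16) for field in fields[1:]]
    return {
        "state": "in-syscall",
        "number": number,
        "name": X86_64_SYSCALLS.get(number, "unknown"),
        "args": values[:6],
        "sp": values[6],
        "pc": values[7],
    }


if __name__ == "__main__":
    sample = ("3 0x6 0x1ae4bc6d 0x1 0x559320c 0x801df5 0x60161c4e "
              "0x7ffccee38ae0 0x7fcf1a1547bd")
    info = parse_syscall_line(sample)
    print(info["name"], "fd =", info["args"][0])
```

For a real task you would pass the contents of /proc/PID/syscall instead of the sample string. Only the first argument is meaningful for close; the remaining argument fields are whatever happened to be in the registers.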

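  3. Run lsof -p PID to see which file the descriptor refers to. In my case descriptor 6 was not listed, presumably because close had already removed it from the process's descriptor table.
  4. If you are lucky and the file is opened early in your application's startup, you can start another instance of the application and inspect that file with lsof. That worked in my case.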
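
If you do not yet know which task to look at, step 1 can be applied to every thread at once. The following is a rough sketch, not a polished tool: the helper names are invented for this example, it assumes the kernel exposes /proc/PID/task/TID/stack (which needs stack tracing support in the kernel), and reading that file normally requires root.

```python
import os


def read_text(path):
    """Return the file's contents, or None if it vanished or is unreadable."""
    try:
        with open(path) as handle:
            return handle.read()
    except OSError:
        return None


def task_state(status_text):
    """Extract the one-letter state code from the text of a status file."""
    for line in status_text.splitlines():
        if line.startswith("State:"):
            parts = line.split()
            return parts[1] if len(parts) > 1 else None
    return None


def blocked_tasks(proc_root="/proc"):
    """Yield (pid, tid, kernel_stack) for every thread currently in state D."""
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return  # no procfs available at this path
    for pid in sorted(filter(str.isdigit, entries), key=int):
        task_dir = os.path.join(proc_root, pid, "task")
        try:
            tids = os.listdir(task_dir)
        except OSError:
            continue  # the process exited while we were scanning
        for tid in sorted(filter(str.isdigit, tids), key=int):
            base = os.path.join(task_dir, tid)
            status = read_text(os.path.join(base, "status"))
            if status is None or task_state(status) != "D":
                continue
            # kernel_stack is None when the stack file cannot be read.
            yield int(pid), int(tid), read_text(os.path.join(base, "stack"))


if __name__ == "__main__":
    for pid, tid, stack in blocked_tasks():
        print(f"pid {pid} tid {tid}")
        print(stack if stack else "  (stack not readable; try as root)")
```

Tasks enter state D briefly during normal I/O, so a single hit is not necessarily a problem; running the scan a few times and looking for tasks that stay blocked with the same stack is more telling.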

Good luck!
