Linux – Why can’t we kill an uninterruptible (D state) process?

linux-kernel, process, process-management

I often have issues with processes stuck in D state, due to NFS shares behind firewalls. If I lose the connection, processes get stuck in D state and I can't kill them; the only solution becomes a hard reboot. I was wondering if there are any other ways, but all the solutions and information I can find say "you just can't kill it". Everyone seems to be fine with this and to accept it the way it is. I am a bit critical of that. I thought there must be a way to scrape the process out of memory so that there is no need for a reboot. It is very annoying when this happens often. And if the resource happens to return the I/O later, it can simply be ignored in this case. Why isn't this possible? The Linux kernel is IMHO very advanced, and you should be able to do things like this. Especially on servers…

I could not find a satisfying answer as to why this isn't, or can't, be implemented.

I would also be interested in answers regarding programming and of algorithmic nature, which would explain this issue.

Best Answer

Killing a process while it's in a system call is possible, and it mostly works. What's difficult is to make it work all the time. Going from 99.99% to 100% is the difficult part.
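The common, working case is easy to see from userspace. A process blocked in an ordinary *interruptible* system call (state S) dies promptly on SIGKILL, because the kernel checks for pending signals before putting the task back to sleep. Here is a small sketch (plain Python on Linux, standard library only); D state is precisely the case where the kernel never reaches this signal-delivery point:

```python
import os, signal, time

# A child blocked reading from an empty pipe sits in an interruptible
# sleep (state S). SIGKILL is delivered at the kernel's next signal
# check, so the child dies promptly.
r, w = os.pipe()
pid = os.fork()
if pid == 0:
    os.read(r, 1)        # blocks in an interruptible sleep
    os._exit(0)          # never reached

time.sleep(0.5)          # let the child enter the system call
os.kill(pid, signal.SIGKILL)
_, status = os.waitpid(pid, 0)
if os.WIFSIGNALED(status) and os.WTERMSIG(status) == signal.SIGKILL:
    print("child killed while blocked in read()")
```

A task in uninterruptible sleep (D) has deliberately told the scheduler not to wake it for signals, so the same `kill` is queued but never acted on until the sleep ends.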

Normally, when a process is killed, all the resources that it uses are freed. If there's any I/O going on with the process, the code doing this I/O is notified and it exits, allowing the resources that it's using to be freed.

Uninterruptible sleep becomes visible when “the code is notified and it exits” takes a non-negligible amount of time. This means that the code isn't working as it should: it's a bug. Yes, it's theoretically possible to write code without bugs, but it's practically impossible.

You say “if the resource happens to return the IO, it can simply be ignored”. Well, fine. But suppose, for example, that a peripheral has been programmed to write to memory belonging to the process. To kill the process without cancelling the request to the peripheral, that memory must be kept in use somehow; you can't just get rid of it. Some resources must stay around. And freeing the other resources can only be done if the kernel knows which resources are safe to free, which requires the code to be written in such a way that it's always possible to tell. The cases where uninterruptible sleep lasts a visible amount of time are cases where it's impossible to tell, and the only safe thing is to wait.
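A toy model of why that memory can't simply be reclaimed (illustrative Python, with a thread standing in for a DMA-capable device — this is not kernel code): if the “process” is torn down and its buffer handed to a new owner while the device still holds a pointer into it, the late write lands in the new owner's data:

```python
import threading, time

# Toy model: a "device" writes into a buffer it was given, on its own
# schedule. The kernel can't free or reuse that memory until the device
# is done, or this is what happens.
buf = bytearray(4)

def device_dma(target):
    time.sleep(0.2)          # the I/O completes later, asynchronously
    target[:] = b"DMA!"      # the device still holds a pointer

t = threading.Thread(target=device_dma, args=(buf,))
t.start()

# Simulate killing the process and reusing its memory immediately:
buf[:] = b"heap"             # the next owner's data
t.join()
print(bytes(buf))            # the late write clobbered the new owner
```

This is why the kernel must keep such pages pinned until the request is known to have completed or been cancelled, and why it must be able to *tell* when that is.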

It is possible to design an operating system where killing a process is guaranteed to work (under certain assumptions about the hardware working correctly). For example, a hard real-time operating system guarantees that killing a process takes at most a certain fixed amount of time (assuming it offers a kill facility at all). But this is difficult, especially if the operating system must also support a wide range of peripherals and offer good common-case performance. Linux favors common-case behavior over worst-case behavior in many ways.
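As a sketch of the real-time idea (illustrative Python, not how any particular RTOS is implemented): if every sleep in the system is bounded, a kill request is noticed and honoured within a known worst-case time, at the cost of waking up repeatedly even when there is nothing to do:

```python
import threading

kill_requested = threading.Event()
work_ready = threading.Event()

def worker():
    while not kill_requested.is_set():
        # Never block unboundedly: wake at least every 0.1 s so a kill
        # request is noticed within 0.1 s at worst.
        if work_ready.wait(timeout=0.1):
            work_ready.clear()
            # ... process the work item here ...

t = threading.Thread(target=worker)
t.start()
kill_requested.set()        # "kill": worker exits within one wait period
t.join(timeout=1.0)
print("worker stopped:", not t.is_alive())
```

The periodic wakeups are exactly the kind of worst-case-over-common-case trade-off the answer describes Linux declining to make.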

Getting all the code paths covered is extremely difficult, especially when there wasn't a stringent framework for doing so from day 1. In the grand scheme of things, unkillable processes are extremely rare (you don't notice when they don't happen); they are a symptom of buggy drivers. A finite amount of effort has been put into writing Linux drivers. Eliminating more cases of prolonged uninterruptible sleep would either require more people on the task, or lead to less supported hardware and worse performance.
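If you want to see which driver or kernel path a hung task is actually stuck in, `/proc` already tells you: the third field of `/proc/[pid]/stat` is the state letter, and `/proc/[pid]/wchan` names the kernel function the task is sleeping in. A small sketch (standard-library Python, Linux only):

```python
import os, re

def hung_tasks():
    """Return (pid, comm, wchan) for every process in D state.

    wchan names the kernel function the task is blocked in, which
    usually points at the responsible driver or filesystem.
    """
    if not os.path.isdir("/proc"):
        return []  # non-Linux system: nothing to scan
    out = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat") as f:
                stat = f.read()
            with open(f"/proc/{entry}/wchan") as f:
                wchan = f.read().strip()
        except OSError:
            continue  # the process exited during the scan
        # comm is parenthesised and may itself contain ')', so match
        # greedily to the last ')'; the state letter follows it.
        m = re.match(r"\d+ \((.*)\) (\S)", stat, re.DOTALL)
        if m and m.group(2) == "D":
            out.append((int(entry), m.group(1), wchan))
    return out

for pid, comm, wchan in hung_tasks():
    print(pid, comm, wchan)
```

On a healthy machine this prints nothing; when NFS hangs you, the `wchan` column will typically show an NFS or RPC wait function, which is the "buggy driver" the answer is talking about.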