Linux – How to Investigate Cause of Total System Hang

arch linux, forensics, kernel-panic, linux, logs

My Arch machine sometimes hangs, suddenly not responding in any way to the mouse or the keyboard. The cursor is frozen. Ctrl-Alt-Backspace won't stop X11, and Ctrl-Alt-Del does exactly nothing. The CPU, network, and disk activity plots in conky and icewm stop updating. In a few minutes the fan turns on. The only way to make the computer do anything at all is to turn off power.

When it boots back up, the CPU temperature monitors show 70 to 80C. Before the hang, I was usually doing low-intensity activity like web surfing, with the CPU at around 50C.

The logs show nothing special compared to a normal shutdown. A memory checker runs fine with zero defects.

How can I investigate why it hung? Is there extra information I can find for a clue? Is there anything less drastic than a power-off that would get some kind of response, even just a limited shell or some beeps, that might give a clue?

The machine is a Gateway P6860 17" laptop (bulky but powerful) and it's running 64-bit Arch, up to date (as of March 2011). I had Arch for a long time without this problem, switched to Ubuntu for about a week, then retreated back to a fresh install of Arch. That's when the hangs started.

UPDATE: Yeah, for sure it's overheating. At one temperature, the mouse and keyboard stop working, sometimes becoming functional after several minutes of cooling off. At a higher temperature, worse things happen, like total nonresponsiveness including ignoring SysRq. This condition is shortly followed by a sudden power-off. I have solved the problem by buying a new computer 8D

Best Answer

Frederik's answer involving magic SysRq and kernel dumps will work if the kernel is still running, and not truly hung. The kernel might just be busy-looping for some reason.

The fact that it doesn't respond to Ctrl-Alt-Del tells me that probably isn't the case, and that the machine is locking up hard. That means hardware failure, or something closely related, like a bad driver.
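Before ruling out the magic SysRq route entirely, it may be worth confirming that SysRq is enabled at all, since a disabled SysRq looks the same from the keyboard as a hung kernel. Here is a minimal Python sketch that decodes the bitmask in /proc/sys/kernel/sysrq. The function names are my own, and the bit meanings are as I understand them from the kernel's SysRq documentation, so check them against your kernel version.

```python
# Sketch: decode the SysRq bitmask to see which SysRq functions are allowed.
# Assumption: bit meanings follow the kernel's SysRq documentation.

SYSRQ_BITS = {
    2: "control of console logging level",
    4: "control of keyboard (SAK, unraw)",
    8: "debugging dumps of processes",
    16: "sync command",
    32: "remount read-only",
    64: "signalling of processes (term, kill, oom-kill)",
    128: "reboot/poweroff",
    256: "nicing of all real-time tasks",
}


def describe_sysrq(value):
    """Return the list of SysRq function groups enabled by `value`."""
    if value == 0:
        return []  # SysRq completely disabled
    if value == 1:
        return list(SYSRQ_BITS.values())  # exactly 1 means everything is enabled
    return [name for bit, name in SYSRQ_BITS.items() if value & bit]


def read_sysrq(path="/proc/sys/kernel/sysrq"):
    """Read the current SysRq setting from procfs."""
    with open(path) as f:
        return int(f.read().strip())


if __name__ == "__main__":
    try:
        current = read_sysrq()
    except (OSError, ValueError) as err:
        print(f"could not read SysRq setting: {err}")
    else:
        enabled = describe_sysrq(current)
        print(f"kernel.sysrq = {current}")
        for name in enabled:
            print(f"  enabled: {name}")
        if not enabled:
            print("  SysRq is disabled")
```

If the sync, remount and reboot groups are enabled and Alt-SysRq-s, Alt-SysRq-u, Alt-SysRq-b still do nothing during a hang, that supports the hard-lockup theory.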

Your memory check test is good, if you let it run long enough. You should also stress the system in other ways, for example with StressLinux. Long-running benchmarks are good, too.
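If you want something quick before setting up a dedicated stress tool, a rough Python sketch like the one below loads every core and writes the temperatures to a log that is synced to disk after each line, so the last reading should survive a hard hang. It assumes your kernel exposes temperatures under /sys/class/thermal/thermal_zone*/temp in millidegrees Celsius, which may not be true on every machine. The function names and the log file name are just for illustration.

```python
# Sketch: load all CPU cores while logging temperatures to disk.
# Assumption: thermal zones are exposed via sysfs in millidegrees Celsius.
import glob
import multiprocessing
import os
import time


def read_temps_c(pattern="/sys/class/thermal/thermal_zone*/temp"):
    """Return {path: degrees C} for every readable thermal zone."""
    temps = {}
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as f:
                temps[path] = int(f.read().strip()) / 1000.0
        except (OSError, ValueError):
            continue  # zone not readable on this machine; skip it
    return temps


def format_record(timestamp, temps):
    """Format one log line: '<unix time> <zone>=<temp>C ...'."""
    readings = " ".join(
        f"{os.path.basename(os.path.dirname(path))}={temp:.1f}C"
        for path, temp in sorted(temps.items())
    )
    return f"{timestamp:.0f} {readings}".rstrip()


def burn(stop_time):
    """Busy-loop until stop_time to keep one core fully loaded."""
    x = 0
    while time.time() < stop_time:
        x = (x * 31 + 7) % 1000003


def stress_and_log(log_path, seconds=600, interval=5):
    """Run one busy-loop process per core and log temperatures meanwhile."""
    stop_time = time.time() + seconds
    workers = [
        multiprocessing.Process(target=burn, args=(stop_time,))
        for _ in range(os.cpu_count() or 1)
    ]
    for worker in workers:
        worker.start()
    with open(log_path, "a") as log:
        while time.time() < stop_time:
            log.write(format_record(time.time(), read_temps_c()) + "\n")
            log.flush()
            os.fsync(log.fileno())  # force the line to disk before a possible hang
            time.sleep(interval)
    for worker in workers:
        worker.join()
```

Call it as `stress_and_log("stress_temps.log")` from under an `if __name__ == "__main__":` guard. If the machine locks up partway through, the last line of the log shows how hot it was just before the hang.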

Another thing to try is booting the system with an Ubuntu live CD and trying to use the system as normal. If returning to Ubuntu temporarily like that doesn't cause the problem to recur, there's a good chance it's not actually broken hardware, but one of the related things like a bad driver or incorrectly configured kernel. It is quite possible that a more popular distribution like Ubuntu could have a more stable kernel configuration than one like Arch, simply due to the greater number of machines it's been tried on during the distro's test phase.
