Ubuntu – How to diagnose random freezes

11.04 freeze

Ubuntu almost always freezes on my machine within roughly the first 15 minutes after booting. Sometimes it's in the first 5 minutes, sometimes it takes 30 minutes, occasionally it never happens…

I can't reproduce it deterministically, but it happens often enough that I can probably just wait for it to happen again.

How can I diagnose the freeze to figure out the cause?

Note to close-voters:
No, this is not a duplicate of this question. This question is about diagnosis, not temporary recovery. The answers to that question only tell me how to kill the X server, use the Magic Combo to reset the kernel, etc., which doesn't help me figure out the cause.

Some information:

  1. Ubuntu 11.04: 2.6.38-15-generic #66-Ubuntu SMP x86_64 GNU/Linux

  2. The mouse sometimes moves around, but the UI never responds.

  3. Pressing Ctrl+Alt+F1 to get into a terminal doesn't work.

  4. The Alt+SysRq combos do work… and seem to be the only things that work, aside from the mouse (which sometimes also can move around).

  5. I'm not running out of any resources (many gigabytes of RAM and file system space are free)

  6. Possibly relevant hardware (from the Hardware Lister application):

    • AR9285 Wireless Network Adapter (PCI-Express)

    • GT216 [GeForce GT 330M] (I'm using the Nouveau driver, which seems to work well)

Best Answer

The logs should always be your first port of call. Check syslog for anything untoward:

less /var/log/syslog

Also check the X server log in case there's any indication of a graphics driver problem (although that sounds less likely given your description):

less /var/log/Xorg.0.log
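Paging through a whole log by hand can be slow, so a keyword filter may help as a first pass. The following is only a sketch: the function name scan_log is made up for illustration, and the keyword list is a guess at what might be worth a look, not a definitive set of freeze indicators.

```shell
#!/bin/sh
# scan_log FILE: print lines of FILE that look like errors, prefixed with
# their line numbers. The pattern is an assumption; adjust it freely.
# "(EE)" is the marker the X server uses for error lines in its log.
scan_log() {
    grep -n -i -E 'error|fail|oops|panic|lockup|\(EE\)' "$1"
}
```

Usage would be something like scan_log /var/log/syslog | less. After a forced reboot, the X log from the session that froze is normally kept as /var/log/Xorg.0.log.old, and older syslog entries are rotated into /var/log/syslog.1, so those files are worth scanning too.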

In your particular case, these steps might not turn up anything interesting. If so, it would help to see what is happening on your system as the problem develops. To that end, I'd set up a temporary log of top output at short intervals, say every 5 or 10 seconds. This should reveal whether a process is running wild with resources at the time of the issue.

Note that alternatives exist, such as switching to another tty with Ctrl+Alt+F1..F6 (Ctrl+Alt+F7 gets you back to the GUI) and running commands interactively, or setting up an SSH server and logging in remotely. Both might be awkward if your machine is more or less unresponsive, hence my clunkier suggestion to write a logfile (which could hit the same problem, but is more likely to succeed).

It would involve something like this:

while true; do top -b -n 1 >> ~/top.log; sleep 10; done

This would write top output to a logfile at ~/top.log every 10 seconds or so. Note that this log would grow quite large if this command is left running for a prolonged period, so keep an eye on it if your machine suddenly starts behaving itself! And remove the log with rm ~/top.log when you're done with it. Note also that executing the above command is a one-time thing; it won't restart itself after a reboot.
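If the log growing without bound is a worry, one option is to trim it to its last few thousand lines now and then. This is a minimal sketch; trim_log is a hypothetical helper name, and a snapshot written at the exact moment of trimming could be lost.

```shell
#!/bin/sh
# trim_log FILE MAX_LINES: keep only the last MAX_LINES lines of FILE.
# Writes to a temporary file next to FILE, then replaces the original.
trim_log() {
    tail -n "$2" "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}
```

For example, trim_log ~/top.log 5000 would keep roughly the most recent snapshots and discard the rest.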

To read the logs generated after a crash, you'd use

less ~/top.log

and hit the End key to get to the bottom. You'd be looking for processes with an unusually high %CPU value, or an unusually high RES value.
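Since only the last snapshot before the freeze usually matters, it can be convenient to cut it out of the log. The sketch below assumes each snapshot starts with a header line beginning "top - ", which is what the procps version of top prints in batch mode; last_snapshot is a made-up name.

```shell
#!/bin/sh
# last_snapshot FILE: print only the final top snapshot found in FILE.
# Every time a "top - " header line is seen, the buffer is reset, so at
# the end it holds just the lines from the last header onwards.
last_snapshot() {
    awk '/^top - / { buf = "" } { buf = buf $0 "\n" } END { printf "%s", buf }' "$1"
}
```

Running last_snapshot ~/top.log | less then shows just that snapshot. With top's default sort order, the processes using the most CPU appear at the top of the process list.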

It may or may not help, but it's handy information to have.
