CentOS – /proc/sys/kernel/hung_task_timeout_secs error

centos, kernel, linux-kernel, rescue, rpm

I am building a CentOS 6.2 webserver on a Hyper-V Cluster and have hit a major issue. The current setup is this:

Operating System: CentOS 6.2. Minimal Server Install
Installed software includes:

  • Hyper-V Linux integration Drivers
  • Webmin
  • Apache 2
  • MySQL 5.*
  • PHP 5 with the APC accelerator
  • The webserver is running Moodle (in case that makes any difference)

The last thing I installed was the "Hyper-V Linux Integration Drivers", after which, a few days later, the entire OS locked up with "task blocked for more than 120 seconds" errors all the way down the console. I now find that this error rears its ugly head after a prolonged running time, and that it can be triggered by a yum update command, appearing right after the second download.

Picture of terminal

I've tried fsck as suggested on other forums, but to no avail. Most other forums point to hardware issues as the cause, which does not fit here: the Hyper-V cluster is currently running a dozen high-demand services with no issues.

EDIT (I should have mentioned this the first time around): this error is the last thing I see before the whole kernel locks up. After that, the only way to recover is to hard reboot the virtual machine.

Any help solving or diagnosing the problem is much appreciated.

Thank You (in advance)

UPDATE 1:

OK, serious escalation here. I've now got two other Linux machines on the same cluster with related issues, and all of them fall over at the same time. I have an Ubuntu server complaining of a status: {DRDY} error and a second CentOS 6.2 server complaining of the same error as the first, and the first server has crashed in a new way…

Ubuntu DRDY Error

The First CentOS Kernel Crash

UPDATE 2:
OK, we got the cluster fixed, but the problem still persisted with the two CentOS VMs. I have fixed the first problem CentOS VM by moving it to an un-clustered Hyper-V host, blacklisting all the Linux Integration Drivers, and then completely uninstalling them. It appears that the Linux Integration Drivers in the guest machine, together with the issues on the clustered system, cause this problem. The second CentOS machine was not so lucky: even though I moved it to the un-clustered Hyper-V host and blacklisted the drivers, it still crashed, unfortunately right in the middle of uninstalling the Linux Integration Drivers. Now I have the very big issue of the driver package showing in the RPM database while being only half installed.
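
For anyone wanting to reproduce the blacklisting step, a minimal sketch follows. It only prints lines in modprobe.d syntax. The module names are an assumption (common Hyper-V guest module names, not taken from the post above), so check what is actually loaded with lsmod first. The target file name mentioned in the comment is also hypothetical.

```shell
# ASSUMPTION: these module names are typical for Hyper-V guests but are not
# confirmed by the post; verify with `lsmod | grep hv_` on the affected VM.
HV_MODULES="${HV_MODULES:-hv_vmbus hv_storvsc hv_netvsc hv_utils}"

# Print one "blacklist <module>" line per module, in modprobe.d syntax.
hv_blacklist_lines() {
    for mod in $HV_MODULES; do
        printf 'blacklist %s\n' "$mod"
    done
}

# Preview only. As root, the output could be redirected into a file such as
# /etc/modprobe.d/hyperv-blacklist.conf (hypothetical name).
hv_blacklist_lines
```

Printing instead of writing keeps the sketch harmless: the lines can be reviewed before anything under /etc is changed.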

Question: is there a way of completely removing these drivers without using rpm or yum? Both now complain that the package is simultaneously there and not there and fail to remove it, and anything too disk-heavy, such as running rpm or yum, causes the whole system to crash again. At this rate I may end up rebuilding it completely, but I need it to stabilise soon as it hosts critical services.

THE ANSWER

OK, my final troublesome CentOS VM has now been fixed. To remove the botched Linux Integration Drivers and stabilise the system, I did the following:

  1. I downloaded a CentOS Live CD and booted it up in the faulty VM.
  2. I then created a new folder with

    $ mkdir /mnt/OS
    
  3. I then mounted the root filesystem of the faulty OS (which in my case is sda2)

    $ mount /dev/sda2 /mnt/OS
    
  4. I then mounted the boot partition (which in my case is sda1)

    $ mount /dev/sda1 /mnt/OS/boot
    
  5. I then chrooted into the faulty OS

    $ chroot /mnt/OS
    
  6. and then ran the rpm package removal command (ensuring all duplicates were removed in the process)

    $ rpm -e --allmatches kmod-microsoft-hyper-v
    
  7. This process can take a little time, but once it was done I could reboot the VM, and my system was up and running without the integration drivers; it now allows package updates without crashing.
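
The steps above can be collected into one script, to be run as root from the live CD. This is only a sketch, assuming the same layout as described (root filesystem on /dev/sda2, boot partition on /dev/sda1). The DRY_RUN switch, the run helper, and the final umount calls are additions for illustration and were not part of the original steps. With DRY_RUN=1, the default here, the commands are printed rather than executed.

```shell
#!/bin/sh
# Defaults follow the post; override them for a different disk layout.
ROOT_DEV="${ROOT_DEV:-/dev/sda2}"
BOOT_DEV="${BOOT_DEV:-/dev/sda1}"
MNT="${MNT:-/mnt/OS}"
PKG="${PKG:-kmod-microsoft-hyper-v}"
# Hypothetical safety switch: 1 = print the commands, 0 = really run them.
DRY_RUN="${DRY_RUN:-1}"

# Either echo the command (dry run) or execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Mount the broken system, remove every copy of the package from inside a
# chroot, then unmount again. Stops at the first command that fails.
rescue_remove_pkg() {
    run mkdir -p "$MNT" &&
    run mount "$ROOT_DEV" "$MNT" &&
    run mount "$BOOT_DEV" "$MNT/boot" &&
    run chroot "$MNT" rpm -e --allmatches "$PKG" &&
    run umount "$MNT/boot" &&
    run umount "$MNT"
}

rescue_remove_pkg
```

Once the printed commands look right for the VM in question, running the script with DRY_RUN=0 performs the removal.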

After this I think I'll leave the Hyper-V drivers out of the mix until I am comfortable that they work and are stable.

Best Answer

Your kernel has the hung-task detection feature enabled. In most cases this should be fine, since the task is not really blocked. You can just use the command suggested in the message itself (echo 0 > /proc/sys/kernel/hung_task_timeout_secs) to disable it.

I've met a similar problem before. In the kernel, if you use down(semaphore) instead of down_interruptible(semaphore) to wait on a semaphore, and the semaphore is not released (up) within 120 seconds, that INFO message is printed. It is actually fine for a semaphore not to be released within 120 seconds.

If everything works fine apart from this INFO message, just disable it by echoing 0 to /proc/sys/kernel/hung_task_timeout_secs.
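
As a minimal sketch of that, the two helpers below read the current timeout and write 0 to it. The function names and the HUNG_TASK_FILE override are hypothetical, added so the snippet can be tried against a scratch file. On a real system the write needs root and only lasts until the next reboot.

```shell
# Real location of the setting; the override variable is an illustration only.
HUNG_TASK_FILE="${HUNG_TASK_FILE:-/proc/sys/kernel/hung_task_timeout_secs}"

# Print the current timeout in seconds (0 means the check is disabled).
show_hung_task_timeout() {
    cat "$HUNG_TASK_FILE"
}

# Write 0 to switch the "blocked for more than N seconds" message off.
disable_hung_task_warning() {
    echo 0 > "$HUNG_TASK_FILE"
}

# Typical use, as root:
#   show_hung_task_timeout
#   disable_hung_task_warning
# To keep the setting across reboots, the matching sysctl key is
# kernel.hung_task_timeout_secs (for example in /etc/sysctl.conf).
```

Note that this only silences the message. It does nothing about a guest that really locks up, as in the question.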
