Linux – What does “task mysqld:xxx blocked for more than 120 seconds” mean

linux-kernel memory MySQL process

We are troubleshooting a MySQL issue where some queries take a very long time to complete, and I see many entries like these in /var/log/messages:

Jan 28 05:52:15 64455-alpha01 kernel: [2529273.616327] INFO: task mysqld:4123 blocked for more than 120 seconds.
Jan 28 05:52:15 64455-alpha01 kernel: [2529273.616525] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 28 05:52:15 64455-alpha01 kernel: [2529273.616813] mysqld        D  000000000000000d     0  4123   3142 0x00000080

What does it mean, and how does it affect that MySQL thread? (Is 4123 the thread ID?)

The value in /proc/sys/kernel/hung_task_timeout_secs when I checked now is:

$ cat /proc/sys/kernel/hung_task_timeout_secs
120

I would specifically like to know how it affects the process.

I read in a forum that this happens when the process is holding too much memory.

Best Answer

echo 0 > /proc/sys/kernel/hung_task_timeout_secs only silences the warning; it has no other effect whatsoever. Any value above zero causes the message to be issued whenever a task stays blocked for that many seconds.
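To see which tasks are being reported and against which threshold, the warnings can be pulled out of the log. The following is a minimal sketch, assuming the message layout shown in the question; the pattern and the function name parse_hung_task are my own. As far as I know, the number after the colon is the kernel's task ID, which for a multithreaded mysqld identifies one thread rather than the whole process.

```python
import re

# Assumed layout, taken from the log excerpt in the question:
#   INFO: task <name>:<id> blocked for more than <N> seconds.
HUNG_TASK_RE = re.compile(
    r"INFO: task (?P<comm>.+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)

def parse_hung_task(line):
    """Return (name, task_id, seconds) for a hung-task warning, else None."""
    m = HUNG_TASK_RE.search(line)
    if m is None:
        return None
    return m.group("comm"), int(m.group("pid")), int(m.group("secs"))

sample = ("Jan 28 05:52:15 64455-alpha01 kernel: [2529273.616327] "
          "INFO: task mysqld:4123 blocked for more than 120 seconds.")
print(parse_hung_task(sample))  # ('mysqld', 4123, 120)
```

Counting the results per task name over a day of /var/log/messages gives a rough idea of whether only mysqld is affected or everything that touches the disk.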

The warning indicates a problem with the system. In my experience it means that the process has been blocked in kernel space for at least 120 seconds, usually because it is starved of disk I/O. One cause is heavy swapping when too much memory is in use, e.g. a heavily loaded web server with more Apache child processes configured than the system can handle. In your case it may simply be that too many mysql processes are competing for memory and disk I/O.

It can also happen if the underlying storage system is not performing well, e.g. if you have a SAN which is overloaded, or if there are soft errors on a disk which cause a lot of retries. Whenever a task has to wait a long time for its I/O commands to complete, this warning may be issued.
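The D in the third log line of the question is the task state, uninterruptible sleep, which is what a task waiting on I/O usually shows. A rough way to check which tasks are in that state right now is to read the per-thread stat files under /proc. This is only a sketch under the assumption of a Linux /proc layout; parse_stat and uninterruptible_tasks are names I made up.

```python
import glob

def parse_stat(stat_text):
    """Return (task_id, name, state) from the contents of a /proc stat file."""
    # The name sits in parentheses and may itself contain spaces or
    # parentheses, so split on the first "(" and the last ")".
    left = stat_text.index("(")
    right = stat_text.rindex(")")
    task_id = int(stat_text[:left])
    name = stat_text[left + 1:right]
    state = stat_text[right + 1:].split()[0]
    return task_id, name, state

def uninterruptible_tasks():
    """List (task_id, name) for every thread currently in state D."""
    found = []
    for path in glob.glob("/proc/[0-9]*/task/[0-9]*/stat"):
        try:
            with open(path) as f:
                task_id, name, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # the task exited while we were scanning, or unreadable
        if state == "D":
            found.append((task_id, name))
    return found

if __name__ == "__main__":
    for task_id, name in uninterruptible_tasks():
        print(task_id, name)
```

A task that shows up here once is not a problem by itself; short stays in D are normal. It becomes interesting when the same task IDs stay in the list across repeated runs.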

Related Solutions

linux memory – Why Linux Shows More and Less Memory Than Installed

You should read the dmesg values "Memory Akb/Bkb available" as:

There is A available for use right now, and the system's highest page frame number multiplied by the page size is B.

This is from arch/x86/mm/init_64.c:

printk(KERN_INFO "Memory: %luk/%luk available (%ldk kernel code, "
                 "%ldk absent, %ldk reserved, %ldk data, %ldk init)\n",
                 nr_free_pages() << (PAGE_SHIFT-10),
                 max_pfn << (PAGE_SHIFT-10),
                 codesize >> 10,
                 absent_pages << (PAGE_SHIFT-10),
                 reservedpages << (PAGE_SHIFT-10),
                 datasize >> 10,
                 initsize >> 10);

nr_free_pages() returns the amount of physical memory, managed by the kernel, that is not currently in use. max_pfn is the highest page frame number (the PAGE_SHIFT shift converts that to kB). The highest page frame number can be (much) higher than you might expect, because the memory mapping done by the BIOS can contain holes.
How much these holes take up is tracked by the absent_pages variable, displayed as kB absent. This should explain most of the difference between the second number in the "available" output and your actual, installed RAM.
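The shifts in that printk are just unit conversions, which a small worked example may make clearer. The sketch below assumes PAGE_SHIFT is 12 (4 kB pages, the usual x86_64 value); the function names are mine, not the kernel's.

```python
PAGE_SHIFT = 12  # assumption: 4 kB pages, i.e. a page is 2**12 bytes

def pages_to_kb(pages, page_shift=PAGE_SHIFT):
    """Page count to kB: multiply by page size, divide by 1024, in one shift."""
    return pages << (page_shift - 10)

def bytes_to_kb(nbytes):
    """Byte count to kB, as done for codesize, datasize and initsize."""
    return nbytes >> 10

print(pages_to_kb(1))        # 4        (one page is 4 kB)
print(pages_to_kb(262144))   # 1048576  (2**18 pages is 1 GB)
print(bytes_to_kb(2048))     # 2
```

So a max_pfn of 262144 is printed as 1048576k regardless of how many of those page frames are actually backed by RAM.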

You can grep for BIOS-e820 in dmesg to "see" these holes. The memory map is displayed there (right at the top of dmesg output after boot). You should be able to see at what physical addresses you have real, usable RAM.
(Other x86 quirks and reserved memory areas probably account for the rest - I don't know the details there.)
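Adding up the usable ranges from that map can be scripted. This is a sketch only, assuming the two BIOS-e820 line layouts I have seen in dmesg output (an older one with an exclusive end address and a newer "[mem ...]" one with an inclusive end address); the sample lines are made up for illustration and do not come from a real machine.

```python
import re

# Older layout (assumed):  BIOS-e820: <start> - <end> (type), end exclusive
OLD_FMT = re.compile(r"BIOS-e820:\s+([0-9a-f]+)\s+-\s+([0-9a-f]+)\s+\((.+)\)")
# Newer layout (assumed):  BIOS-e820: [mem 0x<start>-0x<end>] type, end inclusive
NEW_FMT = re.compile(r"BIOS-e820:\s+\[mem 0x([0-9a-f]+)-0x([0-9a-f]+)\]\s+(.+)")

def parse_e820(line):
    """Return (start, end_exclusive, type) for a BIOS-e820 line, else None."""
    m = OLD_FMT.search(line)
    if m:
        return int(m.group(1), 16), int(m.group(2), 16), m.group(3).strip()
    m = NEW_FMT.search(line)
    if m:
        return int(m.group(1), 16), int(m.group(2), 16) + 1, m.group(3).strip()
    return None

def usable_kb(lines):
    """Sum the sizes of all ranges marked usable, in kB."""
    total = 0
    for line in lines:
        entry = parse_e820(line)
        if entry and entry[2] == "usable":
            total += entry[1] - entry[0]
    return total >> 10

# Illustrative input only, not real dmesg output.
sample = [
    " BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)",
    " BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved)",
    " BIOS-e820: 0000000000100000 - 0000000040000000 (usable)",
]
print(usable_kb(sample))  # 1048191
```

On a real system the lines would come from the dmesg output instead of the sample list.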

MemTotal in /proc/meminfo indicates RAM available for use. Right at the end of the boot sequence, the kernel frees init data it doesn't need any more, so the value reported in /proc/meminfo could be a bit higher than what the kernel prints out during the initial parts of the boot sequence.
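For comparing against the boot-time numbers, MemTotal can be read programmatically. A minimal sketch, assuming the usual "MemTotal: <number> kB" line in /proc/meminfo; mem_total_kb is a hypothetical helper and the sample text is illustrative.

```python
def mem_total_kb(meminfo_text):
    """Return the MemTotal value in kB from /proc/meminfo text, or None."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1])
    return None

# Illustrative input only; on Linux, pass open("/proc/meminfo").read() instead.
sample_meminfo = "MemTotal:       16331712 kB\nMemFree:         1234567 kB\n"
print(mem_total_kb(sample_meminfo))  # 16331712
```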

(meminfo indirectly uses totalram_pages for that display. For x86_64, this is also calculated in arch/x86/mm/init_64.c, via free_all_bootmem(), which itself lives in mm/bootmem.c for non-NUMA kernels.)

Linux – What does “INFO: task XXX blocked for more than 120 seconds” exactly mean on Linux

A blocked task is waiting for a resource to become available again.

In your case there was probably either an I/O problem or contention for the disk. Or your system load was so high that there was not enough CPU time available to finish the job in time.

I have seen this error from cron when it tries to start a job at a very busy time.
