I'm running a machine with a GPU that sometimes causes the machine to freeze. When
I look at the syslog file, it says that the kernel is hung:
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
I would like to create a script that detects activity in the kernel so that, when it hangs,
the machine reboots automatically. However, when I run a bash script that watches the syslog file for a certain keyword, such as "kernel", the script itself stops running once the kernel freezes, so it never gets the opportunity to execute the reboot command.
Is there a way to keep track of kernel activity so that the machine reboots automatically when the kernel freezes, like the automatic reboot after a kernel panic?
regards
Best Answer
Most machines have a /dev/watchdog device provided by a kernel driver for some built-in hardware. The user-space API is fairly simple, and there is now also a wdctl command to get information about the hardware features of the device. There is also a systemd configuration option, RuntimeWatchdogSec, to set it at boot.

The generic watchdog operation is that the watchdog hardware is configured with an action and a set time delay (some hardware has a fixed configuration), it is started, and it has to be tickled repeatedly within that delay or it will cause the action, often a reset. Sometimes the watchdog is cleared when the device is closed, but often this is not desirable, so the watchdog can be configured to keep timing and triggering no matter what. On reboot, the cause of the reset might be available from the device or some other hardware, so that we can see the watchdog was the cause.
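
To illustrate the open, tickle, close cycle described above, here is a minimal shell sketch. It is not a tested setup for any particular board. The names WATCHDOG_DEV and FEED_INTERVAL and the watchdog_* functions are made up for this example, and it assumes the driver starts timing when the device is opened and treats any write as a tickle, which is the basic mode most drivers offer. Pick a feed interval shorter than the hardware timeout that wdctl reports.

```shell
#!/bin/sh
# Hypothetical sketch: keep a hardware watchdog fed from user space.
# WATCHDOG_DEV and FEED_INTERVAL are illustrative names, not standard settings.
WATCHDOG_DEV="${WATCHDOG_DEV:-/dev/watchdog}"
FEED_INTERVAL="${FEED_INTERVAL:-10}"   # seconds; keep below the hardware timeout

# Opening the device starts the timer on most drivers; hold it open on fd 3.
watchdog_open() {
    exec 3> "$WATCHDOG_DEV"
}

# Any write to the open device counts as a tickle and restarts the countdown.
watchdog_feed() {
    printf '.' >&3
}

# Writing 'V' right before closing asks drivers that support "magic close"
# to stop the timer. Drivers configured to never stop will ignore this.
watchdog_close() {
    printf 'V' >&3
    exec 3>&-
}

# Feed forever. If the kernel hangs, this loop stalls as well, the feeds
# stop, and the hardware performs its action once the delay runs out.
watchdog_run() {
    watchdog_open
    trap 'watchdog_close; exit 0' INT TERM
    while :; do
        watchdog_feed
        sleep "$FEED_INTERVAL"
    done
}

# Start feeding only when asked to, so the functions can be sourced safely.
if [ "${1:-}" = "run" ]; then
    watchdog_run
fi
```

Saved as, say, feed-watchdog.sh (a hypothetical name), it would be started as root with "sh feed-watchdog.sh run". Be careful when trying it: if the script is killed without the 'V' being written, or the driver does not honour it, the machine will reset when the delay expires. The device can usually be held open by only one process, so this would likely conflict with systemd when RuntimeWatchdogSec is set. In that case systemd does the tickling itself and no script is needed.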