Debian – Watchdog Daemon Unable to Reset Hardware Watchdog Timer on Supermicro X9DR3-F


I have a Supermicro X9DR3-F motherboard on which pins 1 and 2 of the JWD jumper are shorted and the watchdog function is enabled in UEFI:
[Screenshot: Supermicro UEFI]

This means that the system is reset after around 5 minutes if nothing resets the hardware watchdog timer. I installed the watchdog daemon and configured it to use the iTCO_wdt driver:

$ cat /etc/default/watchdog 
# Start watchdog at boot time? 0 or 1
run_watchdog=1
# Start wd_keepalive after stopping watchdog? 0 or 1
run_wd_keepalive=1
# Load module before starting watchdog
watchdog_module="iTCO_wdt"
# Specify additional watchdog options here (see manpage).
$ 

When the watchdog daemon starts, the driver loads without issues:

$ sudo dmesg | grep iTCO_wdt
[   17.435620] iTCO_wdt: Intel TCO WatchDog Timer Driver v1.11
[   17.435667] iTCO_wdt: Found a Patsburg TCO device (Version=2, TCOBASE=0x0460)
[   17.435761] iTCO_wdt: initialized. heartbeat=30 sec (nowayout=0)
$ 

Also, the /dev/watchdog file is present:

$ ls -l /dev/watchdog
crw------- 1 root root 10, 130 Dec  8 22:36 /dev/watchdog
$ 

The watchdog-device option in the watchdog daemon configuration points to this file:

$ grep -v ^# /etc/watchdog.conf 



watchdog-device    = /dev/watchdog
watchdog-timeout   = 60


interval           = 5
log-dir            = /var/log/watchdog
verbose            = yes
realtime           = yes
priority           = 1

heartbeat-file     = /var/log/watchdog/heartbeat
heartbeat-stamps   = 1000
$ 
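A quick sanity check on this configuration is that the daemon's write interval is shorter than the timeout it requests from the driver. The sketch below assumes the simple "key = value" layout shown above; the function names conf_value and check_interval are made up for illustration and are not part of the watchdog package.

```shell
# Print the value of the last uncommented "key = value" line.
# $1 = key, $2 = config file
conf_value() {
    awk -v key="$1" '
        /^[ \t]*#/ { next }                      # skip comment lines
        {
            pos = index($0, "=")
            if (!pos) next                       # not a key = value line
            k = substr($0, 1, pos - 1)
            v = substr($0, pos + 1)
            gsub(/[ \t]/, "", k)                 # strip padding around the key
            gsub(/^[ \t]+|[ \t]+$/, "", v)       # trim the value
            if (k == key) val = v
        }
        END { if (val != "") print val }' "$2"
}

# Succeed only if both options are set and interval < watchdog-timeout.
# $1 = config file
check_interval() {
    interval=$(conf_value interval "$1")
    timeout=$(conf_value watchdog-timeout "$1")
    [ -n "$interval" ] && [ -n "$timeout" ] && [ "$interval" -lt "$timeout" ]
}
```

For example, `check_interval /etc/watchdog.conf && echo "interval is below timeout"`. With the values above (5 and 60) the check passes, so the configuration itself is not what lets the timer expire.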

To debug the writes to the watchdog device, I enabled the heartbeat-file option, and it looks like keepalive messages are being sent to /dev/watchdog:

$ tail /var/log/watchdog/heartbeat
 1575830728
 1575830728
 1575830728
 1575830733
 1575830733
 1575830733
 1575830733
 1575830733
 1575830733
 1575830733
$ 
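To check the spacing of these timestamps without reading them by eye, the gaps between consecutive distinct values can be printed. This is a minimal sketch that assumes the file holds one epoch timestamp per line, as in the output above; heartbeat_gaps is a hypothetical helper name. Note that the heartbeat file only records that the daemon wrote to the device, not that any particular hardware timer was actually reset.

```shell
# Print the gap in seconds between consecutive distinct timestamps
# in a heartbeat file (one epoch value per line, leading spaces allowed).
# $1 = heartbeat file
heartbeat_gaps() {
    awk 'NF {
             t = $1 + 0
             if (have && t != prev) print t - prev   # a new timestamp: report the gap
             prev = t
             have = 1
         }' "$1"
}
```

Running `heartbeat_gaps /var/log/watchdog/heartbeat | sort -n | uniq -c` would then summarize how often each gap occurs; gaps close to the configured interval indicate the daemon is writing on schedule.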

However, despite this, the server resets itself roughly every five minutes.

My next thought was that perhaps the iTCO_wdt driver controls the watchdog in the C606 chipset, while the watchdog that resets the server is part of IPMI. So I made sure that the iTCO_wdt driver was not loaded during boot and rebooted the server. As expected, /dev/watchdog was no longer present. I then loaded the ipmi_watchdog module:

$ ls -l /dev/watchdog
ls: cannot access '/dev/watchdog': No such file or directory
$ sudo modprobe ipmi_watchdog
$ sudo dmesg -T | tail -1
[Tue Dec 10 21:12:48 2019] IPMI Watchdog: driver initialized
$ ls -l /dev/watchdog
crw------- 1 root root 10, 130 Dec 10 21:12 /dev/watchdog
$ 

... and finally started the watchdog daemon, which, judging by the /var/log/watchdog/heartbeat file, writes to /dev/watchdog at a 5-second interval. This can also be confirmed with strace:

$ ps -p 2296 -f
UID        PID  PPID  C STIME TTY          TIME CMD
root      2296     1  0 01:28 ?        00:00:00 /usr/sbin/watchdog
$ sudo strace -y -p 2296
strace: Process 2296 attached
restart_syscall(<... resuming interrupted nanosleep ...>) = 0
write(1</dev/watchdog>, "\0", 1)        = 1
write(1</dev/watchdog>, "\0", 1)        = 1
open("/proc/uptime", O_RDONLY)          = 2</proc/uptime>
close(2</proc/uptime>)                  = 0
write(1</dev/watchdog>, "\0", 1)        = 1
write(1</dev/watchdog>, "\0", 1)        = 1
write(1</dev/watchdog>, "\0", 1)        = 1
write(1</dev/watchdog>, "\0", 1)        = 1
write(1</dev/watchdog>, "\0", 1)        = 1
nanosleep({5, 0}, NULL)                 = 0
write(1</dev/watchdog>, "\0", 1)        = 1
write(1</dev/watchdog>, "\0", 1)        = 1
open("/proc/uptime", O_RDONLY)          = 2</proc/uptime>
close(2</proc/uptime>)                  = 0
write(1</dev/watchdog>, "\0", 1)        = 1
write(1</dev/watchdog>, "\0", 1)        = 1
write(1</dev/watchdog>, "\0", 1)        = 1
write(1</dev/watchdog>, "\0", 1)        = 1
write(1</dev/watchdog>, "\0", 1)        = 1
nanosleep({5, 0}, NULL)                 = 0
write(1</dev/watchdog>, "\0", 1)        = 1
write(1</dev/watchdog>, "\0", 1)        = 1
open("/proc/uptime", O_RDONLY)          = 2</proc/uptime>
close(2</proc/uptime>)                  = 0
write(1</dev/watchdog>, "\0", 1)        = 1
write(1</dev/watchdog>, "\0", 1)        = 1
write(1</dev/watchdog>, "\0", 1)        = 1
write(1</dev/watchdog>, "\0", 1)        = 1
write(1</dev/watchdog>, "\0", 1)        = 1
nanosleep({5, 0}, ^Cstrace: Process 2296 detached
 <detached ...>
$

The watchdog daemon above (PID 2296) was started with the heartbeat-file option in /etc/watchdog.conf commented out, to reduce the number of write system calls in the strace output.

However, the server still reboots roughly every 300 seconds.

Why isn't the watchdog daemon able to reset the hardware watchdog timer on the Supermicro X9DR3-F motherboard?

Best Answer

The watchdog daemon was unable to reset the hardware watchdog timer on the Supermicro X9DR3-F motherboard because the watchdog setting in UEFI controls a third watchdog, located on the Winbond Super I/O 83527 chip. In other words, iTCO_wdt and ipmi_watchdog were the wrong drivers for that watchdog chip.
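Since a board can expose several independent watchdogs, it helps to see which driver backs each registered device before pointing the daemon at one. The following is a rough sketch, assuming the kernel exposes watchdog attributes under /sys/class/watchdog (this depends on the kernel being built with watchdog sysfs support); list_watchdogs is a hypothetical helper, and the optional argument exists only so the function can be pointed at a different directory.

```shell
# Print "<device>: <identity>" for every registered watchdog device.
# $1 = sysfs directory to scan (optional, defaults to /sys/class/watchdog)
list_watchdogs() {
    root=${1:-/sys/class/watchdog}
    for dev in "$root"/watchdog*; do
        [ -d "$dev" ] || continue                # no match, or not a device directory
        name=${dev##*/}
        identity=unknown                         # fallback when the attribute is absent
        [ -r "$dev/identity" ] && identity=$(cat "$dev/identity")
        printf '%s: %s\n' "$name" "$identity"
    done
}
```

If the identity shown for the device in use is the TCO or IPMI watchdog while the timer armed by the firmware lives on the Super I/O chip, keepalive writes will succeed yet never reach the timer that resets the machine, which matches the behaviour described in the question.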
