I have a Supermicro X9DR3-F motherboard where pins 1 and 2 of the JWD jumper are shorted and the watchdog functionality in UEFI is enabled.
This means that the system is reset after around 5 minutes if nothing resets the hardware watchdog timer. I installed the watchdog daemon and configured it to use the iTCO_wdt driver:
$ cat /etc/default/watchdog
# Start watchdog at boot time? 0 or 1
run_watchdog=1
# Start wd_keepalive after stopping watchdog? 0 or 1
run_wd_keepalive=1
# Load module before starting watchdog
watchdog_module="iTCO_wdt"
# Specify additional watchdog options here (see manpage).
$
When the watchdog daemon is started, the driver is loaded without issues:
$ sudo dmesg | grep iTCO_wdt
[ 17.435620] iTCO_wdt: Intel TCO WatchDog Timer Driver v1.11
[ 17.435667] iTCO_wdt: Found a Patsburg TCO device (Version=2, TCOBASE=0x0460)
[ 17.435761] iTCO_wdt: initialized. heartbeat=30 sec (nowayout=0)
$
Also, the /dev/watchdog file is present:
$ ls -l /dev/watchdog
crw------- 1 root root 10, 130 Dec 8 22:36 /dev/watchdog
$
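The device node looks the same whichever driver registered it, so its presence alone does not say which hardware timer it controls. A minimal sketch for checking this, assuming a kernel built with CONFIG_WATCHDOG_SYSFS and a driver that registers through the kernel watchdog core (not every driver does); list_watchdogs is a hypothetical helper name:

```shell
# Hypothetical helper: print "name: identity" for every watchdog
# registered under the given sysfs class directory.
list_watchdogs() {
    dir=${1:-/sys/class/watchdog}
    for wd in "$dir"/watchdog*; do
        # Skip entries without a readable identity attribute.
        [ -r "$wd/identity" ] || continue
        printf '%s: %s\n' "$(basename "$wd")" "$(cat "$wd/identity")"
    done
}

# Example: list_watchdogs
```

If nothing is printed, either sysfs support is missing or the loaded driver does not use the watchdog core, and the driver has to be identified from dmesg instead.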
The watchdog-device option in the watchdog daemon configuration points to this file:
$ grep -v ^# /etc/watchdog.conf
watchdog-device = /dev/watchdog
watchdog-timeout = 60
interval = 5
log-dir = /var/log/watchdog
verbose = yes
realtime = yes
priority = 1
heartbeat-file = /var/log/watchdog/heartbeat
heartbeat-stamps = 1000
$
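For the keepalives to be of any use, the daemon has to write to the device more often than the configured timeout; here interval = 5 against watchdog-timeout = 60 leaves a wide margin. A rough sketch of that sanity check, assuming both keys are set explicitly in "key = value" form as in the output above; conf_get and keepalive_margin_ok are hypothetical helper names:

```shell
# Hypothetical helper: print the value of KEY from a "key = value" file.
conf_get() {
    awk -F'[ \t]*=[ \t]*' -v key="$1" \
        '$1 == key { print $2; exit }' "$2"
}

# Hypothetical check: succeed only if interval < watchdog-timeout.
keepalive_margin_ok() {
    interval=$(conf_get interval "$1")
    timeout=$(conf_get watchdog-timeout "$1")
    [ -n "$interval" ] && [ -n "$timeout" ] && [ "$interval" -lt "$timeout" ]
}

# Example: keepalive_margin_ok /etc/watchdog.conf && echo "margin ok"
```

This only validates the daemon's own configuration; it says nothing about whether the device being written to is the timer that resets the board.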
In order to debug the writes to the watchdog device, I enabled the heartbeat-file option, and it looks like the keepalive messages to /dev/watchdog are being sent:
$ tail /var/log/watchdog/heartbeat
1575830728
1575830728
1575830728
1575830733
1575830733
1575830733
1575830733
1575830733
1575830733
1575830733
$
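Rather than eyeballing the tail of the file, the largest gap between consecutive heartbeat entries can be computed directly. A minimal sketch, assuming the file holds one epoch timestamp per line as in the output above; max_heartbeat_gap is a hypothetical helper name:

```shell
# Hypothetical helper: print the largest gap (in seconds) between
# consecutive timestamps in a heartbeat file (one epoch value per line).
max_heartbeat_gap() {
    awk 'NR > 1 && $1 - prev > max { max = $1 - prev }
         { prev = $1 }
         END { print max + 0 }' "$1"
}

# Example: max_heartbeat_gap /var/log/watchdog/heartbeat
```

A result well below the timeout suggests the daemon is writing on schedule, which points the investigation away from the daemon and towards the device it writes to.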
However, despite this, the server resets itself at roughly five-minute intervals.
My next thought was that maybe the iTCO_wdt driver controls the watchdog in the C606 chipset and the watchdog resetting the server is instead part of IPMI. So I made sure that the iTCO_wdt driver is not loaded during boot and rebooted the server. Sure enough, /dev/watchdog was no longer present. I then loaded the ipmi_watchdog module:
$ ls -l /dev/watchdog
ls: cannot access '/dev/watchdog': No such file or directory
$ sudo modprobe ipmi_watchdog
$ sudo dmesg -T | tail -1
[Tue Dec 10 21:12:48 2019] IPMI Watchdog: driver initialized
$ ls -l /dev/watchdog
crw------- 1 root root 10, 130 Dec 10 21:12 /dev/watchdog
$
... and finally started the watchdog daemon, which, judging by the /var/log/watchdog/heartbeat file, is writing to /dev/watchdog at 5 s intervals. In addition, one can confirm this with strace:
$ ps -p 2296 -f
UID PID PPID C STIME TTY TIME CMD
root 2296 1 0 01:28 ? 00:00:00 /usr/sbin/watchdog
$ sudo strace -y -p 2296
strace: Process 2296 attached
restart_syscall(<... resuming interrupted nanosleep ...>) = 0
write(1</dev/watchdog>, "\0", 1) = 1
write(1</dev/watchdog>, "\0", 1) = 1
open("/proc/uptime", O_RDONLY) = 2</proc/uptime>
close(2</proc/uptime>) = 0
write(1</dev/watchdog>, "\0", 1) = 1
write(1</dev/watchdog>, "\0", 1) = 1
write(1</dev/watchdog>, "\0", 1) = 1
write(1</dev/watchdog>, "\0", 1) = 1
write(1</dev/watchdog>, "\0", 1) = 1
nanosleep({5, 0}, NULL) = 0
write(1</dev/watchdog>, "\0", 1) = 1
write(1</dev/watchdog>, "\0", 1) = 1
open("/proc/uptime", O_RDONLY) = 2</proc/uptime>
close(2</proc/uptime>) = 0
write(1</dev/watchdog>, "\0", 1) = 1
write(1</dev/watchdog>, "\0", 1) = 1
write(1</dev/watchdog>, "\0", 1) = 1
write(1</dev/watchdog>, "\0", 1) = 1
write(1</dev/watchdog>, "\0", 1) = 1
nanosleep({5, 0}, NULL) = 0
write(1</dev/watchdog>, "\0", 1) = 1
write(1</dev/watchdog>, "\0", 1) = 1
open("/proc/uptime", O_RDONLY) = 2</proc/uptime>
close(2</proc/uptime>) = 0
write(1</dev/watchdog>, "\0", 1) = 1
write(1</dev/watchdog>, "\0", 1) = 1
write(1</dev/watchdog>, "\0", 1) = 1
write(1</dev/watchdog>, "\0", 1) = 1
write(1</dev/watchdog>, "\0", 1) = 1
nanosleep({5, 0}, ^Cstrace: Process 2296 detached
<detached ...>
$
The watchdog daemon above with PID 2296 was started with the heartbeat-file option in /etc/watchdog.conf commented out, in order to reduce the number of write system calls in the strace output.
However, the server still reboots at roughly 300 s intervals.
Why isn't the watchdog daemon able to reset the hardware watchdog timer on the Supermicro X9DR3-F motherboard?
Best Answer
The reason the watchdog daemon was not able to reset the hardware watchdog timer on the Supermicro X9DR3-F motherboard is that the watchdog functionality in UEFI controls a third watchdog, located on the Winbond Super I/O 83527 chip. In other words, iTCO_wdt and ipmi_watchdog were the wrong drivers for that watchdog chip.
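When the loaded driver turns out not to match the timer that is actually armed, it can help to see which other watchdog drivers the running kernel ships. A rough sketch, assuming the distribution keeps them under kernel/drivers/watchdog in the module tree (ipmi_watchdog, for one, lives elsewhere); list_watchdog_modules is a hypothetical helper name:

```shell
# Hypothetical helper: list watchdog driver module names found under a
# kernel module directory (defaults to the running kernel's tree).
list_watchdog_modules() {
    dir=${1:-/lib/modules/$(uname -r)/kernel/drivers/watchdog}
    find "$dir" -type f -name '*.ko*' 2>/dev/null |
        sed -e 's|.*/||' -e 's|\.ko.*$||' |
        sort
}

# Example: list_watchdog_modules
```

Each candidate can then be inspected with modinfo before pointing watchdog_module in /etc/default/watchdog at it.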