Debugging Unexpected System Shutdown

Tags: physical-environment, ubuntu

NOTE:

The original post is truncated to be in line with the spirit of a StackExchange post (an answer to a question). However, the journaling I was doing is still valuable, as I describe my process:

I've archived the last "Journal State" of this post on my blog here: https://erotemic.wordpress.com/2021/10/01/debugging-unexpected-system-shutdown-initial-archive/. In the future I will post updates to the blog, and I will edit this SuperUser question to contain the core questions and descriptions of the experiments that resolved them.

In the meantime I've removed all but my latest update. I imagine this entire post will be drastically reorganized.


The Symptom

My machine is experiencing a hard shutdown when running a custom PyTorch script.

I've taken three videos demonstrating the issue:

https://www.youtube.com/watch?v=Ue4XHcusqto

https://www.youtube.com/watch?v=LPwaI1SRlXk

https://www.youtube.com/watch?v=yQ7i-8Kp6xg


Summary of Debugging Steps and Results

  • Measured wattage as the shutdown occurred: well within limits, so the likelihood of a wattage problem is drastically lowered.
  • Measured thermals as the shutdown occurred: CPU / GPU well within limits with no serious anomalies, so the likelihood of a thermal problem is drastically lowered.
  • Ran MemTest86+ overnight: all tests pass, so bad RAM is effectively ruled out.
  • Swapped the 1600W PSU for a 1000W PSU of the same model line: the shutdown still occurred, so a bad PSU is effectively ruled out.
  • Ran with only the 1080 Ti, in PCIe slot #1 and then #3: the shutdown still occurs in both cases, so the likelihood of the 3090 being bad is drastically lowered.
  • Ran with only the 3090 connected, in PCIe slot #1: the shutdown still occurs, so the likelihood of the 1080 Ti being bad is drastically lowered.
  • Ran different ML scripts: the shutdown did not occur, so the probability that my custom ML script contains the trigger is increased.
  • Ran large stress tests on the CPU / GPU (see the sketch after this list): the poweroff only seems to occur with my custom ML script.
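
For reference, this is roughly the kind of stress test I mean, not necessarily the exact runs; it assumes stress-ng is installed and that gpu-burn (a third-party CUDA burner) has been built locally, and the path to it is a placeholder:

# CPU stress: load every core for an hour
stress-ng --cpu "$(nproc)" --metrics-brief --timeout 3600s

# GPU stress: run a compute burner on both cards for an hour
# (gpu-burn is a third-party tool; the path assumes it was built locally)
cd ~/code/gpu-burn && ./gpu_burn 3600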

Culprits effectively ruled out

  • Thermals
  • Wattage
  • PSU
  • GPU
  • RAM

Potential Culprits and TODO:

  • Bisect the custom ML script to find an MWE that causes the shutdown (see the logging sketch after this list)
  • Motherboard issue?
  • CPU issue?
  • Storage issue? (unlikely)
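
For the bisection item above, one approach is to wrap the training runs so that a timestamped marker records which configuration was active; after a hard poweroff, the last marker in the file points at the stage that was running. A rough sketch only: train.py and its flags are placeholders for whatever my actual script can toggle.

# run progressively larger subsets of the training pipeline and record
# which one was active, so the culprit stage survives in the log
# (the flags below are placeholders, not real options of my script)
LOG=/var/tmp/bisect_log.txt
for cfg in "--no-augment --no-transformer" "--no-transformer" ""; do
    echo "$(date --iso-8601=seconds) START config: '$cfg'" >> "$LOG"; sync
    python train.py $cfg
    echo "$(date --iso-8601=seconds) DONE  config: '$cfg'" >> "$LOG"; sync
done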

Potential Solution!?

I've updated my blog with more info. The gist is that I found a BIOS setting, ASUS MultiCore Enhancement, set to Auto; setting it to Disabled seems to resolve the issue. I ran my experiments for over 14 hours without a poweroff.

Parts of the Original Post: To Be Reorganized


I'm attempting to debug a recurring unexpected system shutdown, which sometimes occurs when the machine is under load, but I cannot get it to happen reliably. My current hypotheses are:

  • drawing too much power from the wall
  • thermal issues
  • undiscovered hardware issue

The Hardware + Software + Workload

A list of hardware on my machine can be found here: https://pcpartpicker.com/user/erotemic/saved/#view=WKpmD3

The relevant bits are:

  • CPU: Intel i9-11900K with a Noctua NH-D15 air cooler
  • GPU0: RTX 3090 MSI Trio (attached to monitors)
  • GPU1: GTX 1080 Ti (since upgraded to a second RTX 3090 EVGA XC3 Hybrid)
  • PSU: EVGA T2 1600 W 80+ Titanium

I'm running stock Ubuntu 21.04.

There are a few different workloads I will stress the machine with.

  • ethermine – which used both GPUs.
  • BOINC – with climateprediction.net and World Community Grid (set to use 90% of the CPU as long as the machine isn't in use)
  • Custom Machine Learning workflows with PyTorch.

I haven't been using ethermine recently; I've been running my ML workloads.

The Wattage Hypothesis

I've measured the wattage of the system: it consumes roughly 700-800 watts as measured by a P3 Kill A Watt (this includes the monitors and everything else plugged into the surge protector). I'm living in an old American building that was converted into apartments, so I'm not 100% sure about the capacity of the circuit, but assuming everything is up to code (which I'm not convinced it is), the circuit should be able to accommodate 1800 watts. Other electronics in the room are a 10 watt lamp and a 989 watt AC, so this brushes right up against the 1800 watt limit. At first I was convinced this must be the culprit, but one cooler night I started a workload and unplugged the AC, and in the morning the machine had still powered off, so this hypothesis no longer explains the symptoms well.
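
For reference, a back-of-the-envelope check of the circuit budget, assuming a standard US 15 A / 120 V branch circuit (an assumption about my wiring):

# rough circuit budget check (assumed 15 A @ 120 V breaker)
echo "circuit capacity: $((15 * 120)) W"        # 1800 W
echo "worst-case load:  $((800 + 10 + 989)) W"  # PC + lamp + AC = 1799 W
# continuous loads are usually derated to 80% of the breaker rating (~1440 W),
# so the combined load could plausibly trip the breaker -- though a tripped
# breaker would cut power to everything on the circuit, not just the PC.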

Additionally, I thought my cheap "Quirky Pivot Power" surge protector might be an issue, so I've ordered a Tripp Lite ISOBAR6Ultra, which hopefully will be higher quality. It has not arrived yet, and I don't think that is the problem.

The Thermal Hypothesis

I'm currently leaning more towards the thermals being the problem, but when I search the logs I don't see any information related to thermal-related shutdowns.

I've been using psensor to monitor temperatures and dump logs to disk every 300 seconds (so the logged temps might not include the high temperature that caused the shutdown).
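
To narrow that gap in future runs, something like the following can be left running alongside the workload (a minimal sketch; it assumes lm-sensors and the NVIDIA driver's nvidia-smi are installed, and the log path is arbitrary):

# log CPU and GPU temperatures every 5 seconds with timestamps
while true; do
    {
        date --iso-8601=seconds
        sensors | grep -iE 'Package id|Core [0-9]'
        nvidia-smi --query-gpu=index,temperature.gpu,power.draw --format=csv,noheader
    } >> /var/tmp/thermal_log.txt
    sync   # flush so the last sample survives a hard poweroff
    sleep 5
done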

I've graphed the temperatures that were logged around the most recent shutdown, which occurred around 2021-08-18 at 3:00am:

[graph: psensor temperature log around the 2021-08-18 shutdown]

Note that I was intentionally not using the RTX 3090 here in an effort to prevent such issues, but it seems even 1080 Ti runs will trip whatever condition is causing this shutdown.

The CPU logs a max temperature here of 93C, but I've seen it record maximums close to 99C, and the critical temperature as reported by "sensors" is 100C. So, given that the CPU temperatures were increasing right before the shutdown occurred, and the logging interval is every 5 minutes, it could very well be the case that the system hit a critical temperature and shut down before the next log entry.
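
For reference, the thresholds the hardware reports can be read directly (a quick check; zone names and availability vary by platform):

# show the high/critical thresholds lm-sensors knows about
sensors | grep -iE 'high|crit'

# kernel thermal zones: current temperatures and trip points (millidegrees C)
grep -H . /sys/class/thermal/thermal_zone*/temp \
          /sys/class/thermal/thermal_zone*/trip_point_*_temp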

But I'm still unsatisfied with this. First, running journalctl -g 'temperature|critical' -b -2 as suggested in https://unix.stackexchange.com/questions/502226/how-do-you-find-out-if-a-linux-machine-overheated-before-the-previous-boot-and-w gives no indication that the system logged a temperature issue.
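
A few other journal queries worth checking for a hard poweroff like this (standard journalctl options; -b -1 selects the previous boot):

# list recorded boots so the right -b offset can be chosen
journalctl --list-boots

# only error-or-worse messages from the boot that ended in the poweroff
journalctl -b -1 -p err

# kernel messages from that boot, grepped for thermal / machine-check hints
journalctl -b -1 -k -g 'thermal|throttl|mce|machine check'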

The result of journalctl -b -1 is:

Aug 18 02:46:57 toothbrush smartd[1857]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 84 to 71
Aug 18 02:46:57 toothbrush smartd[1857]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 84 to 72
Aug 18 02:46:57 toothbrush smartd[1857]: Device: /dev/sdb [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 20 to 19
Aug 18 02:46:57 toothbrush smartd[1857]: Device: /dev/sdc [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 83 to 74
Aug 18 02:46:57 toothbrush smartd[1857]: Device: /dev/sdc [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 20 to 23
Aug 18 02:46:57 toothbrush smartd[1857]: Device: /dev/sdd [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 67 to 77
Aug 18 02:46:57 toothbrush smartd[1857]: Device: /dev/sdd [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 27 to 23
Aug 18 02:47:00 toothbrush boinc[3170]: 18-Aug-2021 02:47:00 [---] Suspending computation - CPU is busy
Aug 18 02:47:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r344465819 t8087795, 64bit:1), syncing.
Aug 18 02:47:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r398433847 t7744494, 64bit:1), syncing.
Aug 18 02:47:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r349747452 t8371229, 64bit:1), syncing.
Aug 18 02:48:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r390229100 t7980049, 64bit:1), syncing.
Aug 18 02:48:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r352409333 t7226854, 64bit:1), syncing.
Aug 18 02:48:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r508920330 t10538384, 64bit:1), syncing.
Aug 18 02:48:50 toothbrush boinc[3170]: 18-Aug-2021 02:48:50 [---] Resuming computation
Aug 18 02:49:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r261199946 t4894398, 64bit:1), syncing.
Aug 18 02:49:01 toothbrush boinc[3170]: 18-Aug-2021 02:49:01 [---] Suspending computation - CPU is busy
Aug 18 02:49:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r251680223 t6509690, 64bit:1), syncing.
Aug 18 02:49:21 toothbrush boinc[3170]: 18-Aug-2021 02:49:21 [---] Resuming computation
Aug 18 02:49:31 toothbrush boinc[3170]: 18-Aug-2021 02:49:31 [---] Suspending computation - CPU is busy
Aug 18 02:49:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r346528983 t5840449, 64bit:1), syncing.
Aug 18 02:50:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r560923145 t12173867, 64bit:1), syncing.
Aug 18 02:50:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r567474866 t11497897, 64bit:1), syncing.
Aug 18 02:50:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r519892497 t10585216, 64bit:1), syncing.
Aug 18 02:51:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r553040012 t11503711, 64bit:1), syncing.
Aug 18 02:51:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r489967052 t11999909, 64bit:1), syncing.
Aug 18 02:51:31 toothbrush boinc[3170]: 18-Aug-2021 02:51:31 [---] Resuming computation
Aug 18 02:51:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r279491189 t4690385, 64bit:1), syncing.
Aug 18 02:51:41 toothbrush boinc[3170]: 18-Aug-2021 02:51:41 [---] Suspending computation - CPU is busy
Aug 18 02:52:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r233899151 t4847426, 64bit:1), syncing.
Aug 18 02:52:01 toothbrush boinc[3170]: 18-Aug-2021 02:52:01 [---] Resuming computation
Aug 18 02:52:11 toothbrush boinc[3170]: 18-Aug-2021 02:52:11 [---] Suspending computation - CPU is busy
Aug 18 02:52:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r268957755 t5537306, 64bit:1), syncing.
Aug 18 02:52:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r333913668 t7187733, 64bit:1), syncing.
Aug 18 02:53:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r450294755 t8957939, 64bit:1), syncing.
Aug 18 02:53:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r264028304 t5582071, 64bit:1), syncing.
Aug 18 02:53:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r379501357 t8308167, 64bit:1), syncing.
Aug 18 02:54:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r364408338 t9670592, 64bit:1), syncing.
Aug 18 02:54:12 toothbrush boinc[3170]: 18-Aug-2021 02:54:12 [---] Resuming computation
Aug 18 02:54:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r410359086 t6437227, 64bit:1), syncing.
Aug 18 02:54:22 toothbrush boinc[3170]: 18-Aug-2021 02:54:22 [---] Suspending computation - CPU is busy
Aug 18 02:54:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r266936223 t4903133, 64bit:1), syncing.
Aug 18 02:54:42 toothbrush boinc[3170]: 18-Aug-2021 02:54:42 [---] Resuming computation
Aug 18 02:54:52 toothbrush boinc[3170]: 18-Aug-2021 02:54:52 [---] Suspending computation - CPU is busy
Aug 18 02:55:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r258961514 t5642594, 64bit:1), syncing.
Aug 18 02:55:01 toothbrush CRON[313877]: pam_unix(cron:session): session opened for user root by (uid=0)
Aug 18 02:55:01 toothbrush CRON[313878]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Aug 18 02:55:01 toothbrush CRON[313877]: pam_unix(cron:session): session closed for user root
Aug 18 02:55:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r485119089 t10059003, 64bit:1), syncing.
Aug 18 02:55:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r482961424 t9750792, 64bit:1), syncing.
Aug 18 02:56:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r334697691 t7035018, 64bit:1), syncing.
Aug 18 02:56:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r449591310 t9490996, 64bit:1), syncing.
Aug 18 02:56:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r415820654 t10568703, 64bit:1), syncing.
Aug 18 02:56:43 toothbrush boinc[3170]: 18-Aug-2021 02:56:43 [---] Resuming computation
Aug 18 02:56:53 toothbrush boinc[3170]: 18-Aug-2021 02:56:53 [---] Suspending computation - CPU is busy
Aug 18 02:57:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r326675026 t4890602, 64bit:1), syncing.
Aug 18 02:57:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r461180383 t10357149, 64bit:1), syncing.
Aug 18 02:57:23 toothbrush boinc[3170]: 18-Aug-2021 02:57:23 [---] Resuming computation
Aug 18 02:57:33 toothbrush boinc[3170]: 18-Aug-2021 02:57:33 [---] Suspending computation - CPU is busy
Aug 18 02:57:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r311496530 t5584467, 64bit:1), syncing.
Aug 18 02:58:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r345401175 t6977056, 64bit:1), syncing.
Aug 18 02:58:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r413257951 t8468887, 64bit:1), syncing.
Aug 18 02:58:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r430901546 t9350168, 64bit:1), syncing.
Aug 18 02:59:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r316409469 t6532987, 64bit:1), syncing.
Aug 18 02:59:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r496502797 t11915940, 64bit:1), syncing.
Aug 18 02:59:24 toothbrush boinc[3170]: 18-Aug-2021 02:59:24 [---] Resuming computation

The result of cat /var/log/syslog near the shutdown is:

Aug 18 02:52:11 toothbrush boinc[3170]: 18-Aug-2021 02:52:11 [---] Suspending computation - CPU is busy
Aug 18 02:52:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r268957755 t5537306, 64bit:1), syncing.
Aug 18 02:52:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r333913668 t7187733, 64bit:1), syncing.
Aug 18 02:53:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r450294755 t8957939, 64bit:1), syncing.
Aug 18 02:53:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r264028304 t5582071, 64bit:1), syncing.
Aug 18 02:53:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r379501357 t8308167, 64bit:1), syncing.
Aug 18 02:54:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r364408338 t9670592, 64bit:1), syncing.
Aug 18 02:54:12 toothbrush boinc[3170]: 18-Aug-2021 02:54:12 [---] Resuming computation
Aug 18 02:54:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r410359086 t6437227, 64bit:1), syncing.
Aug 18 02:54:22 toothbrush boinc[3170]: 18-Aug-2021 02:54:22 [---] Suspending computation - CPU is busy
Aug 18 02:54:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r266936223 t4903133, 64bit:1), syncing.
Aug 18 02:54:42 toothbrush boinc[3170]: 18-Aug-2021 02:54:42 [---] Resuming computation
Aug 18 02:54:52 toothbrush boinc[3170]: 18-Aug-2021 02:54:52 [---] Suspending computation - CPU is busy
Aug 18 02:55:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r258961514 t5642594, 64bit:1), syncing.
Aug 18 02:55:01 toothbrush CRON[313878]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Aug 18 02:55:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r485119089 t10059003, 64bit:1), syncing.
Aug 18 02:55:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r482961424 t9750792, 64bit:1), syncing.
Aug 18 02:56:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r334697691 t7035018, 64bit:1), syncing.
Aug 18 02:56:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r449591310 t9490996, 64bit:1), syncing.
Aug 18 02:56:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r415820654 t10568703, 64bit:1), syncing.
Aug 18 02:56:43 toothbrush boinc[3170]: 18-Aug-2021 02:56:43 [---] Resuming computation
Aug 18 02:56:53 toothbrush boinc[3170]: 18-Aug-2021 02:56:53 [---] Suspending computation - CPU is busy
Aug 18 02:57:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r326675026 t4890602, 64bit:1), syncing.
Aug 18 02:57:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r461180383 t10357149, 64bit:1), syncing.
Aug 18 02:57:23 toothbrush boinc[3170]: 18-Aug-2021 02:57:23 [---] Resuming computation
Aug 18 02:57:33 toothbrush boinc[3170]: 18-Aug-2021 02:57:33 [---] Suspending computation - CPU is busy
Aug 18 02:57:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r311496530 t5584467, 64bit:1), syncing.
Aug 18 02:58:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r345401175 t6977056, 64bit:1), syncing.
Aug 18 02:58:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r413257951 t8468887, 64bit:1), syncing.
Aug 18 02:58:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r430901546 t9350168, 64bit:1), syncing.
Aug 18 02:59:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r316409469 t6532987, 64bit:1), syncing.
Aug 18 02:59:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r496502797 t11915940, 64bit:1), syncing.
Aug 18 02:59:24 toothbrush boinc[3170]: 18-Aug-2021 02:59:24 [---] Resuming computation
Aug 18 09:25:52 toothbrush systemd-modules-load[472]: Inserted module 'lp'
Aug 18 09:25:52 toothbrush systemd-modules-load[472]: Inserted module 'ppdev'
Aug 18 09:25:52 toothbrush systemd-modules-load[472]: Inserted module 'parport_pc'
Aug 18 09:25:52 toothbrush systemd-modules-load[472]: Inserted module 'msr'
Aug 18 09:25:52 toothbrush kernel: [    0.000000] microcode: microcode updated early to revision 0x40, date = 2021-04-11
Aug 18 09:25:52 toothbrush lvm[461]:   2 logical volume(s) in volume group "vgubuntu" monitored
Aug 18 09:25:52 toothbrush kernel: [    0.000000] Linux version 5.11.0-25-generic (buildd@lgw01-amd64-044) (gcc (Ubuntu 10.3.0-1ubuntu1) 10.3.0, GNU ld (GNU Binutils for Ubuntu) 2.36.1) #27-Ubuntu SMP Fri Jul 9 23:06:29 UTC 2021 (Ubuntu 5.11.0-25.27-generic 5.11.22)
Aug 18 09:25:52 toothbrush kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.11.0-25-generic root=/dev/mapper/vgubuntu-root ro quiet splash vt.handoff=7
Aug 18 09:25:52 toothbrush kernel: [    0.000000] KERNEL supported cpus:
Aug 18 09:25:52 toothbrush systemd[1]: Starting Flush Journal to Persistent Storage...
Aug 18 09:25:52 toothbrush kernel: [    0.000000]   Intel GenuineIntel
Aug 18 09:25:52 toothbrush kernel: [    0.000000]   AMD AuthenticAMD
Aug 18 09:25:52 toothbrush kernel: [    0.000000]   Hygon HygonGenuine
Aug 18 09:25:52 toothbrush kernel: [    0.000000]   Centaur CentaurHauls
Aug 18 09:25:52 toothbrush kernel: [    0.000000]   zhaoxin   Shanghai  
Aug 18 09:25:52 toothbrush systemd[1]: Finished Load Kernel Modules.

Something of interest here is that the last log before the shutdown was Aug 18 02:59:24 toothbrush boinc[3170]: 18-Aug-2021 02:59:24 [---] Resuming computation, indicating that BOINC was about to start running a CPU-intensive process.
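
To pin down exactly where logging stopped, and to confirm there was no clean shutdown sequence, the tail of the previous boot's journal can be dumped with precise timestamps:

# last entries of the previous boot; a clean shutdown would end with
# systemd "Journal stopped" / shutdown-target messages rather than cutting off
journalctl -b -1 -n 30 -o short-precise --no-pager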

Running cat /var/log/kern.log and looking at the nearby times gives less info:

Aug 17 23:47:21 toothbrush kernel: [100858.782842] pcieport 0000:00:01.0: AER: Corrected error received: 0000:00:01.0
Aug 17 23:47:21 toothbrush kernel: [100858.782850] pcieport 0000:00:01.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Aug 17 23:47:21 toothbrush kernel: [100858.782851] pcieport 0000:00:01.0:   device [8086:4c01] error status/mask=00000001/00002000
Aug 17 23:47:21 toothbrush kernel: [100858.782852] pcieport 0000:00:01.0:    [ 0] RxErr                  (First)
Aug 18 00:00:01 toothbrush kernel: [101618.605604] audit: type=1400 audit(1629259201.304:83): apparmor="DENIED" operation="capable" profile="/usr/sbin/cupsd" pid=3302495 comm="cupsd" capability=12  capname="net_admin"
Aug 18 00:00:05 toothbrush kernel: [101622.407042] audit: type=1400 audit(1629259205.104:84): apparmor="DENIED" operation="capable" profile="/usr/sbin/cups-browsed" pid=3302502 comm="cups-browsed" capability=23  capname="sys_nice"
Aug 18 09:25:52 toothbrush kernel: [    0.000000] microcode: microcode updated early to revision 0x40, date = 2021-04-11
Aug 18 09:25:52 toothbrush kernel: [    0.000000] Linux version 5.11.0-25-generic (buildd@lgw01-amd64-044) (gcc (Ubuntu 10.3.0-1ubuntu1) 10.3.0, GNU ld (GNU Binutils for Ubuntu) 2.36.1) #27-Ubuntu SMP Fri Jul 9 23:06:29 UTC 2021 (Ubuntu 5.11.0-25.27-generic 5.11.22)
Aug 18 09:25:52 toothbrush kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.11.0-25-generic root=/dev/mapper/vgubuntu-root ro quiet splash vt.handoff=7
Aug 18 09:25:52 toothbrush kernel: [    0.000000] KERNEL supported cpus:

Running: last -x | head | tac

joncrall :0           :0               Mon Aug 16 19:46 - crash (1+13:38)
runlevel (to lvl 5)   5.11.0-25-generi Mon Aug 16 19:47 - 09:26 (1+13:39)
joncrall pts/3        tmux(11727).%0   Mon Aug 16 20:41 - 21:49  (01:07)
joncrall pts/23       tmux(3215922).%0 Tue Aug 17 23:46 - crash  (09:39)
reboot   system boot  5.11.0-25-generi Wed Aug 18 09:25   still running
joncrall :0           :0               Wed Aug 18 09:25   still logged in
runlevel (to lvl 5)   5.11.0-25-generi Wed Aug 18 09:26   still running

I forget exactly what the "crash" vs "still running" column means in last reboot, so I'm not sure how to interpret this, or if there is any diagnostic info here.

So, if it is thermals, I don't think the system is logging it.

My Questions + Summary:

So, my machine is turning off, and I'm not sure if it's thermals, power, or something else. In an effort to mitigate thermal issues, I installed an additional 4 fans in my case, which now has 2 intake fans in the lower front, 1 intake at the bottom front, 2 exhausts towards the top back, and 1 exhaust in the back. The NH-D15 has both fans installed on it (I double-checked the orientations).

  • Are there other logs I can check to debug thermal issues?

  • Was I a fool to use air cooling? Could this just be temperature fluctuations that would be mitigated by an AIO CPU water cooler?

  • Is there any other hypothesis that I'm not considering?

Update 2021-10-01

Happy October. My machine is still driving me mad. But at least I've got a few angles that I can bisect to attempt to localize the problem.

I reconfigured the hardware in an attempt to determine if the 3090 was part of the problem, and I don't think it is.

I completely removed the 3090, so the 1080 Ti is now the only graphics card in the machine. I did not change the PCIe slot the 1080 Ti was connected to: previously the 3090 was in slot 1/3 (the slot closest to the CPU) and the 1080 Ti was in slot 3/3 (the furthest slot), and I simply removed the 3090 and kept the 1080 Ti in slot 3. I connected the DVI cables, booted up, and ran the PyTorch training code. I started training at 10:18 PM on 2021-09-30, and it was still running when I went to bed, but I woke up to a powered-off machine. Looking at the logs, it seems to have powered off around 2:14 AM on 2021-10-01, so it was able to chug along for nearly 4 hours before hitting the issue.

So, even without the 3090, the issue persists (yay, the overpriced GPU isn't the problem). Using the 3090 does seem to induce the issue faster, but it is not the root cause.

I'm wondering if I might have uncovered a vulnerability involving my hardware and the type of training I'm doing. Hopefully I can find an MWE so I can point to a particular instruction that causes this (recall that running standard ConvNet training with stock torch / tensorflow scripts does not trigger the issue; the code I'm running right now trains a transformer network with pytorch-lightning).

Before I do this I'm going to try a few more hardware configurations while I have the machine open.

Later: reproduced the error with the 1080 Ti in slot 0. I suppose the next test is to swap out the power supply. It's going to suck undoing my cable management, but it should either rule out the PSU or home in on it.

Later: it's not the PSU. I swapped it out for a 1000W PSU and ran an experiment at 7:43; poweroff at 7:57. So what's left: the torch "exploit", the CPU, the motherboard, something else? The initial build had bad RAM, which I got replaced, so I'll rerun the memory test as well.
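
Since the CPU and motherboard are now the leading suspects, one more thing worth checking between runs is whether the kernel recorded any machine-check or hardware errors (the second command assumes rasdaemon is installed):

# machine-check / hardware error hints from the previous boot's kernel log
journalctl -b -1 -k -g 'mce|machine check|hardware error'

# if rasdaemon is installed, its database summarizes corrected/uncorrected errors
ras-mc-ctl --summary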

Best Answer

The potential solution I found on 2021-10-03 has solved the problem! I've been running for 17 days with no issues at all.

The issue was a BIOS overclock setting in my Asus ROG STRIX Z590-E GAMING WIFI ATX LGA1200 Motherboard.

[screenshot: BIOS page showing the "ASUS MultiCore Enhancement" setting]

The BIOS setting ASUS MultiCore Enhancement was originally set to "Auto", and setting it to "Disabled" has resolved my issues.

My guess is that the Ai Tweaker was optimized for games and not scientific workloads.
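
For anyone hitting the same thing: as I understand it, with MultiCore Enhancement on "Auto" the board typically pushes all-core turbo (and power limits) past Intel's stock values, so the effect of the change can be observed from Linux by watching clocks and package power under an all-core load before and after flipping the setting. A sketch (turbostat comes from the linux-tools packages on Ubuntu; the stress-ng invocation is just an example load):

# generate an all-core load in one terminal
stress-ng --cpu "$(nproc)" --timeout 120s

# ...and watch per-core clocks, package power, and temperature in another
sudo turbostat --interval 5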
