Ubuntu – The optimal must-gather for laptop self-shutdown troubleshooting (likely due to overheating)

Tags: 17.10, drivers, graphics, nvidia, overheating

Premise

I have recently upgraded from 17.04 to 17.10.

I have the following video cards and processor (on a laptop):

00:02.0 VGA compatible controller: Intel Corporation Device 591b (rev 04)

01:00.0 VGA compatible controller: NVIDIA Corporation GP106M [GeForce GTX 1060 Mobile] (rev a1)

Intel Core i7 Quad Core Processor 7700HQ (2.8GHz, 3.8GHz Turbo)

In 17.04, I was using the nvidia-375 driver (build 66 if memory serves).

After upgrading, I noticed that my Steam games ran very poorly.

In some cases, some games would seemingly overheat the machine to the point that it automatically turned off.

I have added the graphics-drivers/ppa/ubuntu artful repository and switched to the later nvidia-387 driver, which seems to improve performance to similar levels as prior to my Ubuntu upgrade.

However, some games still seem to overheat my machine and lead to a hard automatic shutdown.

I have tried exploring the logs in /var/log a bit, but I am not knowledgeable enough to infer which information is relevant and which isn't, or whether there actually is any relevant information in the logs in such cases.

I have done the initial due-diligence, i.e. checking for dust and that the fans work (no dust, both fans work).

Actual question

I am not asking "how to fix this and make my games work"; I realize how hard that would be to answer, given the context.

However, I would like to understand what the recommended must-gather is for such a situation, so that I can either try to investigate on my own, ask a more specific question here, or (probably more suitable) convey that information to the game vendor and request support.

As mentioned, I strongly suspect this is related to video card drivers or CPU overheating.

Update 1

I have tried and replicated the issue with a few additional Nvidia driver versions.
Here is the list I tried so far, which all replicate the issue:

  • 375.66 – used to work well in 17.04, laggy graphics in 17.10 and replicates auto-shutdowns
  • 384.90 – not tried in 17.04, laggy graphics in 17.10 (but better than 375.66), replicates auto-shutdowns
  • 387.12 – seemingly no difference compared to 384.90 within context

I also noticed that every game demanding enough to push my processor into turbo mode replicates the issue (some take longer than others).

This last finding is interesting, because it means the shutdown is likely triggered after a certain time the CPU is in turbo mode, and might not be related to the GPU after all.

I have grepped for "temperat*" in /var/log, but the only matching entries come from repowerd. I don't really understand what they mean, but they show temperature=0.00, which I suspect I can disregard as noise in this context.

I'm about to change the thermald logging level and see if there's anything relevant once the issue replicates – will update later.

Update 2

I have replicated the issue after setting up the following debugging processes:

  • [as administrator] watch -n10 "sensors >> ~/sensors.log"
  • [as administrator] watch -n10 "hddtemp /dev/sda1 >> ~/hddtemp.log"
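A hard power-off can discard anything still sitting in write buffers, and a 10-second interval may miss a fast temperature spike. As a minimal sketch (not what was used above), the Python logger below samples the kernel's thermal zones and forces each line to disk. It assumes the zones are exposed under /sys/class/thermal with values in millidegrees Celsius; the function names, the log path and the 1-second default interval are illustrative choices.

```python
import glob
import os
import time


def read_zones(base="/sys/class/thermal"):
    """Return {"thermal_zoneN:type": degrees_celsius} for each readable zone."""
    readings = {}
    for zone in sorted(glob.glob(os.path.join(base, "thermal_zone*"))):
        try:
            with open(os.path.join(zone, "type")) as f:
                kind = f.read().strip()
            with open(os.path.join(zone, "temp")) as f:
                millideg = int(f.read().strip())
        except (OSError, ValueError):
            # A zone that cannot be read is skipped rather than aborting the run.
            continue
        readings[f"{os.path.basename(zone)}:{kind}"] = millideg / 1000.0
    return readings


def log_sample(handle, readings, now=None):
    """Write one timestamped line and push it through to the disk."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    fields = " ".join(f"{name}={temp:.1f}" for name, temp in sorted(readings.items()))
    handle.write(f"{stamp} {fields}\n")
    handle.flush()
    os.fsync(handle.fileno())  # reduce the chance of losing the last samples


def run_logger(path, interval=1.0, samples=None):
    """Log every `interval` seconds; samples=None means run until interrupted."""
    taken = 0
    with open(os.path.expanduser(path), "a") as out:
        while samples is None or taken < samples:
            log_sample(out, read_zones())
            taken += 1
            time.sleep(interval)
```

Calling `run_logger("~/thermal.log")` in a terminal before starting a game should leave the last synced sample at the end of the file once the machine is back up.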

Tailing those files after starting the machine again indicates the following, seemingly acceptable temperatures:

/dev/sda1: ST1000LX015-1U7172: 37°C

iwlwifi-virtual-0
Adapter: Virtual device
temp1:        +54.0°C  

acpitz-virtual-0
Adapter: Virtual device
temp1:        +79.0°C  

coretemp-isa-0000
Adapter: ISA adapter
Package id 0:  +78.0°C  (high = +100.0°C, crit = +100.0°C)
Core 0:        +77.0°C  (high = +100.0°C, crit = +100.0°C)
Core 1:        +78.0°C  (high = +100.0°C, crit = +100.0°C)
Core 2:        +72.0°C  (high = +100.0°C, crit = +100.0°C)
Core 3:        +75.0°C  (high = +100.0°C, crit = +100.0°C)

pch_skylake-virtual-0
Adapter: Virtual device
temp1:        +75.5°C

I've grepped the thermald logs from the syslog and piped them into another log file for readability.

In my debug-level thermald logs, I've tried looking for "common" patterns (I have no idea how to really read that information), within the time range of the occurrence.

Some entries did not occur close to the occurrence of the shutdown.

My search included keywords like "warn", "error", "fail", "critical", "invalid".

Here are the only findings I can share – all entries repeat, not necessarily in this order…

  • sysfs read failed constraint_0_max_power_uw – occurred before and close to shutdown
  • dram:powercap RAPL invalid max power limit range
  • failed to open /dev/acpi_thermal_rel
  • read_trip_points 1/trip_point_0_type:critical
  • index 0: type:critical temp:115000 hyst:1 zone id:1 sensor id:1 cdev size:0
  • Buggy max temp: to close to critical 90000
  • Core temp DTS :critical 100000, max 90000, psv 95000
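The numeric fields in these thermald entries appear to be millidegrees Celsius, which is how the kernel's sysfs thermal interface reports temperatures, so `temp:115000` would mean a 115 °C critical trip point. Below is a small Python sketch for pulling those values out of the grepped log. The regular expression is fitted only to the entry format quoted above, and `parse_trip` is a hypothetical helper name.

```python
import re

# Matches thermald debug entries shaped like the "index 0: type:critical ..." line.
TRIP_RE = re.compile(
    r"index (?P<index>\d+): type:(?P<type>\w+) temp:(?P<temp>\d+) "
    r"hyst:(?P<hyst>\d+) zone id:(?P<zone>\d+) sensor id:(?P<sensor>\d+)"
)


def parse_trip(line):
    """Return the trip point described by a thermald log line, or None."""
    match = TRIP_RE.search(line)
    if match is None:
        return None
    return {
        "index": int(match["index"]),
        "type": match["type"],
        # Assumption: the raw value is in millidegrees Celsius.
        "temp_c": int(match["temp"]) / 1000.0,
        "zone_id": int(match["zone"]),
        "sensor_id": int(match["sensor"]),
    }
```

Read the same way, the "Core temp DTS" entry would put the CPU's critical point at 100 °C, which matches the `crit` value printed by sensors.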

As my initial grep for thermald logs was a little wide, I also bumped into some maybe relevant kernel log entries:

  • thermal thermal_zone2: failed to read out thermal zone (-5) – occurred close to shutdown
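The `(-5)` in that kernel message is a negated errno value; 5 is `EIO` (input/output error) on Linux, which suggests the read of the sensor itself failed. To find out which sensor `thermal_zone2` is, its `type` file can be read from sysfs. A hedged Python sketch, assuming the standard /sys/class/thermal layout (`errno_name` and `describe_zone` are illustrative names):

```python
import errno
import os


def errno_name(code):
    """Translate a kernel-style negative error code such as -5 into its name."""
    return errno.errorcode.get(abs(code), "UNKNOWN")


def describe_zone(index, base="/sys/class/thermal"):
    """Return the type and current reading of thermal_zone<index>."""
    zone = os.path.join(base, f"thermal_zone{index}")
    info = {"zone": index, "type": None, "temp_c": None, "error": None}
    try:
        with open(os.path.join(zone, "type")) as f:
            info["type"] = f.read().strip()
        with open(os.path.join(zone, "temp")) as f:
            info["temp_c"] = int(f.read().strip()) / 1000.0  # millidegrees C
    except OSError as exc:
        # A failing sensor read surfaces here, for example as EIO.
        info["error"] = errno.errorcode.get(exc.errno, str(exc))
    return info
```

If `describe_zone(2)` reports an error while the other zones read normally, that would point towards a sensor read problem rather than genuine overheating in that zone.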

This would narrow things down to one or both of the entries logged close to the time of the shutdown.

However, I still have no clue how to read that data, or whether I am completely misled in gathering the data in the first place.

Maybe my watch interval should be much shorter?

Maybe there is actually no overheating, but some (kernel?) issue that prevents a proper read of the temperatures?

Any clarification welcome.

Last update, off-topic

I have now reinstalled Ubuntu 17.04.

The issue does not replicate.

The figures from sensors and hddtemp are slightly lower than the ones tested with 17.10, but only slightly.

Note that I need to boot the kernel with the pci=noacpi parameter on 17.04 in order to be able to start/shutdown properly. Maybe it's related…
I guess I'll stay clueless for now…

Best Answer

I had the same issue, which also started after upgrading to 17.10. My specs are quite similar too.

Finally, I was able to resolve it by simply booting in UEFI mode.

It makes my CPU driver behave more optimally:

  • In BIOS boot mode, the performance governor is always on with turbo boost, and the current frequency is always equal to the max frequency.
  • In UEFI mode, powersave is preferred, with performance mode kicking in when needed and the frequency rising on demand.
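The difference described in these two bullets can be checked directly. As a rough Python sketch: the kernel creates /sys/firmware/efi only when booted through UEFI, and the cpufreq directory of each CPU names the active scaling driver and governor. The function names and the `root` parameter (there so the function can be pointed at a test directory) are illustrative.

```python
import os


def _read(path):
    """Return the stripped contents of a sysfs file, or None if unreadable."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None


def boot_and_cpufreq_summary(root="/"):
    """Report the boot mode plus the scaling driver and governor of cpu0."""
    cpufreq = os.path.join(root, "sys/devices/system/cpu/cpu0/cpufreq")
    efi_dir = os.path.join(root, "sys/firmware/efi")
    return {
        # /sys/firmware/efi is present only on UEFI boots.
        "boot_mode": "UEFI" if os.path.isdir(efi_dir) else "BIOS",
        "driver": _read(os.path.join(cpufreq, "scaling_driver")),
        "governor": _read(os.path.join(cpufreq, "scaling_governor")),
    }
```

Running `boot_and_cpufreq_summary()` once in each boot mode gives a before/after record that can be attached to a bug report.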

No more overheating issues. Tested back and forth.

Update: Troubleshooting steps I've taken

Step 1: Check logs in /var/log. System and kernel logs reported temperature reaching high levels several minutes before each shutdown:

Nov 12 13:36:20 kernel: [ 899.138274] CPU0: Package temperature above threshold, cpu clock throttled (total events = 1)

Nov 12 13:36:20 kernel: [ 899.139245] CPU0: Core temperature/speed normal

Notice that in the very same second it says that temperature is back to normal. Weird, but nothing else was suspicious in logs.
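To see how often and when throttling was reported before a shutdown, the kernel messages can be extracted from the log. A minimal Python sketch follows, with a regular expression fitted only to the message format quoted above; `throttle_events` is a hypothetical helper.

```python
import re

THROTTLE_RE = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\] CPU(?P<cpu>\d+): "
    r"(?P<scope>Package|Core) temperature above threshold, "
    r"cpu clock throttled \(total events = (?P<count>\d+)\)"
)


def throttle_events(lines):
    """Return (uptime_seconds, cpu, scope, total_events) for each throttle message."""
    events = []
    for line in lines:
        match = THROTTLE_RE.search(line)
        if match:
            events.append(
                (
                    float(match["uptime"]),
                    int(match["cpu"]),
                    match["scope"],
                    int(match["count"]),
                )
            )
    return events
```

Fed the lines of the kernel log, it returns one tuple per throttle message; a total events count that climbs quickly just before the log stops would be consistent with the overheating theory.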

Step 2: Measure the temperature which triggers shutdown. I used lm-sensors to watch sensor values every second and dump the results to a file. The shutdowns happened at roughly 95 °C, a few degrees short of the 100 °C that should normally trigger a shutdown.

Step 3: Test various power/temperature management packages like tlp, laptop-mode-tools, cpufreq, cpupower, etc. -- none of them helped.

Step 4: Examine the /sys/devices/system/cpu/cpu*/cpufreq directories for clues. I noticed that the scaling_cur_freq, scaling_min_freq and scaling_max_freq files always showed the same value, 3500000 in my case. That is 3.5 GHz, which is a turbo boost frequency. Weird.
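This check can be repeated for every core at once. Here is a small Python sketch that reads the three files named in Step 4 and flags CPUs whose minimum, current and maximum frequencies are identical. It assumes the standard cpufreq sysfs layout with values in kHz; `read_freqs` and `pinned_cpus` are illustrative names.

```python
import glob
import os


def read_freqs(base="/sys/devices/system/cpu"):
    """Return {cpu_name: (min_khz, cur_khz, max_khz)} for CPUs exposing cpufreq."""
    result = {}
    for directory in sorted(glob.glob(os.path.join(base, "cpu[0-9]*", "cpufreq"))):
        values = []
        try:
            for name in ("scaling_min_freq", "scaling_cur_freq", "scaling_max_freq"):
                with open(os.path.join(directory, name)) as f:
                    values.append(int(f.read().strip()))
        except (OSError, ValueError):
            # Skip CPUs whose cpufreq files are missing or unreadable.
            continue
        result[os.path.basename(os.path.dirname(directory))] = tuple(values)
    return result


def pinned_cpus(freqs):
    """List CPUs whose min, current and max frequencies are all equal."""
    return sorted(cpu for cpu, (lo, cur, hi) in freqs.items() if lo == cur == hi)
```

On a machine behaving as described in Step 4, `pinned_cpus(read_freqs())` would list every core; after a change that restores frequency scaling, it should usually come back empty while the system is idle.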

Step 5: Use cpupower to manually change the CPU governor to powersave and later to throttle the CPU. Did not help. It looked, however, as if the CPU did not throttle even if the command succeeded.

Step 6: Change the CPU driver by disabling intel_pstate in the GRUB config file.

Step 7: Switch to alternative graphics card drivers -- did not help at all.

Step 8: Disassemble the laptop and clean it - TINY bit better, but did not resolve the issue :)

Step 9: Change boot mode since it could potentially influence low-level drivers. I repeated step 4 afterwards and noticed the CPU behaved differently indeed.

Maybe someone else will be able to enlighten us how this actually works :)
