Ubuntu – Regular freezing on Ryzen based system, 16.04 LTS and newer kernel

16.04 amd-processor crash freeze kernel

I am running a Ryzen 1700X CPU for long computations. Every now and then the system freezes while running 16.04 LTS (kernel 4.10). It does not reboot: there is no signal on the display, the keyboard and mouse do not respond, and I cannot connect via SSH.

I saved the kern.log and syslog files while running 16.04 LTS.

After reading several posts about issues with the new architecture, I decided to try a more recent kernel and moved to 4.12.8 (dated 16th Aug, 2017) from here.
I used this post on AskUbuntu to update the kernel.
The system booted fine and my application ran fine for ~10 hours.

After about 11 hours the system froze again, with the same messages in the syslog as seen with kernel 4.10 on 16.04 LTS, given below. (Kernel and syslog files with the 4.12 kernel: kern.log with new kernel and syslog with new kernel.)

Aug 18 17:27:13 vriksha systemd[1]: Starting Cleanup of Temporary Directories...
Aug 18 17:27:13 vriksha systemd-tmpfiles[4661]: [/usr/lib/tmpfiles.d/var.conf:14] Duplicate line for path "/var/log", ignoring.
Aug 18 17:27:13 vriksha systemd[1]: Started Cleanup of Temporary Directories.
Aug 18 17:28:25 vriksha ntpd[1516]: 209.242.224.117 local addr 192.168.2.15 -> <null>
Aug 18 17:35:01 vriksha CRON[4821]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Aug 18 17:35:40 vriksha systemd[1]: Started Session 5 of user vani.
Aug 18 17:42:18 vriksha sensord: Chip: amdgpu-pci-2700
Aug 18 17:42:18 vriksha sensord: Adapter: PCI adapter
Aug 18 17:42:18 vriksha sensord:   fan1: 1423 RPM
Aug 18 17:42:18 vriksha sensord:   temp1: 43.0 C
Aug 18 17:42:18 vriksha sensord: Chip: asus-isa-0000
Aug 18 17:42:18 vriksha sensord: Adapter: ISA adapter
Aug 18 17:42:18 vriksha sensord:   cpu_fan: 0 RPM
Aug 18 17:45:01 vriksha CRON[6142]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Aug 18 17:55:01 vriksha CRON[6431]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Aug 18 18:05:01 vriksha CRON[6607]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Aug 18 18:09:52 vriksha kernel: [ 3459.913711] perf: interrupt took too long (2529 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
Aug 18 18:12:18 vriksha sensord: Chip: amdgpu-pci-2700
Aug 18 18:12:18 vriksha sensord: Adapter: PCI adapter
Aug 18 18:12:18 vriksha sensord:   fan1: 1431 RPM
Aug 18 18:12:18 vriksha sensord:   temp1: 40.0 C
Aug 18 18:12:18 vriksha sensord: Chip: asus-isa-0000
Aug 18 18:12:18 vriksha sensord: Adapter: ISA adapter
Aug 18 18:12:18 vriksha sensord:   cpu_fan: 0 RPM
Aug 18 18:15:01 vriksha CRON[6785]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Aug 18 18:17:01 vriksha CRON[6825]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Aug 18 18:25:01 vriksha CRON[6967]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)

After the last line of the syslog excerpt above, the system froze and I had to press reset to reboot. So the freeze also occurs with the new kernel.
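
The excerpt above is mostly CRON and sensord entries, so it can help to isolate the kernel messages logged before a freeze. A minimal sketch, assuming the default Ubuntu log location /var/log/syslog; kernel_lines is a hypothetical helper name, not part of any package:

```shell
#!/bin/sh
# Print only the kernel messages from a syslog-format file.
# Usage: kernel_lines [FILE]   (FILE defaults to /var/log/syslog, an assumption)
kernel_lines() {
    grep ' kernel: ' "${1:-/var/log/syslog}"
}

# Example: show the last 20 kernel messages logged before the freeze.
# kernel_lines /var/log/syslog | tail -n 20
```

On the excerpt above this would leave only the "perf: interrupt took too long" line, which suggests the kernel logged nothing else before the hang.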

System details:

CPU: Ryzen 1700X, no SMT; BIOS version 3401 dated 12/08/2017 (AGESA 1071)
RAM 32 GB
AMD RX 470 GPU 
Lubuntu 16.04 LTS, LXDE with Openbox

Can somebody help me out?


Updates

The application I am running does not use gcc or g++.

  1. lspci output is here.

  2. dmesg | egrep 'drm|radeon' output is here

  3. (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1) is related to the sysstat package which I removed. The problem still exists.

  4. glxinfo | grep -i open output for AMD RX 470 GPU is given below

    glxinfo | grep -i open 
    OpenGL vendor string: X.Org
    OpenGL renderer string: Gallium 0.4 on AMD POLARIS10 (DRM 3.15.0 / 4.12.8-041208-generic, LLVM 4.0.0)
    OpenGL core profile version string: 4.5 (Core Profile) Mesa 17.0.7
    OpenGL core profile shading language version string: 4.50
    OpenGL core profile context flags: (none)
    OpenGL core profile profile mask: core profile
    OpenGL core profile extensions:
    OpenGL version string: 3.0 Mesa 17.0.7
    OpenGL shading language version string: 1.30
    OpenGL context flags: (none)
    OpenGL extensions:
    OpenGL ES profile version string: OpenGL ES 3.1 Mesa 17.0.7
    OpenGL ES profile shading language version string: OpenGL ES GLSL ES 3.10
    OpenGL ES profile extensions:
    
  5. Only one display is connected to this computer. The freezes happen only when running CPU-intensive tasks for long periods. (I leave the system with its display off and control and check it over an SSH connection. After 5-6 hours or so, the SSH connection becomes unavailable. When I come back to the machine, moving the mouse or pressing keys does not bring the display back, and a hard reset is required.)

  6. To check whether the GPU is the cause, I switched to an nVidia GTX 1080 with the proprietary driver installed; under a similar load the system still freezes. I switched back to the AMD GPU and the problem persists, so I rule out the GPU type as the cause. For the nVidia card, the glxinfo | grep -i open output is the following:

    OpenGL vendor string: NVIDIA Corporation
    OpenGL renderer string: GeForce GTX 1080/PCIe/SSE2
    OpenGL core profile version string: 4.5.0 NVIDIA 384.81
    OpenGL core profile shading language version string: 4.50 NVIDIA
    OpenGL core profile context flags: (none)
    OpenGL core profile profile mask: core profile
    OpenGL core profile extensions:
    OpenGL version string: 4.5.0 NVIDIA 384.81
    OpenGL shading language version string: 4.50 NVIDIA
    OpenGL context flags: (none)
    OpenGL profile mask: (none)
    OpenGL extensions:
    OpenGL ES profile version string: OpenGL ES 3.2 NVIDIA 384.81
    OpenGL ES profile shading language version string: OpenGL ES GLSL ES 3.20
    OpenGL ES profile extensions:
    
  7. Updated the BIOS to version 3401 (12/08/2017, AGESA 1071) and the problem persists.
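
Since the machine runs unattended and is only checked over SSH, a small heartbeat log can narrow down when a freeze happened. This is only a sketch: log_heartbeat and the log path are hypothetical names, and it assumes the load average is readable from /proc/loadavg, as on a standard Linux system:

```shell
#!/bin/sh
# Append one timestamped line with the current load average to a log file.
# The second argument overrides the load-average source (defaults to /proc/loadavg).
log_heartbeat() {
    log="$1"
    loadavg_file="${2:-/proc/loadavg}"
    if [ -r "$loadavg_file" ]; then
        # Keep only the 1, 5 and 15 minute load averages.
        load=$(cut -d ' ' -f 1-3 "$loadavg_file")
    else
        load="n/a"
    fi
    printf '%s load=%s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$load" >> "$log"
    sync   # flush to disk so the line survives a hard reset
}

# Example loop (hypothetical log path), one entry per minute:
# while :; do log_heartbeat "$HOME/heartbeat.log"; sleep 60; done
```

After a hard reset, the last line in the log gives the freeze time to within one interval.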

Best Answer

I had the same problem. This is what I did to solve it:

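Set the CPU frequency governor to performance (cpufreq-set comes from the cpufrequtils package, so install that first if the command is missing):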

sudo cpufreq-set -r -g performance

Make the setting persistent across reboots:

sudo apt-get install cpufrequtils
echo 'GOVERNOR="performance"' | sudo tee /etc/default/cpufrequtils
sudo systemctl disable ondemand
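
To confirm that the governor change took effect, the active governor can be read back from sysfs. A minimal sketch, assuming the standard cpufreq sysfs layout under /sys/devices/system/cpu; show_governors is a made-up helper name for this example:

```shell
#!/bin/sh
# Print each CPU's active scaling governor, one "cpuN: governor" line per CPU.
# The optional argument overrides the sysfs root (assumed default shown below).
show_governors() {
    base="${1:-/sys/devices/system/cpu}"
    for f in "$base"/cpu[0-9]*/cpufreq/scaling_governor; do
        # Skip CPUs without a readable cpufreq entry (or an unmatched glob).
        [ -r "$f" ] || continue
        cpu=${f#"$base"/}
        cpu=${cpu%%/*}
        printf '%s: %s\n' "$cpu" "$(cat "$f")"
    done
}

show_governors
```

If the steps above worked, every line should end in "performance", both right after running cpufreq-set and again after a reboot.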