From: What are all these "Bug: soft lockup" messages about?
Situation
In the system log (/var/log/messages or journalctl), many messages like the
following are printed:
May 25 07:23:59 XXXXXXX kernel: [13445315.881356] BUG: soft lockup - CPU#16 stuck for 23s! [yyyyyyy:81602]
followed by various stack traces. This document tries to explain what
the soft lockup messages mean.
The error message itself doesn't tell you what is causing the problem.
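To get an overview of which CPUs and tasks the messages point at, the log lines can be reduced to their key fields. This is only a minimal sketch, assuming the message format quoted above; the function name parse_lockup is made up for this illustration and is not part of any tool.

```shell
# parse_lockup: hypothetical helper. Reads kernel log lines on stdin and
# prints CPU number, stuck time, task name and PID for each soft lockup line.
# Lines that do not match the format are silently skipped.
parse_lockup() {
    sed -n 's/.*soft lockup - CPU#\([0-9]*\) stuck for \([0-9]*\)s! \[\(.*\):\([0-9]*\)\]$/cpu=\1 secs=\2 comm=\3 pid=\4/p'
}

# Example with the log line quoted above
printf '%s\n' 'May 25 07:23:59 XXXXXXX kernel: [13445315.881356] BUG: soft lockup - CPU#16 stuck for 23s! [yyyyyyy:81602]' | parse_lockup
# prints: cpu=16 secs=23 comm=yyyyyyy pid=81602
```

On a live system the input would come from the kernel log instead, for example: journalctl -k | parse_lockup | sort | uniq -c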
Cause
A 'soft lockup' is defined as a bug that causes the kernel to loop in
kernel mode for more than 20 seconds, without giving other tasks a
chance to run. The watchdog will send a non-maskable interrupt
(NMI) to all CPUs in the system, which in turn print the stack traces of
their currently running tasks.
Reducing the server load is the usual solution:
Resolution
Under normal circumstances those messages may go away if the load
decreases. This 'soft lockup' can happen if the kernel is busy
working on a huge number of objects which need to be scanned, freed or
allocated. The stack traces of those tasks can give a
first idea of what the tasks were doing. However, to be able to examine
the cause behind the messages, a kernel dump would be needed.
You cannot disable those messages. However, in some situations
increasing the time after which a soft lockup is reported can relax
the situation.
To do so, increase the following sysctl parameter:
kernel.watchdog_thresh
The default value for this parameter is 10 (seconds); the soft lockup
threshold is twice this value, which gives the 20 seconds mentioned above.
Doubling the value might be a good start.
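The doubling described above can be sketched as follows. This is an illustration, not a tested recommendation for any particular system: the commands that change the setting are left commented out, and the drop-in file name 99-watchdog.conf is an assumption (any name ending in .conf under /etc/sysctl.d should work).

```shell
# Read the current threshold in seconds; assume the default of 10 if
# the file cannot be read (for example on a non-Linux system).
current=$(cat /proc/sys/kernel/watchdog_thresh 2>/dev/null || echo 10)
new=$((current * 2))
echo "watchdog_thresh: current=$current proposed=$new"

# To apply at runtime (needs root), run:
#   sudo sysctl -w kernel.watchdog_thresh=$new
# To keep the value across reboots (file name is an assumption):
#   echo "kernel.watchdog_thresh = $new" | sudo tee /etc/sysctl.d/99-watchdog.conf
#   sudo sysctl --system
```

Note that a larger threshold only delays the report; it does not remove whatever keeps the CPU busy.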
Possible swap/memory problem.
BIOS
You have BIOS version S1200SP.86B.03.01.0042.013020190050, dated 01/30/2019.
A newer BIOS, dated June 2020, is available for download.
Note: Have good backups before updating the BIOS.
Memtest
Go to https://www.memtest86.com/ and download/run their free memtest
to test your memory. Get at least one complete pass of all the 4/4 tests to confirm good memory. This may take many hours to complete.
Update #1:
As I previously thought... you have swap problems.
You have THREE swap locations, as seen in /etc/fstab!
UUID="X-X-X-X-X" swap swap defaults 0 0
UUID="X-X-X-X-X" swap swap defaults 0 0
/swapfile swap swap defaults 0 0
Run:
sudo swapoff -a # turn off swap
Then comment out ALL three of the above lines in /etc/fstab.
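If you prefer not to edit the file by hand, the commenting-out step can be previewed with sed. The sketch below only prints the result and never modifies the file; the function name comment_out_swap is made up for this illustration, and the pattern assumes that swap entries contain the word swap as a whitespace-separated field, as in the three lines above.

```shell
# comment_out_swap: hypothetical helper. Prints the given fstab-style file
# with every active (not yet commented) swap entry prefixed by '#'.
# It does not modify the file.
comment_out_swap() {
    sed -E '/^[^#].*[[:space:]]swap[[:space:]]/ s/^/#/' "$1"
}

# Try it on a scratch copy first
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
UUID="X-X-X-X-X" swap swap defaults 0 0
UUID="X-X-X-X-X" swap swap defaults 0 0
/swapfile swap swap defaults 0 0
EOF
comment_out_swap "$tmp"
rm -f "$tmp"
```

To preview the change against the real file: comment_out_swap /etc/fstab | diff /etc/fstab -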
It's never OK to completely disable swap, and it's not appropriate to have too small a swap either. You have both problems.
Let's create an appropriate /swapfile for your system.
Note: Incorrect use of the dd command can cause data loss. Suggest copy/paste.
sudo swapoff -a # turn off swap
sudo rm -i /swapfile # remove old /swapfile
sudo dd if=/dev/zero of=/swapfile bs=1M count=4096
sudo chmod 600 /swapfile # set proper file protections
sudo mkswap /swapfile # init /swapfile
sudo swapon /swapfile # turn on swap
free -h # confirm 32G RAM and 4G swap
Add this line to /etc/fstab...
/swapfile none swap sw 0 0
Then reboot the system and verify operation.
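One way to verify after the reboot is to check that /swapfile shows up in /proc/swaps. This is a rough sketch; the function name swap_active is made up for this illustration, and the optional second argument exists only so the check can be tried against a sample file.

```shell
# swap_active: hypothetical helper. Succeeds if the given path is listed as
# active swap. The second argument (default /proc/swaps) is the table to read.
swap_active() {
    awk -v f="$1" 'NR > 1 && $1 == f { found = 1 } END { exit !found }' "${2:-/proc/swaps}" 2>/dev/null
}

if swap_active /swapfile; then
    echo "/swapfile is active"
else
    echo "/swapfile is NOT active"
fi
```

The same information is available interactively from swapon --show and free -h.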
If it all works, you can use gparted to delete the two disk partitions with the UUIDs shown in the commented-out lines in /etc/fstab. Be careful here, and make sure that you've got the correct partitions to delete. Then delete those three commented-out lines in /etc/fstab.
Best Answer
I had the same bug on my machine. I fixed it by appending
nouveau.modeset=0
to the GRUB command line.
To do so, when you're in the GRUB menu, press e to edit the command line. Then append
nouveau.modeset=0
at the end of the line beginning with linux. Then press F10 to continue booting, log in to your user session, and try to reboot the computer to see whether the problem is gone or you still have the issue.
If the problem is gone, you can make the change permanent by editing the GRUB config:
sudo nano /etc/default/grub
Append nouveau.modeset=0 inside the quotes of the line GRUB_CMDLINE_LINUX_DEFAULT="...". Then update your GRUB setup with:
sudo update-grub
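After rebooting, whether the parameter actually reached the kernel can be checked from the running system. A minimal sketch, assuming /proc/cmdline is available; the function name has_boot_param is made up for this illustration, and the optional second argument exists only so the check can be tried against a sample file.

```shell
# has_boot_param: hypothetical helper. Succeeds if the exact parameter is on
# the kernel command line. The second argument (default /proc/cmdline) is
# the file to read.
has_boot_param() {
    tr ' ' '\n' < "${2:-/proc/cmdline}" | grep -qxF -- "$1"
}

if has_boot_param nouveau.modeset=0 2>/dev/null; then
    echo "nouveau.modeset=0 is active"
else
    echo "nouveau.modeset=0 is not on the kernel command line"
fi
```

If the parameter is missing after a permanent change, the usual suspects are a typo in /etc/default/grub or a forgotten sudo update-grub.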