After some recent updates, my computer no longer boots! Here's what I could determine:
- This is a very recent computer that was provided to me by corporate IT. It has a recent Intel CPU (Skylake generation).
- The computer runs Ubuntu 16.04.
- The computer last booted correctly some time in March. The problem is presumably due to a software update or a hardware bug.
- I have another computer running 16.04 with pretty much the same software installed (I used `apt-clone`), and it works just fine. It has different hardware (also amd64, but different CPU, different GPU, etc.).
- The kernel does start, the initrd works correctly. When I boot with a splash screen in graphics mode, I get prompted for the password for my dm-crypt volume, and the last thing I see is that it's mounted successfully.
- The hang occurs before I get a login prompt. When the computer hangs, it's a hard hang. Even Alt+SysRq doesn't respond. The CPU is evidently pegged at 100% since the fans turn on at full blast.
- I still have the kernel I was running before rebooting. When I select this kernel in the Grub menu, I get the same lockup. So it looks like this is a pre-existing kernel bug which gets triggered by something else — but what?
- If I switch off the splash screen (remove `splash` from the `linux` command line in Grub), I see a number of services starting, then it locks up.
- I can get a root shell by adding `init=/bin/sh` to the `linux` command line in Grub. I can even get further by adding `systemd.unit=basic.target systemd.debug-shell`. This starts a number of services and runs a root shell on tty9.
- If I run `systemctl start multi-user.target` from that root shell, the computer locks up. So presumably the problem is triggered by one of these services.
- I ran `systemctl list-dependencies multi-user.target` to see what services get started. I manually started the listed dependencies one by one, and everything started just fine.
So this looks like a hardware bug (since it occurs on one computer but not on the other one) that gets triggered by some software. But what software? Since the computer locks up so hard, I can't get any logs. I can't even get any useful console output.
Useful debugging techniques:
- Alt+SysRq: magic SysRq key, which lets you do things such as an emergency reboot. It accesses the kernel at a very low level, so it works in all but the worst crashes. In my case, Alt+SysRq doesn't respond, which shows how deep the crash goes.
- To modify the boot parameters, press and hold Shift a few seconds after switching the power on. You need to press it after the BIOS has initialized the keyboard, but before the operating system boots. This makes the Grub menu appear.
- At the Grub menu, press e to edit the command line for a menu entry. To change the Linux boot parameters, navigate to the line that starts with `linux`. On a modern Ubuntu, you'll find old kernels under “Advanced options for Ubuntu”. Once you've made the desired changes to the command line, press Ctrl+x to boot. Any changes you make here are for this boot only; they aren't saved to disk.
- Some useful options on the `linux` command line:
  - `quiet nosplash` hides almost all boot messages. Remove them to get messages on the console during boot, which is necessary to have any chance of diagnosing problems.
  - `recovery` gives you a root shell with almost no services. You'll need to know the root password. The “recovery mode” menu entry uses this.
  - `init=/bin/sh` gives you a root shell with no services at all. To resume normal boot, run `exec init`. You can pass systemd options at this point, e.g. `exec init --unit=basic.target` to start init and a few services (note that this does not start any way to log in, so you'd better have a shell running on another console). Note that the root filesystem is mounted read-only; run `mount -o remount,rw /` to be able to write to it.
  - `systemd.unit=basic.target` starts a very basic set of services. Note that this does not include any way to log in! You can make this the default by running `systemctl set-default basic.target` at a root prompt. To restore the original default target, run `systemctl set-default graphical.target` (or `systemctl set-default multi-user.target` for a server with no GUI).
  - `systemd.debug-shell` starts a root shell on tty9. You can enable this for every boot by running `systemctl enable debug-shell` at a root prompt. Don't forget to disable this after you've solved the problem with `systemctl disable debug-shell`. Press Alt+F9 to switch to tty9.
- See also Fedora systemd tips, Arch Linux boot problem tips.
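The persistent variant of the last two options (boot only to `basic.target`, keep a root shell on tty9) and its undo step can be wrapped in a small script. This is only a sketch: the function names `enter_debug_boot` and `leave_debug_boot` and the `DRY_RUN` switch are made up for illustration, and the `systemctl` invocations are the ones listed above.

```shell
#!/bin/sh
# Sketch: toggle the persistent debugging setup described in the list above.
# DRY_RUN is a hypothetical switch; 1 (the default here) only prints commands.
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Boot only as far as basic.target and keep a root shell on tty9.
enter_debug_boot() {
    run systemctl set-default basic.target
    run systemctl enable debug-shell
}

# Undo the above. Pass multi-user.target on a machine without a GUI.
leave_debug_boot() {
    run systemctl disable debug-shell
    run systemctl set-default "${1:-graphical.target}"
}
```

Source the file at a root prompt, set `DRY_RUN=0`, and call `enter_debug_boot` before rebooting. Once the problem is solved, `leave_debug_boot` removes the unauthenticated root shell again.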
Best Answer
The problem
It turns out that my problem is a known issue between the latest Intel microcode on (some?) Skylake CPUs and recent Linux kernels, which is mainly triggered by sssd. See Ubuntu bug #1759920 “intel-microcode 3.20180312.0 causes lockup at login screen(w/ linux-image-4.13.0-37-generic)”, and also a number of other bugs which turn out to be about the same issue, such as Ubuntu bug #1746806 “sssd appears to crash AWS c5 and m5 instances, cause 100% CPU” and Ubuntu bug #1746418 “System freezes when starting Xorg after installing linux-image-4.13.0-32-generic”. You are likely to encounter this bug if:
The bug is due to mitigations for the Spectre security issue that was published in January 2018. There's an incompatibility between some kernel code and some processor microcode that causes a lock-up in certain circumstances.
How to repair
Add the `noibpb` parameter to the kernel command line (1746418/14, 1759920/56). This should let you boot normally and perform some repairs.
This disables the vulnerability mitigation that causes the problem, which means that your computer is now vulnerable to some attacks. They're local attacks, i.e. the attacker needs to run code on your machine, but these attacks may potentially be carried out e.g. through JavaScript in a web browser.
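For a single boot, the parameter can be typed onto the `linux` line at the Grub menu. If it has to survive reboots, one common approach on Ubuntu is to append it to `GRUB_CMDLINE_LINUX_DEFAULT` in `/etc/default/grub` and then regenerate the Grub configuration with `update-grub`. A minimal sketch of the editing step, where the helper name `add_kernel_param` is hypothetical and GNU `sed` is assumed:

```shell
#!/bin/sh
# Sketch: append a parameter to GRUB_CMDLINE_LINUX_DEFAULT if it is missing.
# add_kernel_param is a hypothetical helper, not an existing tool.
# Usage: add_kernel_param PARAM [FILE]   (FILE defaults to /etc/default/grub)
# PARAM is used inside a regular expression, so keep it to plain words
# such as "noibpb".
add_kernel_param() {
    param=$1
    file=${2:-/etc/default/grub}

    # Nothing to do if the parameter is already a word on that line.
    if grep -Eq "^GRUB_CMDLINE_LINUX_DEFAULT=\"([^\"]* )?${param}( [^\"]*)?\"" "$file"; then
        return 0
    fi

    # Insert the parameter before the closing quote; keep a .bak backup.
    sed -i.bak -E "s/^(GRUB_CMDLINE_LINUX_DEFAULT=\")([^\"]*)\"/\1\2 ${param}\"/" "$file"
}
```

As root, `add_kernel_param noibpb` followed by `update-grub` would apply the change. After the next boot, `cat /proc/cmdline` shows whether the parameter took effect. Remove it again once a fixed kernel is installed.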
If you have no other option, you can keep `noibpb` on the kernel command line permanently until you can get a fixed kernel.
How I diagnosed the issue
I tried several things (see the question) and determined that the bug was triggered somewhere between reaching `basic.target` and reaching `multi-user.target`. So I set the default systemd target to `basic.target` (`systemctl set-default basic.target`) and enabled the `debug-shell` service (`systemctl enable debug-shell`) to get a root shell.
I ran `systemctl list-dependencies multi-user.target` and manually started the listed dependencies one by one. This did not trigger the crash.
Not all services are managed directly by systemd. Some are managed as Upstart services and some are managed as SysVinit scripts. I used a shell script to start all of them. Note: I only tested it once, and it crashed by design.
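The script that was actually used is not shown here. As a rough sketch of the idea only, one could write each service name to disk before starting it, so that after the hard hang the last logged name points at the culprit. The function name `start_one_by_one`, the log path and the `DRY_RUN` switch are assumptions for illustration. `service NAME start` is used because on Ubuntu it dispatches to whichever init mechanism manages the service.

```shell
#!/bin/sh
# Sketch (not the original script): start services one at a time and record
# each name on disk first, so the log survives a hard lockup.
LOG=${LOG:-/var/tmp/start-one-by-one.log}   # assumed log location
DRY_RUN=${DRY_RUN:-1}                        # 1 = only print what would run

# Reads one service name per line from standard input.
start_one_by_one() {
    while IFS= read -r name; do
        [ -n "$name" ] || continue
        printf 'starting %s\n' "$name" >>"$LOG"
        sync    # flush the log to disk before doing anything risky
        if [ "$DRY_RUN" = 1 ]; then
            printf 'would run: service %s start\n' "$name"
        else
            service "$name" start
            sleep 2    # give a hang a moment to show up before moving on
        fi
    done
}
```

One possible source of names is the runlevel directory, e.g. `ls /etc/rc2.d | sed -n 's/^S[0-9]*//p' | start_one_by_one` after setting `DRY_RUN=0`. The root filesystem must be writable for the log to be of any use.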
My computer crashed after starting `sssd`. From there, a web search on “sssd linux kernel hang” led me to https://bugs.launchpad.net/cloud-images/+bug/1746806 and to the diagnosis and solution.