Linux – Why can’t the Linux kernel boot on the new Intel i7-6500U CPU?

boot, kernel-panic, linux-kernel

It's hard to isolate the CPU, I know, but the errors I'm seeing suggest that's the issue.

This is definitely not a malfunctioning/broken hardware problem. I've been running Windows 10 all day for the past several days and this thing is flippin' fast! No crashing. More importantly, I ran Windows memory checker. Memory is all good.

machine specs

The machine is a brand new Lenovo Yoga 710 15"

x64
Intel i7-6500U CPU @ 2.50 GHz, 2601 MHz, 2 Cores, 4 Logical Processors
SMBIOS Version 2.8
BIOS Mode UEFI
16.0 GB DDR4 Ram
256 GB SSD

isolating to linux kernel (?)

I've seen the same problems on both

  • archlinux-2016.08.01-dual.iso
  • ubuntu-gnome-16.04.1-desktop-amd64.iso

For Arch, the problem appeared only intermittently when booting from the USB stick. I managed to get Arch installed on a 100GB ext4 partition on the drive. That install has the same issue during boot most of the time (about 90%). If I get past the boot, the issue appears at random after the first couple of terminal commands I execute, eventually causing a complete deadlock.

For Ubuntu, the USB stick doesn't even boot. I get stopped by these same errors immediately. Deadlock…

So many errors…

The journal is stuffed with memory-related errors whenever this happens, but the key errors I'm seeing are:

  • General protection fault 0000[#1] PREEMPT SMP
  • RIP kmem_cache_alloc
  • RIP kmem_cache_alloc_trace

I've seen some of the same stack traces several times for these errors:

rbt_memtype_copy_nth_element
on_each_cpu
flush_tlb_kernel_range
__purge_vmap_area_lazy
vm_unmap_aliases
change_page_attr_set_clr
set_memory_ro
frob_text.isra
module_enable_ro

kobject_create
kobject_create_and_add
load_module
__symbol_put
kernel_read
sys_finit_module
entry_SYSCALL_64_fastpath

kmem_cache_alloc_trace
allocate_cgrp_cset_links
...
sys_write
entry_SYSCALL_64_fastpath
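
To see which functions keep turning up in the RIP lines, the journal text can be tallied with a few lines of Python. This is only a minimal sketch and not part of the original debugging session: the function name count_rip_symbols and the regular expression are my own assumptions, and the real lines in your journal will be longer than the shortened forms quoted above (addresses, offsets), so the pattern is deliberately loose.

```python
import re
from collections import Counter

# Assumption: the faulting symbol is the last word on a "RIP" line,
# optionally followed by an offset such as "+0x11e/0x1f0".
SYMBOL_AT_END = re.compile(
    r"([A-Za-z_][\w.]*)(?:\+0x[0-9a-fA-F]+/0x[0-9a-fA-F]+)?\s*$"
)


def count_rip_symbols(log_text):
    """Count how often each symbol appears at the end of a RIP line."""
    counts = Counter()
    for line in log_text.splitlines():
        if "RIP" not in line:
            continue
        match = SYMBOL_AT_END.search(line)
        # Skip lines where the only word found is "RIP" itself.
        if match and match.group(1) != "RIP":
            counts[match.group(1)] += 1
    return counts
```

Feeding it saved journal output (for example, text copied from journalctl into a file) gives a quick view of whether the faults cluster in the allocator functions, as they did here.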

Linux also keeps promising that it's fixing the problem

Fixing recursive fault but reboot is needed!

I wish..

intel ucode

I also tried installing the intel-ucode package in my Arch install. I saw in the dmesg logs that the microcode was updated, but unfortunately that did not solve my problem.

What could be the issue? How can I fix it?


EDIT

Additional note.

The general protection fault messages and "lock up detected"-type messages typically reference a CPU. I've seen CPU0, CPU1, CPU2 and CPU3 in these messages. It seems like something is causing the CPUs to not get along, like they're all in a deadlock trying to clear out cache memory or something.


EDIT2

BIOS mentioned in error

I see this bit of information in some errors:

LENOVO 80U01LENOVO YOGA710-1 BIOS OGCN20WW(v1.04) 6/30/2016

Not sure if that is helpful to a pro in understanding the issue…


EDIT3

maxcpus=1

I was looking for debugging options in the kernel params documentation and found maxcpus.

If I set maxcpus=1, the problem goes away. So it would seem that the problem is some kind of shared cache memory violation.


EDIT4

maxcpus=1 + Gnome = broken again

Although maxcpus=1 seemed to make the system work with just one CPU, I then installed GNOME and ran systemctl enable gdm.service.

Now, when I reboot, I get all of my errors back again, but this time they're all happening on CPU0.

So it seems that something is still causing a memory violation even with one CPU.


EDIT5

nolapic

So using nolapic seems to get everything "working"

BUT by using nolapic, I effectively disable my other CPU and all multithreading in the 1 working CPU.

I'm trying to use this machine for OpenMP, and after booting with nolapic, OpenMP and the Linux kernel can only find 1 thread, 1 CPU. That sucks!
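
A quick way to confirm what the kernel actually exposes after booting with one of these parameters is to compare the logical CPU count with the kernel command line. The snippet below is a minimal sketch under my own assumptions: the helper names are made up for illustration, and it only looks for the three options mentioned in this post (maxcpus, nolapic, acpi=off).

```python
import os


def cpu_limiting_params(cmdline):
    """Return the options from this post that appear in a kernel command line."""
    found = []
    for token in cmdline.split():
        name = token.split("=", 1)[0]
        if name == "maxcpus" or token in ("nolapic", "acpi=off"):
            found.append(token)
    return found


def logical_cpu_count():
    """Logical CPUs reported by the OS (None if it cannot be determined)."""
    return os.cpu_count()


if __name__ == "__main__":
    try:
        # /proc/cmdline holds the parameters the running kernel was booted with.
        with open("/proc/cmdline") as f:
            cmdline = f.read()
    except OSError:
        cmdline = ""
    print("logical CPUs:", logical_cpu_count())
    print("CPU-limiting parameters:", cpu_limiting_params(cmdline))
```

On this laptop a healthy boot should report 4 logical CPUs; with nolapic it reported 1.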

I also tried intel_idle.max_cstate=0, 1, 2, etc., but that did not fix the boot problem.

What else could cause the kernel to fail to utilize my multi-core machine?

Best Answer

Turns out the issue was i2c_hid

This seems to be some kind of touchpad driver. For some reason, when I disable it, I can still use my touchpad. It could be that the touch screen on the laptop was using this driver, too, because that doesn't work.

I don't like to mung up my laptop screen with fingerprints, anyway... So bye bye i2c_hid!

I fixed it by adding this to the kernel params: modprobe.blacklist=i2c_hid
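
To double-check that the blacklist took effect after a reboot, the kernel command line and the list of loaded modules can be compared. This is a hedged sketch, not a tool I used at the time: the function names are hypothetical, it assumes the comma-separated form of modprobe.blacklist, and it expects the text of /proc/cmdline and /proc/modules to be passed in.

```python
def _normalise(name):
    # Module names treat "-" and "_" as equivalent.
    return name.replace("-", "_")


def blacklisted_modules(cmdline):
    """Module names listed in modprobe.blacklist= on a kernel command line."""
    names = []
    for token in cmdline.split():
        if token.startswith("modprobe.blacklist="):
            value = token.split("=", 1)[1]
            names.extend(_normalise(n) for n in value.split(",") if n)
    return names


def loaded_modules(proc_modules_text):
    """Module names from /proc/modules text (first field of each line)."""
    return {
        _normalise(line.split()[0])
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def still_loaded(cmdline, proc_modules_text):
    """Blacklisted modules that are loaded anyway."""
    loaded = loaded_modules(proc_modules_text)
    return [name for name in blacklisted_modules(cmdline) if name in loaded]
```

An empty result from still_loaded(open("/proc/cmdline").read(), open("/proc/modules").read()) would suggest that i2c_hid really stayed out of the kernel on that boot.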

Although nolapic also worked, it disabled all but one core of the processor.

I'd highly recommend that anyone else out there not use acpi=off or nolapic for this reason.

Using these options is a nuclear weapon that might make your machine work, but you will lose a lot of performance and/or I/O devices as collateral damage. It's a good starting point to get booted, and then you can pore through the journalctl output like I did to analyse the boots that fail.
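
For anyone repeating this, the kernel errors from a failed boot can be pulled out of the journal once you are booted with a workaround. The sketch below is my own illustration rather than the exact commands I ran: the helper names are hypothetical, and it assumes the journal is stored persistently, otherwise previous boots are not available.

```python
import shutil
import subprocess


def kernel_errors_command(boot_offset=-1):
    """Build a journalctl command for kernel messages of priority err or worse.

    boot_offset=-1 means the previous boot, 0 the current one.
    """
    return [
        "journalctl",
        "--dmesg",                 # kernel messages only
        f"--boot={boot_offset}",   # which boot to read
        "--priority=err",          # err and more severe
        "--no-pager",
    ]


def read_kernel_errors(boot_offset=-1):
    """Return the journal text, or None if journalctl is not installed."""
    if shutil.which("journalctl") is None:
        return None
    result = subprocess.run(
        kernel_errors_command(boot_offset),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout
```

Calling read_kernel_errors(-1) right after a successful boot should return the errors logged by the boot that failed before it, provided that boot got far enough to write to the journal.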

Good luck to those who find this.
