Ubuntu – Memory errors with Ubuntu but not with MemTest86+

data-corruption ram

I got some btrfs and ext4 errors. After deciding to test my RAM, I got the following repeating errors with memtester. I always get similar errors after memtester has been running for a while, usually within an hour, though once it took 4-5 hours.

My computer's RAM is soldered, and there is one additional empty slot. There are no settings in the BIOS to disable the on-board RAM.

I've run:

  • Memtest86+ for 8 passes (~8 hours)
  • MemTest86 for 18 passes (~9 hours)
  • memtester and stressapptest on Fedora 27 default, installed on a USB stick (~10 hours)
  • memtester and stressapptest on Ubuntu 17.10 Live default (~2 hours)
  • memtester and stressapptest on Ubuntu 17.10 on USB stick (~8 hours)
  • # debsums --changed: the only changed file was an image of a theme.

They didn't print any errors.

I am using Ubuntu 17.10 (upgraded from 17.04) with default kernel. Kernel is not tainted. It's an ASUS laptop with Intel Haswell i3.

  • Also tested with Linux 4.14.13 and 4.15.0-rc3,rc4, mainline.
  • Also tested with purged intel-microcode package.

The error is reproducible whether Nouveau is disabled or enabled; no nvidia binary drivers are loaded.

I blacklisted the following modules: mtd, intel_spi_platform and intel_spi, because they don't load on a default Fedora 27 install and they seem to brick some Lenovo laptops. The errors have not stopped.

uname -a's output

Linux hostname 4.13.0-19-generic #22-Ubuntu SMP Mon Dec 4 11:58:07 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

# lsmod's output

https://paste.ubuntu.com/26222245/

Fedora 27's # lsmod's output

https://paste.ubuntu.com/26226473/

Current Situation

I put my HDD into a backup laptop that I know to be good and ran the tests there. I got the errors. Now I am pretty sure this is a software issue. I have never been able to trigger the errors on my own laptop with a fresh Ubuntu or Fedora install, despite many hours of trying.

What should I do?

A sample of the errors:

Loop 6:
  Stuck Address       : ok         
  Random Value        : ok
  Compare XOR         : ok
  Compare SUB         : ok
  Compare MUL         : ok
  Compare DIV         : ok
  Compare OR          : ok
  Compare AND         : ok
  Sequential Increment: ok
  Solid Bits          : ok         
  Block Sequential    : ok         
  Checkerboard        : ok         
  Bit Spread          : ok         
  Bit Flip            : testing 262
FAILURE: 0x00000000 != 0xfffffffeffffffff at offset 0x0ef94000.
FAILURE: 0x00000000 != 0x100000000 at offset 0x0ef94008.
FAILURE: 0x00000000 != 0xfffffffeffffffff at offset 0x0ef94010.
FAILURE: 0x00000000 != 0x100000000 at offset 0x0ef94018.
FAILURE: 0x00000000 != 0xfffffffeffffffff at offset 0x0ef94020.
FAILURE: 0x00000000 != 0x100000000 at offset 0x0ef94028.
FAILURE: 0x00000000 != 0xfffffffeffffffff at offset 0x0ef94030.
FAILURE: 0x00000000 != 0x100000000 at offset 0x0ef94038.
  Walking Ones        : ok         
  Walking Zeroes      : ok         
  8-bit Writes        : ok
  16-bit Writes       : ok

A similar error with both RAM slots populated:

Loop 1:
  Stuck Address       : ok         
  Random Value        : ok
  Compare XOR         : ok
  Compare SUB         : ok
  Compare MUL         : ok
  Compare DIV         : ok
  Compare OR          : ok
  Compare AND         : ok
  Sequential Increment: ok
  Solid Bits          : ok         
  Block Sequential    : ok         
  Checkerboard        : ok         
  Bit Spread          : testing   4
FAILURE: 0x00000000 != 0x00000050 at offset 0x7da80000.
FAILURE: 0x00000000 != 0xffffffffffffffaf at offset 0x7da80008.
FAILURE: 0x00000000 != 0x00000050 at offset 0x7da80010.
FAILURE: 0x00000000 != 0xffffffffffffffaf at offset 0x7da80018.
FAILURE: 0x00000000 != 0x00000050 at offset 0x7da80020.
FAILURE: 0x00000000 != 0xffffffffffffffaf at offset 0x7da80028.
FAILURE: 0x00000000 != 0x00000050 at offset 0x7da80030.
FAILURE: 0x00000000 != 0xffffffffffffffaf at offset 0x7da80038.
  Bit Flip            : setting 141

An error of stressapptest:

Report Error: miscompare : DIMM Unknown : 1 : 157s
Hardware Error: miscompare on CPU 2(0x2) at 0x7fcc0726e000(0xb0d18:DIMM Unknown): read:0x0000000000000000, reread:0x0000000000000000 expected:0x4a4a4a4a4a4a4a4a
Report Error: miscompare : DIMM Unknown : 1 : 157s
Hardware Error: miscompare on CPU 2(0x2) at 0x7fcc0726e008(0xb0d18:DIMM Unknown): read:0x0000000000000000, reread:0x0000000000000000 expected:0x4a4a4a4a4a4a4a4a
Report Error: miscompare : DIMM Unknown : 1 : 157s
Hardware Error: miscompare on CPU 2(0x2) at 0x7fcc0726e010(0xb0d18:DIMM Unknown): read:0x0000000000000000, reread:0x0000000000000000 expected:0x4a4a4a4a4a4a4a4a
Report Error: miscompare : DIMM Unknown : 1 : 157s
Hardware Error: miscompare on CPU 2(0x2) at 0x7fcc0726e018(0xb0d18:DIMM Unknown): read:0x0000000000000000, reread:0x0000000000000000 expected:0x4a4a4a4a4a4a4a4a
Report Error: miscompare : DIMM Unknown : 1 : 157s
Hardware Error: miscompare on CPU 2(0x2) at 0x7fcc0726e020(0xb0d18:DIMM Unknown): read:0x0000000000000000, reread:0x0000000000000000 expected:0x4a4a4a4a4a4a4a4a
Report Error: miscompare : DIMM Unknown : 1 : 157s
Hardware Error: miscompare on CPU 2(0x2) at 0x7fcc0726e028(0xb0d18:DIMM Unknown): read:0x0000000000000000, reread:0x0000000000000000 expected:0x4a4a4a4a4a4a4a4a
Report Error: miscompare : DIMM Unknown : 1 : 157s
Hardware Error: miscompare on CPU 2(0x2) at 0x7fcc0726e030(0xb0d18:DIMM Unknown): read:0x0000000000000000, reread:0x0000000000000000 expected:0x4a4a4a4a4a4a4a4a
Report Error: miscompare : DIMM Unknown : 1 : 157s
Hardware Error: miscompare on CPU 2(0x2) at 0x7fcc0726e038(0xb0d18:DIMM Unknown): read:0x0000000000000000, reread:0x0000000000000000 expected:0x4a4a4a4a4a4a4a4a

I suspect that Ubuntu's configuration combined with my laptop's hardware is somehow to blame for these errors. They almost always come in packs of eight.
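To check the "packs of eight" pattern across a longer log, one could group the failing offsets by 64-byte aligned block (eight 8-byte words). This is only a sketch: failures_per_block is a hypothetical helper name, it assumes bash, and it assumes memtester's FAILURE lines look exactly like the samples above.

```shell
# failures_per_block (hypothetical helper): reads memtester output on stdin
# and prints how many FAILURE offsets fall into each 64-byte aligned block.
failures_per_block() {
    # Keep only the "at offset 0x..." part of each FAILURE line.
    grep -o 'at offset 0x[0-9a-fA-F]*' |
    while read -r _ _ off; do
        # Clear the low 6 bits to get the start of the 64-byte block.
        printf '0x%08x\n' $(( off & ~63 ))
    done | sort | uniq -c
}

# Usage (memtester.log is an assumed file name):
#   failures_per_block < memtester.log
```

A count of 8 next to a block address would mean the whole 64-byte block was wrong, as in the samples above.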

Unimportant, loosely related info below

About the btrfs errors: I was using 17.04. I asked around on the btrfs IRC channel and was told that it could be a hardware error or some kind of memory management error. A portion of a btrfs metadata page got filled with zeros, just like what I am experiencing now. I ran memtester for just a few passes, switched to ext4 and put the blame on the nvidia binary driver.

The commands and their parameters that I use:

# stressapptest -M 10000 -s 1800

10000 is the available memory (in MB) that I can test; I get it via free -m. -s is the run time in seconds.

# memtester 4096

The laptop's CPU has 2 cores, so I usually start two instances. 4096 is half of the currently available memory as reported by free -m.
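The sizing described above could be scripted roughly as follows. This is a sketch under assumptions: half_available_mb is a hypothetical helper name, and it relies on free -m printing an "available" value as the seventh field of the Mem: row, which older versions of free do not do.

```shell
# half_available_mb (hypothetical helper): reads `free -m` output on stdin
# and prints half of the "available" column of the Mem: row, in MB.
half_available_mb() {
    awk '/^Mem:/ { print int($7 / 2) }'
}

# Usage, one memtester instance per core on a 2-core CPU:
#   size=$(free -m | half_available_mb)
#   sudo memtester "$size" &
#   sudo memtester "$size" &
#   wait
```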

Best Answer

Deleted answer was close

An answer was deleted on this Q&A:

Did you already try re-installing Ubuntu? It sounds like an OS-level memory management failure.

My answer is similar, as it involves very low-level memory management: KASLR at the kernel level.

What KASLR does

KASLR stands for Kernel Address Space Layout Randomization. I've never heard it spoken out loud, but in my mind I pronounce it "Casler". Think friendly ghost in the machine. KASLR is a security measure that randomizes the memory locations where kernel code resides. The theory is that the kernel is harder to hack when an attacker can't rely on the same bit of code always being in the same memory spot.

KASLR operation could be considered the opposite of memory testers, which repeatedly read and write the same memory locations expecting NO CHANGES. Because opposites attract (idiom noticed), I did a Google search on KASLR and memory errors. One result in particular, seemingly unrelated, might deserve a message on GitHub linking to this Q&A, because they think they are the only ones affected by shifting memory addresses (if I'm reading their thread correctly). The first three hits are from RedHat, which I'm loath to link to because their websites show partial posts to get picked up by Google's search robots and then make you pay to read the rest.

There are known problems when KASLR loads kernel "stuff" into the middle of the memory map, which it isn't supposed to do. Unfortunately I can't recall the link I found last week to include in tonight's answer. The link had a patch / workaround for directing KASLR not to use specific memory locations.

After confirming known problems with KASLR and memory locations, I commented under the question suggesting disabling KASLR and rerunning the memory tests. A reply stated that this appears to be successful, so I'm posting this answer.

How to disable KASLR

Although I've been using the grub kernel command line option "kaslr" for a couple of years now, KASLR has been the kernel default since at least version 4.12. To keep KASLR from loading, edit /etc/default/grub and change the GRUB_CMDLINE_LINUX_DEFAULT line to read:

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash nokaslr"

You might have other options besides "quiet" and "splash". The important step is to add "nokaslr" and leave the other options in place.
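For those who prefer to script the change rather than edit by hand, a minimal sketch might look like this. add_nokaslr is a hypothetical helper name, and it assumes the GRUB_CMDLINE_LINUX_DEFAULT line is a single double-quoted value as shown above. Trying it on a copy of the file first seems prudent.

```shell
# add_nokaslr (hypothetical helper): appends "nokaslr" to the
# GRUB_CMDLINE_LINUX_DEFAULT line of the given file, keeping other options.
add_nokaslr() {
    grub_file=$1
    # Do nothing if the option is already on that line.
    if grep -q '^GRUB_CMDLINE_LINUX_DEFAULT=.*nokaslr' "$grub_file"; then
        return 0
    fi
    # Insert " nokaslr" just before the closing double quote.
    sed 's/^\(GRUB_CMDLINE_LINUX_DEFAULT="[^"]*\)"/\1 nokaslr"/' \
        "$grub_file" > "$grub_file.tmp" && mv "$grub_file.tmp" "$grub_file"
}

# Usage (work on a copy, then compare before replacing the real file):
#   cp /etc/default/grub /tmp/grub.new
#   add_nokaslr /tmp/grub.new
#   diff /etc/default/grub /tmp/grub.new
```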

Then save the file and run:

sudo update-grub
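After rebooting, it may be worth confirming that the option actually reached the kernel. The sketch below checks a kernel command line string for "nokaslr" as a separate word; kaslr_disabled is a hypothetical helper name, and /proc/cmdline is where Linux exposes the command line of the running kernel.

```shell
# kaslr_disabled (hypothetical helper): succeeds if "nokaslr" appears as a
# separate word in the kernel command line string passed as $1.
kaslr_disabled() {
    case " $1 " in
        *" nokaslr "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage after reboot:
#   kaslr_disabled "$(cat /proc/cmdline)" && echo "nokaslr is active"
```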

Of course, another way of disabling KASLR is simply to use an older kernel, such as 4.4.0 under Ubuntu 16.04.1, where KASLR wasn't enabled automatically.
