What to do with the output of memtester when it shows bad memory

memory ram

Memtester produced the following output:

memtester version 4.3.0 (64-bit)
Copyright (C) 2001-2012 Charles Cazabon.
Licensed under the GNU General Public License version 2 (only).

pagesize is 4096
pagesizemask is 0xfffffffffffff000
want 10240MB (10737418240 bytes)
got  10240MB (10737418240 bytes), trying mlock ...locked.
Loop 1/1:
  Stuck Address       : testing   1FAILURE: possible bad address line at offset 0x12325b7a8.
Skipping to next test...
  Random Value        : ok
FAILURE: 0xa003776ad640ac0c != 0xe003776ad640ac0c at offset 0x7a4f2680.
  Compare XOR         : FAILURE: 0xe7139f89d94112c0 != 0x27139f89d94112c0 at offset 0x7a4f2680.
FAILURE: 0x4e53ee3a9704bdf5 != 0x4a53ee3a9704bdf5 at offset 0x950b4930.
  Compare SUB         : FAILURE: 0x96ecab120464e9c0 != 0xd6ecab120464e9c0 at offset 0x7a4f2680.
FAILURE: 0x7f67022cef637b99 != 0x2b67022cef637b99 at offset 0x950b4930.
FAILURE: 0x96c38c9f6e6dd229 != 0xd6c38c9f6e6dd229 at offset 0xe40d2b50.
  Compare MUL         : FAILURE: 0x00000001 != 0x00000002 at offset 0x69394a08.
FAILURE: 0x00000001 != 0x00000000 at offset 0x950b4930.
FAILURE: 0x400000000000001 != 0x00000001 at offset 0xea6b07a8.
FAILURE: 0x400000000000000 != 0x00000000 at offset 0xfb853610.
FAILURE: 0x00000000 != 0x800000000000000 at offset 0x12bf3ed10.
  Compare DIV         : FAILURE: 0x777fd9f1ddc6c1cd != 0x777fd9f1ddc6c1cf at offset 0x69394a08.
FAILURE: 0x777fd9f1ddc6c1cd != 0x7f7fd9f1ddc6c1cd at offset 0x12bf3ed10.
  Compare OR          : FAILURE: 0x367600d19dc6c040 != 0x367600d19dc6c042 at offset 0x69394a08.
FAILURE: 0x367600d19dc6c040 != 0x767600d19dc6c040 at offset 0x7a4f2680.
FAILURE: 0x367600d19dc6c040 != 0x3e7600d19dc6c040 at offset 0x12bf3ed10.
  Compare AND         :   Sequential Increment: ok
  Solid Bits          : testing   0FAILURE: 0x4000000000000000 != 0x00000000 at offset 0x12325b7a8.
  Block Sequential    : testing   0FAILURE: 0x400000000000000 != 0x00000000 at offset 0xfb853610.
  Checkerboard        : testing   1FAILURE: 0xaaaaaaaaaaaaaaaa != 0xeaaaaaaaaaaaaaaa at offset 0x7a4f2680.
  Bit Spread          : testing   1FAILURE: 0xdffffffffffffff5 != 0xfffffffffffffff5 at offset 0x102e353e8.
  Bit Flip            : testing   0FAILURE: 0x4000000000000001 != 0x00000001 at offset 0x12325b7a8.
  Walking Ones        : testing  40FAILURE: 0xdffffeffffffffff != 0xfffffeffffffffff at offset 0x102e353e8.
  Walking Zeroes      : testing   0FAILURE: 0x400000000000001 != 0x00000001 at offset 0xea6b07a8.
FAILURE: 0x400000000000001 != 0x00000001 at offset 0xfb853610.
  8-bit Writes        : -FAILURE: 0xfeefa0a577dfa825 != 0xdeefa0a577dfa825 at offset 0x4bd600e8.
  16-bit Writes       : -FAILURE: 0xf3dfa5fff79e950b != 0xf7dfa5fff79e950b at offset 0x2b04cca8.
FAILURE: 0x3ffb3fc56e7532c1 != 0x7ffb3fc56e7532c1 at offset 0xe40d2b50.

Done.

Clearly this shows bad memory. Is it possible to mark this memory as bad in the kernel or hypervisor and keep using the module? Or should I put it in File 13 and buy a replacement?

Best Answer

Unless you can detect errors reasonably quickly, e.g. with ECC memory or by rebooting regularly with memtest, it’s better to replace the module. You risk silent data corruption.

You can tell the kernel to ignore memory by reserving it, with the memmap option (see the kernel documentation for details):

memmap=nn[KMG]$ss[KMG]

[KNL,ACPI] Mark specific memory as reserved. Region of memory to be reserved is from ss to ss+nn.

Example: Exclude memory from 0x18690000-0x1869ffff

memmap=64K$0x18690000

or

memmap=0x10000$0x18690000

Some bootloaders may need an escape character before '$', like Grub2, otherwise '$' and the following number will be eaten.
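As a rough illustration of the format above, here is a minimal Python sketch that builds a memmap argument from a physical address range. The function name `memmap_arg` and the `dollar` parameter are made up for this example; widening the range to whole 4 KiB pages is an assumption on my part, not something the kernel documentation requires.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages, as memtester reports above


def memmap_arg(start, end_inclusive, page_size=PAGE_SIZE, dollar="$"):
    """Build a memmap=nn$ss argument covering start..end_inclusive.

    The range is widened to whole pages. `dollar` is written in place of
    '$', so a bootloader-specific escaped form can be passed in.
    """
    if end_inclusive < start:
        raise ValueError("end must not be below start")
    aligned_start = start - (start % page_size)               # round down
    end_exclusive = end_inclusive + 1
    aligned_end = -(-end_exclusive // page_size) * page_size  # round up
    size = aligned_end - aligned_start
    return f"memmap={size:#x}{dollar}{aligned_start:#x}"


# The example range from the kernel documentation.
print(memmap_arg(0x18690000, 0x1869FFFF))
# Same range with one backslash before '$' (hypothetical escaping).
print(memmap_arg(0x18690000, 0x1869FFFF, dollar="\\$"))
```

The first call prints `memmap=0x10000$0x18690000`, matching the documentation example. How many escape characters a given bootloader actually needs depends on how its configuration is processed, so check the resulting `/proc/cmdline` after rebooting.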

The difficult part is figuring out which address ranges to reserve: memtester reports addresses from its own virtual address space, which don’t correspond to the physical addresses that memmap needs.

The simplest approach is to boot with the kernel’s memtest parameter; the kernel log will then show something like this:

4c494e5558726c7a bad mem addr 0x000000012f9eaa78 - 0x000000012f9eaa80 reserved
4c494e5558726c7a bad mem addr 0x00000001b86fe928 - 0x00000001b86fe930 reserved
0x000000012f9eaa80 - 0x00000001b86fe928 pattern 4c494e5558726c7a

The kernel then reserves the ranges it detects as bad. You can keep booting with memtest, or use the reported address ranges to construct memmap arguments instead.
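If you go the memmap route, turning the reported ranges into arguments can be scripted. The following is only a sketch: the function name `memmap_args_from_log` is hypothetical, the regular expression is written against the sample lines shown above, and it assumes the second address on each line is an exclusive end and that reserving whole 4 KiB pages is acceptable.

```python
import re

PAGE_SIZE = 4096  # assumption: 4 KiB pages

# Matches lines such as:
#   <pattern> bad mem addr 0x... - 0x... reserved
BAD_RANGE = re.compile(
    r"bad mem addr (0x[0-9a-fA-F]+) - (0x[0-9a-fA-F]+) reserved"
)


def memmap_args_from_log(log_text, page_size=PAGE_SIZE):
    """Return one memmap= argument per distinct bad page range in the log."""
    args = []
    for match in BAD_RANGE.finditer(log_text):
        start = int(match.group(1), 16)
        end = int(match.group(2), 16)  # assumed to be an exclusive end
        aligned_start = start // page_size * page_size
        aligned_end = -(-end // page_size) * page_size  # round up
        # Always reserve at least one page.
        aligned_end = max(aligned_end, aligned_start + page_size)
        arg = f"memmap={aligned_end - aligned_start:#x}${aligned_start:#x}"
        if arg not in args:  # several bad words may share a page
            args.append(arg)
    return args


sample = """\
4c494e5558726c7a bad mem addr 0x000000012f9eaa78 - 0x000000012f9eaa80 reserved
4c494e5558726c7a bad mem addr 0x00000001b86fe928 - 0x00000001b86fe930 reserved
0x000000012f9eaa80 - 0x00000001b86fe928 pattern 4c494e5558726c7a
"""

for arg in memmap_args_from_log(sample):
    print(arg)
```

For the sample above this prints `memmap=0x1000$0x12f9ea000` and `memmap=0x1000$0x1b86fe000`. In practice you would feed it the kernel log from several boots, since a marginal cell may not fail on every pass.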