How to Enable and Verify ECC RAM Scrubbing in Linux

ecc linux-kernel memory

I bought my first system with ECC RAM and am trying to learn what it offers for alerting and maintenance under Linux. To be specific, this is Debian Linux on a Supermicro H8SGL motherboard with an AMD Opteron 6386 SE CPU and Samsung M393B2G70QH0-YK0 DDR3 ECC RAM.

I have learnt that it is possible to scrub ECC RAM, which sounds like an excellent idea. ECC RAM can normally correct 1-bit errors and detect 2-bit errors. Scrubbing periodically reads RAM so that 1-bit errors are corrected before a second flipped bit turns them into uncorrectable 2-bit errors.

I also learnt that Linux supports this, but I'm having problems using it, so I need some help getting started and figuring out the settings.

Linux EDAC driver

From what I understand, Linux handles ECC RAM through a subsystem called EDAC, and its controls are exposed under /sys/devices/system/edac/. I can see my two memory controllers here (this is a 2-node NUMA system):

# ls /sys/devices/system/edac/mc/
mc0  mc1  power  subsystem  uevent

I can also see that the EDAC drivers are somehow loaded:

# edac-util --status
edac-util: EDAC drivers are loaded. 2 MCs detected
# lsmod | grep edac
amd64_edac_mod         36864  0
edac_mce_amd           28672  1 amd64_edac_mod
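
The controllers can also be enumerated straight from sysfs, without edac-util. This is a minimal sketch assuming the directory layout shown above; the function name list_mcs and its optional directory argument are illustrative and not part of any EDAC tool:

```shell
#!/bin/sh
# list_mcs [DIR]: print the name of each EDAC memory controller found in DIR.
# DIR defaults to the sysfs location used in the question; the optional
# argument is only an illustrative hook for pointing at another directory.
list_mcs() {
    base="${1:-/sys/devices/system/edac/mc}"
    for mc in "$base"/mc[0-9]*; do
        # Skip the unexpanded glob when no controller is present.
        [ -d "$mc" ] || continue
        printf '%s\n' "${mc##*/}"
    done
}

# Example: count the controllers the kernel has registered.
# list_mcs | wc -l
```

On the system described here, list_mcs would be expected to print mc0 and mc1, matching the "2 MCs detected" line from edac-util.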

Now I want to enable scrubbing. According to the Linux ABI documentation, the scrub rate is exposed through the /sys/devices/system/edac/mc/mc*/sdram_scrub_rate file, documented as follows:

The scrubbing rate used by the memory controller is set by
writing a minimum bandwidth in bytes/sec to the attribute file.
The rate will be translated to an internal value that gives at
least the specified rate.
Reading the file will return the actual scrubbing rate employed.
If configuration fails or memory scrubbing is not implemented,
the value of the attribute file will be -1.

But nothing happens when I do this. Writing a sensible value (a mid-range rate, judging by the driver source and the CPU documentation) to the file appears to succeed, but reading it back always returns 0:

# cat /sys/devices/system/edac/mc/mc0/sdram_scrub_rate
0
# echo 1000000 >/sys/devices/system/edac/mc/mc0/sdram_scrub_rate
# echo $?
0
# cat /sys/devices/system/edac/mc/mc0/sdram_scrub_rate
0
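
The write-then-read sequence above can be wrapped in a small helper that classifies the read-back value according to the ABI text quoted earlier. This is only a sketch: the function name set_scrub_rate is made up for illustration, and treating a reading of 0 as a failure is an assumption (0 usually means scrubbing is off, but it can also mean the readout itself is not working):

```shell
#!/bin/sh
# set_scrub_rate FILE RATE: write RATE (bytes/sec) to FILE, read it back,
# and print the value the kernel reports.
# Returns 0 if a positive rate is reported; returns 1 if the write fails
# or the file reads back -1 or 0.
set_scrub_rate() {
    file=$1
    rate=$2
    if ! printf '%s\n' "$rate" > "$file"; then
        echo "write to $file failed" >&2
        return 1
    fi
    actual=$(cat "$file")
    echo "$actual"
    case $actual in
        -1) echo "configuration failed or scrubbing not implemented" >&2
            return 1 ;;
        0)  echo "reported rate is 0 (assumed: scrubbing off or readout broken)" >&2
            return 1 ;;
        *)  return 0 ;;
    esac
}

# Example (as root):
# set_scrub_rate /sys/devices/system/edac/mc/mc0/sdram_scrub_rate 1000000
```

Because the kernel may round the request to a rate the hardware supports, the helper only checks that the reported value is positive, not that it equals the requested one.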

After digging this deep, what am I missing?

BIOS ECC Configuration

I have also tried different settings in the BIOS. There is a BIOS option for ECC configuration, but none of its settings has any effect on the scrub rate visible from Linux:

[Screenshot: BIOS ECC configuration options]

Right now I'm trying the User setting, but I really can't see any difference between the settings.

Best Answer

This is a kernel bug

This is exactly how one controls the settings, but there is a bug in the kernel that causes the readout from the hardware to always return 0 for this CPU.

A patch to fix it has been queued, but I do not know when it will reach the mainline kernel. I may update the answer when it does.

With the patch applied, the output from the commands used in the question is then:

# echo 1000000 >/sys/devices/system/edac/mc/mc0/sdram_scrub_rate
# echo $?
0
# cat /sys/devices/system/edac/mc/mc0/sdram_scrub_rate
781440

Here 781440 is the number of bytes scrubbed per second by memory controller mc0: the requested 1000000, quantized to the nearest rate the hardware supports.
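
To get a feel for what such a rate means in practice, the time for one full scrub pass can be estimated by dividing the memory behind a controller by the reported rate. A rough sketch follows; the function name scrub_pass_seconds is illustrative, and the 16 GiB figure in the example is a hypothetical size, not a value taken from this system:

```shell
#!/bin/sh
# scrub_pass_seconds BYTES RATE: seconds (rounded up) needed for one full
# scrub pass over BYTES of memory at RATE bytes/sec.
scrub_pass_seconds() {
    bytes=$1
    rate=$2
    if [ "$rate" -le 0 ]; then
        echo "rate must be positive" >&2
        return 1
    fi
    # Integer ceiling division.
    echo $(( (bytes + rate - 1) / rate ))
}

# Example with a hypothetical 16 GiB behind mc0:
# scrub_pass_seconds 17179869184 781440
```

For that hypothetical 16 GiB, the example prints 21985, or roughly 6.1 hours per pass. The actual size handled by each controller should be checked on the machine itself, for example via the size_mb attribute in the same mc* directory if the kernel provides it.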
