RAM tests inconsistently – what is the most likely culprit? (i.e. what should I spend money on replacing)

hardware-failure memory

  • Motherboard: GA-B85M-DS3H-A
  • CPU: Core i5 4430
  • RAM: PNY XLR8 DDR3 32GB (4x8GB) 1600MHz (MD32768K4D3-1600-X9)
  • PSU: EVGA 500 W1 80+

The Problem

With all 32GB of RAM installed, the system consistently fails MemTest86+ 6.2. The failure always occurs during the first pass, and the error count quickly rises into the millions. Attempting to run Windows results in random reboots and Stop errors (as would be expected with RAM errors).

What I've Tried

  • Test a single 8GB PNY module in socket DIMM1. Successfully completes 4 passes of MemTest.
  • Test a single 8GB PNY module in socket DIMM2. Successfully completes 4 passes of MemTest.
  • Test a single 8GB PNY module in socket DIMM3. Successfully completes 4 passes of MemTest.
  • Test a single 8GB PNY module in socket DIMM4. Successfully completes 4 passes of MemTest.
  • Test each of the four 8GB PNY DIMMs individually in socket DIMM1. All modules successfully complete 4 passes of MemTest.
  • Test two 8GB PNY modules in sockets DIMM1 and DIMM2. Successfully completes 4 passes of MemTest.
  • Test two 8GB PNY modules in sockets DIMM3 and DIMM4. Successfully completes 4 passes of MemTest.
  • Test the motherboard with four 2GB known-good DIMMs in all sockets. Successfully completes 4 passes of MemTest.
  • Swap the ordering of the PNY DIMMs in the sockets. No change – MemTest errors still occur.
  • Raise the motherboard RAM voltage from 1.5V to 1.65V. No change – MemTest errors still occur.
  • Play with various combinations of the RAM manual settings in the setup utility – enabling/disabling XMP profile, setting "increased stability" preset, etc. No change, MemTest errors still occur.

I think I can safely rule out bad RAM and bad RAM sockets. The only time the MemTest tests fail is if all four 8GB modules are installed simultaneously.

I've measured voltages coming off the PSU and everything there appears stable even with all four sticks installed.

As I write this, I have tried a last resort option of manually reducing the RAM speed to 1066MHz in the BIOS. So far, MemTest has completed one pass and is on its second with no errors. (All the above tests were performed at the native 1600MHz RAM speed.) This may allow me to use the system, albeit with slightly slower RAM speeds, but this does not seem to be a permanent fix.

Whenever MemTest errors occur, they always affect the same bit positions in the 64-bit data word:

Bit Error Mask: 00000000FF000000
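
For reference, the mask above can be decoded with a few lines of Python. This is only a sketch: the helper names are made up for illustration, and it assumes the mask is a bitmap of the data bits that mismatched at least once, with bit 0 as the least significant bit.

```python
def failing_bits(mask_hex: str) -> list[int]:
    """Return the bit positions (0 = least significant) set in a hex error mask."""
    mask = int(mask_hex, 16)
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


def failing_byte_lanes(mask_hex: str) -> list[int]:
    """Return the 8-bit byte lanes (0 = lowest byte) containing at least one set bit."""
    return sorted({bit // 8 for bit in failing_bits(mask_hex)})


mask = "00000000FF000000"
print(failing_bits(mask))        # [24, 25, 26, 27, 28, 29, 30, 31]
print(failing_byte_lanes(mask))  # [3]
```

Under that assumption, every error falls in a single byte lane (bits 24 to 31) of the data word.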

Additionally, errors NEVER occur below the 4GB barrier. In other words, all errors occur in the address space between 4GB and 32GB.

I suspect this is some sort of strange interaction or timing problem between the CPU, the RAM, and the motherboard, since the errors are very consistent, only occur in one specific configuration, appear to be mitigated by slowing down the RAM, and only occur above the 4GB barrier. My question is: Is it more likely that my CPU or my motherboard is the culprit?

I have been intending to upgrade this machine to a Core i7-4790K. If the CPU is the likely culprit (I know that the memory controller is on the CPU in these newer models), that works out well, since I am planning to replace it anyway. But I'm wondering whether the motherboard itself might also be part of the problem; I would not want to spend the money on the i7 CPU only to experience the exact same problem and find out I also have to replace the motherboard.

Advice?


EDIT: The slower RAM speed still produced errors, but only once the test reached the third pass. I restarted the test with only one CPU core active, to check for an interaction within the CPU itself.

Best Answer

It doesn't sound like any component is defective; rather, you are using an incompatible combination.

Having multiple sockets on the same memory bus populated increases the capacitance on each data line and slows down the rise time, which can cause transitions to arrive late and be misdetected. This phenomenon is known to electrical engineers as "fan-out".
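
To get a feel for the scale involved, here is a rough sketch that compares the time available per bit with the rise time of a simple lumped RC model. The driver resistance (40 ohms) and per-load capacitance (2 pF) are hypothetical values picked only for illustration, not measurements of this board or these modules, and a real memory bus behaves as a transmission line rather than a single RC.

```python
def unit_interval_ps(transfers_per_second: float) -> float:
    """Time available for one bit on a data line, in picoseconds."""
    return 1e12 / transfers_per_second


def rise_time_ps(r_ohms: float, c_per_load_pf: float, loads: int) -> float:
    """10%-90% rise time of a first-order RC circuit, about 2.2 * R * C.

    Ohms times picofarads gives picoseconds directly.
    """
    return 2.2 * r_ohms * (c_per_load_pf * loads)


R_DRIVER_OHMS = 40.0   # hypothetical, for illustration only
C_PER_LOAD_PF = 2.0    # hypothetical, for illustration only

for rate in (1600e6, 1066.67e6):  # DDR3-1600 and DDR3-1066 transfer rates
    print(f"{rate / 1e6:.0f} MT/s: {unit_interval_ps(rate):.1f} ps per bit")

for loads in (1, 2, 4):
    t = rise_time_ps(R_DRIVER_OHMS, C_PER_LOAD_PF, loads)
    print(f"{loads} load(s): about {t:.0f} ps rise time")
```

The absolute numbers mean little, but the trend is the point: the rise time grows with every added load while the bit time stays fixed, so a bus that has margin with one or two loads can run out of it with four.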

This is further complicated because of the fan-out internal to a memory module. The number and topology of the DRAM devices on the module, called "rank", will affect how many modules you can successfully connect in parallel.

Server motherboards supporting a lot of memory sockets actually require buffered memory, which uses a cascading network of buffers to limit the fan-out (and therefore capacitance) seen by each driver. There's delay caused by the buffers themselves, but it only increases logarithmically with the number of loads, whereas for unbuffered memory the capacitance increases linearly.
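
The linear-versus-logarithmic difference can be sketched in a few lines. The function names are illustrative, and the rank counts are assumptions: the question does not state the rank of either memory kit, so "dual rank" for the 8GB modules and "single rank" for the 2GB modules are guesses. The board is dual channel with four sockets, so two DIMMs share each channel.

```python
def unbuffered_loads(dimms_per_channel: int, ranks_per_dimm: int) -> int:
    """Rank loads hanging on each data line of one unbuffered channel."""
    return dimms_per_channel * ranks_per_dimm


def buffer_stages(total_loads: int, fanout_per_buffer: int) -> int:
    """Levels needed in a buffer tree where each buffer drives at most
    fanout_per_buffer loads. Grows logarithmically with total_loads."""
    stages, reach = 0, 1
    while reach < total_loads:
        reach *= fanout_per_buffer
        stages += 1
    return stages


# Assumed: 8GB modules are dual rank, 2GB modules are single rank.
print(unbuffered_loads(dimms_per_channel=2, ranks_per_dimm=2))  # 4
print(unbuffered_loads(dimms_per_channel=2, ranks_per_dimm=1))  # 2
print(unbuffered_loads(dimms_per_channel=1, ranks_per_dimm=2))  # 2

# Buffered case: stages grow slowly as the number of loads multiplies.
for loads in (4, 16, 64):
    print(loads, buffer_stages(loads, fanout_per_buffer=4))
```

If those rank assumptions hold, the failing configuration is the only one tested that puts four rank loads on each channel; every passing configuration puts at most two.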

Wikipedia discusses this: https://en.wikipedia.org/wiki/Memory_rank

Some motherboard manuals actually call this sort of thing out. For others you can deduce the information from the RAM compatibility lists. As an example, the ASUS Z170-A motherboard shows that dual rank (called DS = double sided in the manual) can only be used in two slots at once on that board, as opposed to the ability to use four single rank DIMMs at once.
