Memory Issues – ECC Errors Without MemTest Errors Explained

ecc, memory, memtest86+, motherboard

I recently purchased an HP XW6400 workstation (dual CPU, quad memory channels). Along with the computer I purchased two Xeon 5160 CPUs and two sticks of RAM that are the same brand and look alike but are not a matched pair (the numbers and secondary stickers don't match, although they were supposed to). After putting it all together I got regular ECC corrections reported at startup, so I bought more RAM sticks that were matched; after installing that second set of memory I got the same errors. So I bought a replacement motherboard, and I still got the same errors. The memory controller is not integrated into the processor, so I haven't paid too much attention to the CPUs. I ran memtest for a quick 2-hour run on each stick individually and no errors came up on any of the sticks. But I still get ECC corrections on many reboots. Sometimes it reports that it corrected the errors; other times they are fatal uncorrectable errors.

The sticks are warm little things, so I flipped the fan right above them so that it blows onto them. The northbridge is cooled by a fan as well. Temperatures reported by the hardware monitor all seem normal.

Further, if I put all 4 sticks in, the machine locks up within minutes of starting almost every time. With 2 sticks, by contrast, it almost never locks up (I used it that way for 2 weeks before I bought a new board); it just reports the ECC corrections or errors on reboot.

All memory is DDR2 5300F fully buffered ECC memory.

The first set is HP memory. Going by the numbers and stickers, the two sticks are not a matched pair, although at first glance they look the same and most of the numbers are equivalent. They were manufactured in different parts of the world (Singapore and Puerto Rico).

The second set is Kingston memory, and it is a matched pair.

My hypothesis is that the Kingston memory is having compatibility issues in dual channel mode, that the HP memory is not a matched set, which also causes issues in dual channel mode, and that all four together are a compatibility nightmare for quad channel, so the machine locks up. But really, I am just stabbing in the dark. Any ideas?

Best Answer

I think there was a bad BIOS and a bad CPU working in conjunction with each other, and I think the memory, while not ideal, is not really the major issue. Hence the "stabbing in the dark" comment.

In the past, I had an intermittent CPU Front Side Bus (FSB) error that I was attributing to memory or motherboard issues. I found an HP document that says the original BIOS has known issues and recommends updating, so I updated the BIOS.

After that, things ran a little better, in that I could run with all 4 sticks of memory without crashing. Earlier, during all the fiddling, I had deliberately swapped the CPUs between sockets in case the FSB error came up again. So next I tried troubleshooting the CPUs by running a multitask "test" from PassMark, which wrote to the memory, calculated prime numbers, and ran the Dhrystone and Whetstone tests all at the same time. It very quickly BSOD'd the computer, which would not simply restart. When it finally did restart (after I had a hell of a time getting it to), it gave me a new CPU error message: a front side bus error plus an additional FSB sub-error, on the same CPU that had shown the FSB error in the past (now in a different socket). On top of that, the computer would freeze while I was poking around in the BIOS, and I could not get it to boot into Windows. So I removed the suspected bad CPU and restarted, which worked, then ran the same test again, but for longer. No crash, no errors (yet), and everything seems stable so far.
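For anyone who wants to reproduce the idea of loading memory and CPU at the same time, here is a minimal Python sketch. It is not the PassMark test and is no substitute for memtest or a proper burn-in tool; the function names (`count_primes`, `memory_pattern_errors`, `run_combined_load`), the buffer size, and the round count are all made up for illustration. It only shows the principle: run known-answer CPU work alongside memory fill-and-verify passes and flag any result that comes back wrong.

```python
import concurrent.futures
import os
from typing import Optional

# Known value: there are 9592 primes below 100,000.
PRIME_LIMIT = 100_000
PRIMES_BELOW_LIMIT = 9592


def count_primes(limit: int) -> int:
    """CPU load: count the primes below `limit` with a simple sieve."""
    if limit < 2:
        return 0
    sieve = bytearray([1]) * limit
    sieve[0] = sieve[1] = 0
    for n in range(2, int(limit ** 0.5) + 1):
        if sieve[n]:
            # Clear every multiple of n starting at n*n.
            sieve[n * n::n] = bytes(len(range(n * n, limit, n)))
    return sum(sieve)


def memory_pattern_errors(size: int, patterns=(0x00, 0xFF, 0xAA, 0x55)) -> int:
    """Memory load: fill a buffer with each byte pattern, read it back,
    and return how many bytes did not match."""
    errors = 0
    for pattern in patterns:
        buf = bytearray([pattern]) * size
        errors += size - buf.count(pattern)
    return errors


def run_combined_load(rounds: int = 4, workers: Optional[int] = None) -> dict:
    """Run the prime and memory jobs at the same time in separate processes."""
    workers = workers or os.cpu_count() or 1
    totals = {"prime_mismatches": 0, "memory_errors": 0}
    with concurrent.futures.ProcessPoolExecutor(max_workers=workers) as pool:
        prime_jobs = [pool.submit(count_primes, PRIME_LIMIT) for _ in range(rounds)]
        mem_jobs = [pool.submit(memory_pattern_errors, 16 * 1024 * 1024)
                    for _ in range(rounds)]
        for job in prime_jobs:
            if job.result() != PRIMES_BELOW_LIMIT:
                totals["prime_mismatches"] += 1
        for job in mem_jobs:
            totals["memory_errors"] += job.result()
    return totals


if __name__ == "__main__":
    print(run_combined_load())
```

On healthy hardware both counters should stay at zero. A flaky CPU or bus is more likely to show up as a crash or hang than as a neat nonzero count, which is what happened to me.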

Sometimes you win with used stuff, sometimes you lose. Given how much time this has all wasted, I think this officially counts as one of the losing moments. Let's just hope that's it for problems.
