Linux – “Northbridge Error (node 0): ECC Error in the Probe Filter directory”

ecc hardware linux-kernel

I've received an e-mail from a user worried that the following errors on one of his servers are indicative of a serious problem. The trouble is, the errors below are all that I have to go on. I usually consider myself a decent Googler, but in this case I can find only one other incident where a user encountered this error regarding the "Probe Filter directory":

[1044 snapshots @ abc]$
Message from syslogd@abc at Sep  8 02:51:51 ...
  kernel:[Hardware Error]: CPU:0 MC4_STATUS[Over|CE|MiscV|-|AddrV|-|Poison|CECC]: 0xdc0248d0001f010b

Message from syslogd@abc at Sep  8 02:51:51 ...
  kernel:[Hardware Error]:       MC4_ADDR: 0x0000000000010f40

Message from syslogd@abc at Sep  8 02:51:51 ...
  kernel:[Hardware Error]: Northbridge Error (node 0): ECC Error in the Probe Filter directory.

Message from syslogd@abc at Sep  8 02:51:51 ...
  kernel:[Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: GEN

From what I can tell, this only happened once. Grepping around the logs for other hardware errors turns up nothing other than this one incident.

The forum post I reference above simply ends by telling the user not to worry about it if it only happened once and didn't cause any fatal issues. This is the same advice I got from my colleagues, who also mentioned that there are too many variables (e.g. what was running at 2:50am on September 8th?).

However, this user wants to be reassured that nothing is wrong with their system. What can the above errors indicate or be related to? What is the "Probe Filter directory"? What tests can I run to reassure the user that this doesn't signal impending doom for their machine?

The Linux distribution of the machine is Red Hat Enterprise Linux Server release 6.4 (Santiago).

Best Answer

I don't have a precise answer, but some of this is familiar. I don't know what a Probe Filter directory is, but CptSupermrkt explained that above.

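The northbridge connects the processor to memory. ECC errors are usually associated with DRAM, although caches can be ECC-protected too, and the "cache level: L3/GEN" line in this log suggests the error was reported against the L3 cache rather than a DIMM. Error Correcting Code bits are stored along with each word: on reads they're checked, and on writes they're updated. ECC errors are either correctable or uncorrectable, depending on whether the error can be fixed using the stored check bits. The CE and CECC flags in the MC4_STATUS line indicate that this one was corrected. Even an uncorrectable error does not necessarily indicate a permanent hardware fault, but ECC errors can appear when memory starts to fail.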
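The correctable/uncorrectable distinction can be read straight from the MC4_STATUS value in the log. Here is a minimal Python sketch that decodes the flag bits. The bit positions are my assumption, based on the usual x86 machine-check status layout plus the AMD-specific ECC bits, so check them against the BKDG for the exact CPU family before relying on them. `decode_status` is just an illustrative helper, not a kernel or library function.

```python
# Sketch: decode flag bits of an MCi_STATUS value such as the one in the log.
# Bit positions are assumptions (common x86 MCA layout plus AMD ECC bits).
STATUS_BITS = {
    "Val": 63,       # register holds a valid error record
    "Over": 62,      # overflow: an earlier error was overwritten
    "UC": 61,        # uncorrected error (clear means corrected)
    "En": 60,        # error reporting was enabled
    "MiscV": 59,     # MISC register contents are valid
    "AddrV": 58,     # ADDR register contents are valid
    "PCC": 57,       # processor context corrupt
    "CECC": 46,      # correctable ECC error
    "UECC": 45,      # uncorrectable ECC error
    "Deferred": 44,  # deferred error
    "Poison": 43,    # poisoned data
}


def decode_status(status: int) -> dict:
    """Return a dict of flag name -> bool for a 64-bit status value."""
    flags = {name: bool((status >> bit) & 1) for name, bit in STATUS_BITS.items()}
    # Convenience field: a valid record whose UC bit is clear was corrected.
    flags["corrected"] = flags["Val"] and not flags["UC"]
    return flags


flags = decode_status(0xdc0248d0001f010b)
print(sorted(name for name, is_set in flags.items() if is_set))
```

For the value above, UC and UECC come out clear while CECC is set, which agrees with the "CE" and "CECC" fields the kernel already printed. The "Over" flag is worth noting too: it suggests at least one earlier error record was overwritten before it was logged.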

Given all that, this looks like a transient error. You might try a complete memory test, but that's not likely to find anything. If the DRAM has failed, your only corrective action is to replace it.
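To back up the "transient" reading, it helps to confirm that the error really is a one-off. This is a rough Python sketch that counts "[Hardware Error]" messages in syslog-style lines. The helper name `count_hardware_errors` and the unrelated sample line are made up for illustration, and the pattern only matches the message format shown in the question.

```python
import re
from collections import Counter

# Matches lines like "kernel:[Hardware Error]: MC4_ADDR: 0x..."
HW_ERROR = re.compile(r"\[Hardware Error\]:\s*(.*)")


def count_hardware_errors(lines):
    """Count each distinct hardware-error message in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = HW_ERROR.search(line)
        if match:
            counts[match.group(1).strip()] += 1
    return counts


# Two lines taken from the log in the question, plus one hypothetical
# unrelated line that should be ignored.
sample = [
    "  kernel:[Hardware Error]:       MC4_ADDR: 0x0000000000010f40",
    "  kernel:[Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: GEN",
    "  kernel: some unrelated message",
]
counts = count_hardware_errors(sample)
for message, n in counts.most_common():
    print(n, message)
```

In practice you would pass an open log file instead of `sample`. On RHEL that is usually /var/log/messages and its rotated copies, though that path is an assumption about this particular machine. If the same message keeps showing up over days or weeks, that points away from a transient event.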
