Debian – How to test hardware components to find out which one is bad

debian, hardware, samba

Question

How do I test hardware components to find out which one is bad?

Details

I have an old machine running Debian as a file server using Samba. The other day I was unable to log in to my file server. When I looked at the screen on my Debian server, this is what I saw:

[Screenshot of the server console showing the error message]

It says it's a hardware error and kinda looks like it's a bad CPU. However, I don't want to run out and buy a new CPU, because I really have no idea what I am talking about.

Here is what I have done:

  • I tested the memory using Memtest86+ for 66 hours straight. It completed 65 passes and found 0 errors, so I think bad memory is out of the question. However, I was kinda curious why the machine didn't crash during those 66 hours if there was some other hardware fault.
  • I noticed the message said java Tainted, so I thought it might be a Java issue. I disabled the CrashPlan backup service, since it uses Java. The server ran great for 4 days (usually it crashed every 15-30 minutes). While CrashPlan was off, two computers connected to the server, pulled 50 GB of HD video, encoded it, and wrote it back to the server's hard drives without any issues. Then a day later it crashed again.

Should I just assume it's a CPU issue since it mentions that?

How do I test hardware components to find out which one is bad?

Best Answer

If your hardware is from a big vendor such as HP or Dell, they might have specific tools for what you're looking for. I used to work with HP, and they have tools for reporting bad hardware.

If that's not the case, it will be trickier (in my experience). You made a good start by testing the memory, since it tends to be a common point of failure.

Now, if you suspect your CPUs, you could put them under an intensive job such as compiling a kernel or another large source tree like LibreOffice or Xorg. If you have several CPUs, you can use CPU affinity to pin the job to one CPU at a time.
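As a rough illustration of the CPU affinity idea, here is a minimal Python sketch (Linux only, since it relies on os.sched_setaffinity). It pins the process to each allowed CPU in turn, runs the same deterministic workload, and reports any CPU whose result differs from the majority. The function names hash_chain and find_suspect_cpus and the rounds parameter are made up for this example. Treat it as a sketch only: a short run like this proves very little, and a real machine check error may crash the box rather than produce a wrong result, so a long kernel compile is still the heavier test.

```python
import hashlib
import os
from collections import Counter


def hash_chain(rounds):
    """Deterministic CPU-bound work: a healthy core always yields the same digest."""
    digest = b"seed"
    for _ in range(rounds):
        digest = hashlib.sha256(digest).digest()
    return digest.hex()


def find_suspect_cpus(rounds=200_000):
    """Run the same workload pinned to each allowed CPU, one at a time.

    Returns the CPUs whose result differs from the majority result.
    With a single CPU there is nothing to compare against.
    """
    original = os.sched_getaffinity(0)       # CPUs this process may run on (Linux only)
    results = {}
    try:
        for cpu in sorted(original):
            os.sched_setaffinity(0, {cpu})   # pin this process to one CPU
            results[cpu] = hash_chain(rounds)
    finally:
        os.sched_setaffinity(0, original)    # always restore the original affinity
    majority, _ = Counter(results.values()).most_common(1)[0]
    return [cpu for cpu, digest in results.items() if digest != majority]
```

Calling find_suspect_cpus() returns an empty list when every CPU agrees. Raising rounds makes each CPU work longer, which matters because intermittent faults often only show up under sustained load or heat.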

Also, the error message suggests running "mcelog --ascii". You could do that and try to understand the output. Please read both links below; I hope they help with your hardware problem:

http://mcelog.org/faq.html#5

http://www.gentoo.org/doc/en/articles/hardware-stability-p1.xml
