Why are the hard drives failing?

hard-drive, troubleshooting

I have a small Ubuntu server running at home, with 2 hard drives. There are two software raids (raid1) on the disks, managed by mdadm,
which I believe is irrelevant, but I'm mentioning it anyway.

Both of the hard drives are Western Digital, and had been in use for around 2 years when one of them started making clicking noises and died.
I figured that maybe it's natural after 2 years, so I bought a new one and resynced the raid arrays. After about a month, the other drive also died.
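
Resyncing after a replacement, for reference, is roughly the following (a sketch; the device names are illustrative, assuming /dev/sdb is the replaced disk and /dev/sda is the surviving one):

# copy the partition layout from the surviving disk to the new one
sfdisk -d /dev/sda | sfdisk /dev/sdb

# add the new partitions back into the arrays; mdadm resyncs them in the background
mdadm /dev/md0 --add /dev/sdb1
mdadm /dev/md1 --add /dev/sdb2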

I didn't get suspicious: since both drives were bought at the same time, it's not that surprising to see them fail close to each other, so I bought
another one.

So far: 2 old drives failed, and 2 brand new ones in the system. After one month, one of the new drives died. This is when things started to get suspicious.
Since the PC was put together from some really old parts (think AthlonXP), I figured that maybe the motherboard's SATA controller was the culprit.
Of course you can't switch parts easily in an old PC like this, so I bought a whole new system: new MB, new CPU, new RAM. I took the just-failed drive back,
since it was under warranty, and got it replaced.

So the count is now 2 failed drives from the old ones, and 1 failed drive from the new ones.
No problems for 1 month. After that, errors were creeping up again in /var/log/messages, and mdadm was reporting raid array failures. I started tearing my
hair out. Everything in the system is new; it's up to the third brand new hard drive, and it's simply not possible that all of the new drives I bought were faulty.
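
(When the errors start, this is roughly how I check the arrays; illustrative, assuming the arrays are /dev/md0 and /dev/md1:)

# quick overview of all arrays; failed members show up marked with (F)
cat /proc/mdstat

# detailed state of one array, including which member is marked faulty
mdadm --detail /dev/md0

# the kernel messages that accompany a failing member
grep -iE 'ata[0-9]|md[0-9]' /var/log/messages | tail -n 50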

Let's see what is still common… the cables. Okay, long shot, let's replace the SATA cables. Take the hard drive back, smile at the guy at the counter and say that
I'm really unlucky. He replaces the hard drive. I come home, one month passes, and one of the hard drives fails again. I'm not joking.

Two of the brand new hard drives have failed.
Maybe it's a bug in the OS. Let's see what the manufacturer's testing tool says. Download the testing tool, burn it to a CD, reboot, leave the hard drive testing overnight.
The test says that the drive is faulty and that I should back up everything, if I still can. I don't know what's happening, but it does not look like a software problem;
something is definitely thrashing the hard drives.
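
(Besides the vendor tool, the drive's own SMART data can be read from Linux with smartctl from the smartmontools package; this is just a sketch, and the attribute names vary between drives:)

# overall health verdict reported by the drive's firmware
smartctl -H /dev/sda

# raw attributes; Reallocated_Sector_Ct and Current_Pending_Sector climbing
# over time are the usual signs that the drive really is dying
smartctl -A /dev/sda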

I should mention now that the whole system is in a shoebox. Since there's a load of "build your own Ikea case" stuff out there, I thought there shouldn't be any
problem throwing the thing in a box and stuffing it away somewhere. The box is well ventilated, but I thought that just maybe the drives were overheating;
there was no other possible answer. So I took the hard drive back, got it replaced (for the 3rd time), and bought hard drive coolers.

And just now, I have heard the sound of doom. click click whizzzzzzzzz.
SSH into the box:

You have new mail!
mail
r 1
DegradedArrayEvent on /dev/md0 ...

dmesg output:

[47128.000051] ata3: lost interrupt (Status 0x50)
[47128.000097] end_request: I/O error, dev sda, sector 58588863
[47128.000134] md: super_written gets error=-5, uptodate=0
[48043.976054] ata3: lost interrupt (Status 0x50)
[48043.976086] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[48043.976132] ata3.00: cmd c8/00:18:bf:40:52/00:00:00:00:00/e1 tag 0 dma 12288 in
[48043.976135] res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[48043.976208] ata3.00: status: { DRDY }
[48043.976241] ata3: soft resetting link
[48044.148446] ata3.00: configured for UDMA/133
[48044.148457] ata3.00: device reported invalid CHS sector 0
[48044.148477] ata3: EH complete

Recap:

  1. No possibility of overheating
  2. 6 drives have failed, 4 of them brand new. I'm not sure anymore whether the original two were genuinely faulty, or suffered the same thing as the new ones.
  3. There is nothing in common in the system apart from the OS, which is Ubuntu Karmic now (started with Jaunty). New MB, new CPU, new RAM, new SATA cables.
  4. No, the little holes on the hard drive are not covered

I'm crying. Really. I don't have the face to return to the store now; it's not possible for 4 drives to fail in under 4 months.

A few ideas that I have been thinking about:
Is it possible that I'm messing something up when I partition and resync the drives? Can that be so bad that it physically wrecks the drive? (The vendor-supplied tool
does say that the drive is damaged.)
I do the partitioning with fdisk, and use the same block size for the raid1 partitions (I check the exact block sizes with fdisk -lu). Roughly, the setup looks like the sketch below.
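
(A sketch of what I do; the device names and partition numbers here are illustrative, not my exact ones:)

# exact start/end boundaries of the existing partitions, in sectors
fdisk -lu /dev/sda
fdisk -lu /dev/sdb

# the arrays were created along these lines: raid1 over matching partitions
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2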

Is it possible that the Linux kernel, or mdadm, or something else is not compatible with this exact brand of hard drives, and thrashes them?

Is it possible that it's the shoebox? Should I try placing it somewhere else? It's under a shelf now, so humidity is not a problem either.
Is it possible that a normal PC case will solve my problem? (I'm going to shoot myself if it does.) I will get a picture tomorrow.

Am I just simply cursed?

Any help or speculation is greatly appreciated.

Edit:
The power strip has overvoltage protection.

Edit2:
I have moved house during these 4 months, so the possibility of the cause being "dirty" electricity in both places is very low.

Edit3:
I have checked the voltages in the BIOS (I couldn't borrow a multimeter), and they all seem correct; the biggest discrepancy is on the 12V rail, which reads 11.3V. Should I be worried about that?

Edit4:
I put my desktop PC's PSU into the server. The BIOS reported much more accurate voltage readings, and it has also successfully rebuilt the raid1 array, which took some 3-4 hours, so I feel a little positive now. I will get a new PSU tomorrow to test with that.
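(Watching the rebuild is just a matter of polling /proc/mdstat; roughly:)

# refresh the rebuild progress and estimated finish time every few seconds
watch -n 5 cat /proc/mdstat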
Also, attaching a picture of the box (disregard the 3rd drive):

picture of box of doom

Best Answer

Is your power supply old too? Perhaps it's under/overpowering the drives, which is causing the failures. If you have a multimeter, I would try measuring the voltage going to your hard drives and watch it over a period of time. Another culprit may be 'dirty' electricity, so a UPS may be in order to 'clean' the power going into the PSU.
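
If the motherboard exposes hardware sensors, you can also watch the rails from Linux without a multimeter; a rough sketch with lm-sensors (exact labels depend on the sensor chip, and not every board reports the +5V/+12V rails):

# one-time setup: probe for sensor chips and load the right kernel modules
sudo sensors-detect

# log every reading once a minute so you can spot sags under load
while true; do date; sensors; echo; sleep 60; done >> voltages.log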
