Linux – Hard freeze stops physical reset button from working

boot, freeze, hardware-failure, linux, memory

I have a repurposed PC running as a server. It was assembled in early 2014 and contains an Intel Core i7-4770 on a Gigabyte Z87-HD3. It worked pretty reliably until early 2017, when it started to freeze intermittently (every few weeks to months). There are no kernel logs, and neither pstore crash data nor netconsole produced anything meaningful. The physical screen is blank, the network is unresponsive, and metrics at 10 s granularity show no correlation with load on CPU, RAM or disk. All LEDs and drives are still running, but there is obviously no I/O anymore. RAM has been tested and verified good, and there are no spurious segfaults or anything else that would indicate an intermittent hardware problem. Just hard freezes.
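For anyone retracing the pstore check after a forced reboot, here is a minimal sketch of how it could be scripted. It assumes pstore is mounted at the usual `/sys/fs/pstore` location and that the script runs with enough privileges to list that directory (typically root). The function name `list_pstore_records` is made up for illustration and is not part of any tool.

```python
from pathlib import Path


def list_pstore_records(pstore_dir="/sys/fs/pstore"):
    """Return (name, size_in_bytes) for each record file in pstore_dir, sorted by name.

    Assumption: pstore is mounted at pstore_dir. An absent directory is
    treated the same as "no records".
    """
    root = Path(pstore_dir)
    if not root.is_dir():
        return []
    records = []
    for entry in sorted(root.iterdir()):
        if entry.is_file():
            records.append((entry.name, entry.stat().st_size))
    return records


if __name__ == "__main__":
    found = list_pstore_records()
    if not found:
        print("no pstore records found")
    for name, size in found:
        print(f"{name}\t{size} bytes")
```

An empty result after a freeze, as in my case, only says that the kernel never got the chance to write a record. It does not narrow down which component is at fault.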

Now on to the very interesting part: once the system enters this state, the physical reset button stops working completely. When I press it, nothing happens. The button itself is definitely fine, since it works every time when the system is not in that state. I checked the voltages from the PSU with a multimeter and they are all fine. I can still reset the server by holding the power button for 5 s, and it boots up fine after that.

So I'm pretty much at a loss as to what is happening here and which piece of hardware is to blame. I have logic analyzers and I could get access to USB scopes, but nothing that samples above 100 MS/s, so I can't probe the actual buses. I would be very grateful for any insights into what might be going on.

Best Answer

So after a lot of strategic swapping (mainboard, PSUs, CPU) I have differential confirmation that the CPU is bad (the test system now experiences the problem, the original no longer does). This is a very unexpected result, since no MCEs (machine check exceptions) were ever raised; usually you get MCEs well before hard lockups.

Since this board sadly doesn't have a Trace Hub / JTAG connector, and the built-in USB3 debugging is not available on the Haswell platform, I have no idea what is actually going wrong. It's pretty certain that the chip ends up in a state where it fails to be released from reset (self-test failure, power rail not coming up, ...). It could be related to the introduction of FIVR (Fully Integrated Voltage Regulator) in Haswell, but that's just speculation.

If you hit this problem, it doesn't have to be the CPU; it could just as well be a failing motherboard or PSU (or something else entirely). I just wanted to post this for completeness, and so people can see that it can indeed be a CPU fault (although that is still pretty unlikely).
