Debian Networking – Fix Packet Dropping on Debian 6.0 with Sandy Bridge Hardware

Tags: debian, kernel, linux, networking

I recently migrated an existing Debian system to new hardware: a Core i3 chip on an Intel Sandy Bridge motherboard. I’m experiencing a very strange problem: when I ping my router, about 50% of the packets are dropped.

I spent some time testing, and can verify it’s not the router. It works fine with multiple different machines, even when connected to the same Ethernet port on the router.
The pings that do come back have very low latency, less than 1 ms, as you’d expect from the router sitting across the room.
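
For anyone who wants a number rather than an impression, the loss figure can be pulled out of ping's summary line. This is a minimal sketch, assuming the iputils ping that Debian ships; the helper name ping_loss and the address 192.168.1.1 are placeholders, not details from this post.

```shell
# Print the packet-loss percentage from the summary that iputils ping
# writes at the end of a run. Reads ping's output on stdin.
ping_loss() {
    sed -n 's/.* \([0-9.][0-9.]*\)% packet loss.*/\1/p'
}

# Example (hypothetical router address):
#   ping -c 100 192.168.1.1 | ping_loss
```

Running the same command from a known-good machine on the same router port gives a baseline to compare against.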

I am using kernel 2.6.39, on Debian stable (I got the kernel from backports). Other than the kernel and a few related packages needed to get it going, the system is 100% Debian 6.0. The kernel detects the network hardware and loads the e1000e driver on boot. There is nothing strange in the logs.

One other thing: in spite of the problem, the networking "works", if you can call it that. What I mean is I can also ping Yahoo and Google successfully. Of course I lose ~50% of the packets in these cases too, but some packets are coming back. The other devices connected to this router are all working fine. I am typing this on a machine connected to the same router.

I am relatively experienced in Linux, but not sure where to even start with this issue.
The only other thing I can think of is that the router is 10/100, not gigabit. Obviously that shouldn’t cause this issue, but maybe it’s related? OTOH, I’m pretty sure the last machine had gigabit Ethernet too. It was plugged into the same port on the same router.

Yes, I’ve tried rebooting the router, and the machine, multiple times.

I’m hoping someone here will have an idea.


UPDATE: @bdk makes some good suggestions… wish I had good news! 🙁

I tried a bunch more things, and got nowhere. I also grabbed some output from the system to include here.

Sometimes when I try to ping, it can't find the host at all. If I try again, it connects. I assume this is just the first ping(s) failing. @bdk, the failures seem intermittent; at least, I cannot see a pattern.

Here are the relevant lines from dmesg. Am I missing some red flag?

[    1.171187] e1000e: Intel(R) PRO/1000 Network Driver - 1.3.10-k2
[    1.171190] e1000e: Copyright(c) 1999 - 2011 Intel Corporation.
[    1.171225] e1000e 0000:00:19.0: PCI INT A -> GSI 20 (level, low) -> IRQ 20
[    1.171236] e1000e 0000:00:19.0: setting latency timer to 64
[    1.171339] e1000e 0000:00:19.0: irq 42 for MSI/MSI-X
[    1.460976] e1000e 0000:00:19.0: eth0: (PCI Express:2.5GB/s:Width x1) e0:69:95:dd:5d:d9
[    1.460979] e1000e 0000:00:19.0: eth0: Intel(R) PRO/1000 Network Connection
[    1.461015] e1000e 0000:00:19.0: eth0: MAC: 10, PHY: 11, PBA No: FFFFFF-0FF
[   48.475222] e1000e 0000:00:19.0: irq 42 for MSI/MSI-X
[   48.530979] e1000e 0000:00:19.0: irq 42 for MSI/MSI-X
[   50.120859] e1000e: eth0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: Rx/Tx
[   50.120863] e1000e 0000:00:19.0: eth0: 10/100 speed: disabling TSO

Things I tried that did not help:

Installed firmware-linux-free and firmware-linux-nonfree, in case better firmware was available (there wasn't, or at least the kernel didn't find it).

Played with ASPM in the BIOS, since others have reported ASPM causing problems for e1000e Ethernet (didn't help).

Completely disabled pcie_aspm in the kernel, in case that was causing the problem (it wasn't, but disabling it did introduce new problems).

mii-tool is apparently not supported by this chip? Is there a special Intel tool to use instead?

When I took a look at tcpdump, things started looking grimmer. Not only are some of the packets not making it back, some aren't even making it out!

14:25:01.162331 IP debian.local > 74.125.224.80: ICMP echo request, id 2334, seq 1, length 64
14:25:02.168630 IP debian.local > 74.125.224.80: ICMP echo request, id 2334, seq 2, length 64
14:25:02.228192 IP 74.125.224.80 > debian.local: ICMP echo reply, id 2334, seq 2, length 64
14:25:07.236359 IP debian.local > 74.125.224.80: ICMP echo request, id 2334, seq 3, length 64
14:25:07.259431 IP 74.125.224.80 > debian.local: ICMP echo reply, id 2334, seq 3, length 64
14:25:31.307707 IP debian.local > 74.125.224.80: ICMP echo request, id 2334, seq 9, length 64
14:25:32.316628 IP debian.local > 74.125.224.80: ICMP echo request, id 2334, seq 10, length 64
14:25:33.324623 IP debian.local > 74.125.224.80: ICMP echo request, id 2334, seq 11, length 64
14:25:33.349896 IP 74.125.224.80 > debian.local: ICMP echo reply, id 2334, seq 11, length 64
14:25:43.368625 IP debian.local > 74.125.224.80: ICMP echo request, id 2334, seq 17, length 64
14:25:43.394590 IP 74.125.224.80 > debian.local: ICMP echo reply, id 2334, seq 17, length 64
14:26:18.518391 IP debian.local > 74.125.224.80: ICMP echo request, id 2334, seq 30, length 64
14:26:18.537866 IP 74.125.224.80 > debian.local: ICMP echo reply, id 2334, seq 30, length 64
14:26:19.519554 IP debian.local > 74.125.224.80: ICMP echo request, id 2334, seq 31, length 64
14:26:20.518588 IP debian.local > 74.125.224.80: ICMP echo request, id 2334, seq 32, length 64
14:26:21.518559 IP debian.local > 74.125.224.80: ICMP echo request, id 2334, seq 33, length 64
14:26:21.538623 IP 74.125.224.80 > debian.local: ICMP echo reply, id 2334, seq 33, length 64
14:26:37.573641 IP debian.local > 74.125.224.80: ICMP echo request, id 2334, seq 35, length 64
14:26:38.580648 IP debian.local > 74.125.224.80: ICMP echo request, id 2334, seq 36, length 64
14:26:38.602195 IP 74.125.224.80 > debian.local: ICMP echo reply, id 2334, seq 36, length 64

Notice the request sequence: it goes 1, 2, 3… 9??! That can't be good.
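
The gaps are easier to see when extracted mechanically. Below is a rough sketch that reads tcpdump's text output: missing_seqs lists sequence numbers skipped between consecutive echo requests, and unanswered_seqs lists requests that were captured but never got a reply. Both names are invented for this example, and the parsing assumes the "seq N," field layout shown in the capture above.

```shell
# Both helpers read tcpdump text output on stdin.

# Print sequence numbers skipped between consecutive echo requests,
# i.e. requests that never appeared in the capture at all.
missing_seqs() {
    awk '
        /ICMP echo request/ {
            for (i = 1; i < NF; i++) {
                if ($i == "seq") {
                    n = $(i + 1)
                    sub(/,$/, "", n)    # "3," -> "3"
                    n += 0              # force a numeric value
                    if (seen) {
                        for (s = prev + 1; s < n; s++) print s
                    }
                    prev = n
                    seen = 1
                }
            }
        }
    '
}

# Print sequence numbers of captured echo requests with no matching reply.
unanswered_seqs() {
    awk '
        /ICMP echo (request|reply)/ {
            n = ""
            for (i = 1; i < NF; i++) {
                if ($i == "seq") {
                    n = $(i + 1)
                    sub(/,$/, "", n)
                }
            }
            if (n == "") next
            if ($0 ~ /echo request/) order[++count] = n
            else replied[n] = 1
        }
        END {
            for (k = 1; k <= count; k++)
                if (!(order[k] in replied)) print order[k]
        }
    '
}
```

With the capture saved to a file (capture.txt is a placeholder name), missing_seqs < capture.txt reports 4 through 8 as the first gap, and unanswered_seqs < capture.txt starts with 1, 9 and 10. The two lists separate requests that never reached the capture from requests that went out and got no answer.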

I know Sandy Bridge is still relatively new, but Linux does work… right?

Could this be bad hardware? No way… right?

Sigh… maybe I should just go back to the old system.

Best Answer

Apparently this issue is already known to the Ubuntu folks. Got to hand it to 'em!

For starters, the quick workaround: you can get your system running again by slowing the Ethernet link down to 10 Mbps like this:

sudo ethtool -s eth0 speed 10 autoneg off

(Note that mii-tool does NOT work with this Ethernet chip.)
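
To confirm the forced setting took effect, the Speed, Duplex and Auto-negotiation lines of ethtool's report are the ones to look at. Here is a small sketch that condenses them into one line; link_summary is a name invented here, and the field labels are assumed to match what your ethtool version prints.

```shell
# Condense the output of "ethtool <iface>" (read on stdin) into one line
# showing the negotiated or forced speed, duplex and autonegotiation state.
link_summary() {
    awk -F': *' '
        $1 ~ /Speed$/            { speed = $2 }
        $1 ~ /Duplex$/           { duplex = $2 }
        $1 ~ /Auto-negotiation$/ { autoneg = $2 }
        END { printf "speed=%s duplex=%s autoneg=%s\n", speed, duplex, autoneg }
    '
}

# Example:
#   sudo ethtool eth0 | link_summary
```

After the workaround, this should report a 10 Mb/s speed with autonegotiation off. The setting does not survive a reboot unless it is reapplied.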

I actually don't have a confirmed fix yet, but apparently no one does. I chose to answer this question because the nature of this problem is something people need to be aware of.

According to the Ubuntu bug report, this is a hardware fault that randomly affects only some recent Intel Ethernet chips. Not particular models, but individual chips, meaning there's no way to tell in advance which ones are good and which aren't. At a minimum, the 82579V (my chip) and the 82579LM are affected; the Ubuntu team has confirmed those. Who knows how many other models are affected.
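
To check whether a machine carries one of the two confirmed models, lspci output can be filtered for them. This is a sketch only: list_82579 is a made-up helper name, it assumes the local PCI ID database knows the device by its model name, and a match says nothing about whether that particular chip is one of the faulty ones.

```shell
# Read lspci output on stdin and print Ethernet controllers whose model
# is one of the two named above as confirmed (82579V, 82579LM).
list_82579() {
    grep -i 'ethernet controller' | grep -E '82579(V|LM)'
}

# Example:
#   lspci | list_82579
```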

It may be wise to avoid motherboards that use Intel ethernet chips, at least until the extent of the problem is fully understood.

So it appears this actually is a hardware bug after all. There are rumors that you can download, compile, and install the latest Intel driver, which contains a permanent software workaround. The download is here; compiling and installing are left as an exercise for the reader.

I'm curious what this software workaround is, and whether it permanently reduces any functionality or performance. There must be some tradeoff, right? Unfortunately I was unable to experiment with this myself, since I needed to get this motherboard sent back within the return window.

The Ubuntu bug reports can be found here and here. Many thanks to the awesome Ubuntu team! They really do great things for Linux hardware compatibility.

What surprised me most about this is that I was apparently among the first to come across this issue. The Ubuntu bug reports above are still active as of this writing. Is no one using Linux on Sandy Bridge yet? Am I the only person left on the planet with 10/100 network hardware? Perhaps the most likely reason is that the Intel ethernet hardware problem only recently manifested itself.

-- Eric