Linux Networking – Troubleshoot Strange Temporary Network Outage

linux, linux-kernel, networking

I'm facing a very annoying problem that I noticed about a week ago and for which I can't find an answer: my network suddenly stops responding, usually coming back exactly 25 seconds later. I was using kernel 3.10.4 and have now migrated to 3.11-rc4 to see if anything changed, but the behavior is the same. Since the problem is hard to spot (ordinary web surfing happens in "bursts" and the outages are completely random), I can't tell whether it was also present in earlier kernels. (I always use custom but unpatched kernels from kernel.org, all compiled by myself.)

I can't tell whether the kernel is the culprit either, but I can say there are no clues in the system logs (I checked both /var/log/syslog and /var/log/messages and there is nothing unusual there) and that the hardware doesn't seem to be at fault, since the problem shows up with either of my network cards:

lspci output:

02:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5751 Gigabit Ethernet PCI Express (rev 01)
04:00.0 Ethernet controller: 3Com Corporation 3c905B 100BaseTX [Cyclone] (rev 30)

I have already tried swapping Ethernet switch ports, and no one else where I work has the problem (although we use similar machines, I'm the only one using Linux, so I've had to put up with some jokes about it as well… hehe).

I fired up Wireshark on my machine and left it continuously pinging our gateway and another machine on the same network segment. At the first sign of a network malfunction I checked the capture and found that the gateway had stopped responding to pings, while the other machine was still responding normally. At other times it is the other machine that stops responding while the gateway is fine, and at other times both stop responding. I don't know what else to do, so I'd like some help or tips on how to debug this further, since the system logs are completely normal.

I have my kernel config file and a Wireshark capture file showing the situation. I can post them here or on a pastebin site if anyone would find them useful for understanding the case; just please let me know what level of detail I should use (I guess the packet level without the raw data would be enough).

Best Answer

The symptoms are consistent with an IP address conflict. An IP address conflict arises when your machine and some other machine on the same network are trying to use the same IP address.

On a local link network, addressing is based on MAC addresses. Every Ethernet card has its own MAC address (barring gross misconfiguration or malice). A router deciding where to send an IP packet will send an ARP request for the target IP address on all its ports. That message is sometimes known as “who has”: the router is trying to find out which of its peers is responsible for this IP address. Once the router receives a reply containing a MAC address, it can build and send an Ethernet frame (Ethernet packet) containing the IP packet to that MAC address. Since this exchange takes a while, the router keeps a cache of recent ARP information. (There are other types of ARP messages, but what I've explained here is sufficient to understand the present issue.)

So in a nutshell, routers need to know which physical device has each IP address that they're sending IP packets to. So what happens when there are two devices claiming the same IP address? The router receives a reply from one of the devices, and from then on it decides that this IP address belongs to that device, until the corresponding cache entry expires. After the cache entry expires, the router will send a new ARP request, and maybe the other device will reply faster this time. This explains why such situations are unstable: one minute the router is talking to you, the next minute it's talking to the other guy.
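To make the flip-flopping concrete, here is a minimal toy model in Python of a cache with a fixed entry lifetime and two devices racing to answer. It is only a sketch of the mechanism described above, not how any real router is implemented: the class and function names, the MAC and IP addresses, and the rule for who wins the race are all made up for illustration, and real ARP implementations refresh and replace entries in more ways than this.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# Hypothetical MAC addresses for the two devices claiming the same IP.
MINE = "aa:aa:aa:aa:aa:aa"
THEIRS = "bb:bb:bb:bb:bb:bb"


@dataclass
class ArpEntry:
    mac: str
    expires_at: float  # time after which the entry is no longer trusted


class ArpCache:
    """Toy ARP cache: one MAC per IP, each entry valid for `ttl` seconds."""

    def __init__(self, ttl: float) -> None:
        self.ttl = ttl
        self.entries: Dict[str, ArpEntry] = {}

    def lookup(self, ip: str, now: float) -> Optional[str]:
        entry = self.entries.get(ip)
        if entry is not None and now < entry.expires_at:
            return entry.mac
        return None  # missing or expired: a new "who has" request is needed

    def learn(self, ip: str, mac: str, now: float) -> None:
        self.entries[ip] = ArpEntry(mac, now + self.ttl)


def simulate(
    ttl: float,
    first_responder: Callable[[float], str],
    send_times: Iterable[float],
    ip: str = "192.0.2.10",  # documentation-range address, not from the question
) -> List[Tuple[float, str]]:
    """Return (time, mac) pairs: where the router sends each packet for `ip`."""
    cache = ArpCache(ttl)
    timeline = []
    for now in send_times:
        mac = cache.lookup(ip, now)
        if mac is None:
            # Cache miss: whoever answers the ARP request first wins the entry.
            mac = first_responder(now)
            cache.learn(ip, mac, now)
        timeline.append((now, mac))
    return timeline


def race_winner(now: float) -> str:
    """Assumed race outcome: the other device answers first between t=25 and t=50."""
    return THEIRS if 25 <= now < 50 else MINE


# One packet every 5 seconds for 100 seconds, with a 25-second cache lifetime.
timeline = simulate(25.0, race_winner, range(0, 100, 5))
for now, mac in timeline:
    print(now, "mine" if mac == MINE else "theirs")
```

In this model, losing a single race costs a full cache lifetime: every packet sent during that window goes to the other device, however many packets there are.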

If you continuously ping someone, then the router keeps your IP address in its ARP cache pretty much all the time. So while you're pinging, there's only a small window during which the other guy can replace you in the cache (after your cache entry expires, before the next ping comes). That's why observing the problem makes it mostly go away, which can be frustrating until you realize what the problem might be.

In your case, it looks like your local router keeps entries in its cache for 25 seconds. When you're in the cache, you're good for 25 seconds. Then sometimes the other guy comes, at random-looking moments, and you're out of it for 25 seconds.

When you try to contact multiple machines on the same local link, each has its own ARP table, so you may observe inconsistent results, with one machine deciding that you own the IP address and another machine deciding that the other guy does.
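Your own Linux machine keeps such a table too, and the kernel exposes it as text in /proc/net/arp. The following is a small sketch of how that text could be parsed in Python. The function name and the sample addresses are invented for illustration. Note that this table only shows what your machine believes about its peers, not what the peers believe about you, so it cannot confirm a conflict on your own address by itself.

```python
from typing import Dict

ATF_COM = 0x2  # kernel flag bit: the entry is complete (a MAC address is known)

# Made-up sample in the layout of /proc/net/arp (one header line, then rows).
SAMPLE = """\
IP address       HW type     Flags       HW address            Mask     Device
192.0.2.1        0x1         0x2         00:11:22:33:44:55     *        eth0
192.0.2.20       0x1         0x0         00:00:00:00:00:00     *        eth0
"""


def parse_proc_net_arp(text: str) -> Dict[str, str]:
    """Map each IP address with a complete entry to its MAC address."""
    table: Dict[str, str] = {}
    for line in text.splitlines()[1:]:  # skip the header line
        fields = line.split()
        if len(fields) < 6:
            continue  # blank or malformed row
        ip, flags, mac = fields[0], fields[2], fields[3]
        if int(flags, 16) & ATF_COM:
            table[ip] = mac.lower()
    return table


print(parse_proc_net_arp(SAMPLE))
```

On the affected machine you would pass in the contents of /proc/net/arp instead of `SAMPLE`. If the MAC address listed for the gateway changes between readings, that would suggest the gateway's address is the one being contested.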

High-end routers log IP address conflicts, so if you think you're encountering one, enlist the help of your system administrator. Make sure first that it isn't your machine that's trying to use an IP address that it shouldn't be using!
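If you already have a packet capture, you can also look for the conflict yourself: list the sender IP and sender MAC of every ARP packet in the capture and check whether any IP address is claimed by more than one MAC address. Below is a minimal Python sketch of that check. It assumes the (IP, MAC) pairs have already been exported from the capture by some other means; the function name and the sample data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def find_conflicts(observations: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Return the IP addresses that were claimed by more than one MAC address.

    `observations` is a sequence of (sender_ip, sender_mac) pairs taken
    from ARP packets.
    """
    claims: Dict[str, Set[str]] = defaultdict(set)
    for ip, mac in observations:
        claims[ip].add(mac.lower())  # normalise case so duplicates collapse
    return {ip: macs for ip, macs in claims.items() if len(macs) > 1}


# Made-up observations: 192.0.2.10 is claimed by two different devices.
observed = [
    ("192.0.2.1", "00:11:22:33:44:55"),
    ("192.0.2.10", "AA:AA:AA:AA:AA:AA"),
    ("192.0.2.10", "bb:bb:bb:bb:bb:bb"),
    ("192.0.2.10", "aa:aa:aa:aa:aa:aa"),
]
print(find_conflicts(observed))
```

A hit is a strong hint rather than proof, since one IP address seen with several MAC addresses can also have benign causes, such as a failover setup.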
