“legal” ARP poisoning by machine aggregating 2 NICs crashes us

arp, ip, networking

Strange things are afoot, threats are being made, and we need to sort this problem out.

The situation:

Our device (a network camera) streams video over a network to a recorder/server, using Live555 / WIS Streamer. The video is sent as UDP packets.

On one particular site using one particular server, every so often (~24 hours) one thread of the Live555 streamer locks up whilst sending video. Other threads keep going, and we still have connectivity to the camera over IP – see web pages from it, PING it, etc.

We suspect the server: it has 2 network ports and aggregates them, so it has two MACs but one IP address. On wiresharking this, we see the camera streaming to one port (let's call it A), and then we get an ARP from the other port (let's call it B). Our device stops squirting packets to MAC A, squirts one packet up the wire to MAC B, and then appears to stop in its tracks.

Further info: The server seems to corrupt ARP packets from the "wrong" port, possibly as a result of a misconfiguration or somesuch, but those packets still get read and acted upon by our device, possibly because our driver or kernel networking is misconfigured or skips checksums to save CPU cycles.

So this messy situation raises a few questions:

  1. Where in the kernel networking code should I be looking to check the packet checksum or enable checking? Our hardware is fixed, being an embedded device, so a tweak made to the driver is not the worst idea ever.
  2. Can anyone guess the failure mechanism that causes a process to lock up when it's constantly send()ing data on a port and the ARP tables shift underneath it?

Edited to add: We now suspect that the ARPs are not really corrupt, just that Wireshark is not correctly identifying the packet (it thinks the packet is long enough that there must be an FCS word, i.e. a frame check sequence, but we now think it's just zero-padding). That really just leaves part 2 of this question: what can we do to prevent this change in the ARP table knocking a transmitting process over?
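One way to test the padding theory offline is to check whether the last 4 bytes of a captured ARP frame are a valid Ethernet CRC-32 or just part of the padding. The sketch below is illustrative Python, not anything from our firmware. It assumes an untagged Ethernet II frame carrying an IPv4 ARP (14-byte header plus 28-byte body, padded to the 60-byte minimum before the FCS). The helper names, the MAC and the IP address are all made up for the example.

```python
import struct
import zlib

ETH_HEADER_LEN = 14      # dst MAC + src MAC + EtherType
ARP_IPV4_LEN = 28        # ARP body for IPv4 over Ethernet
MIN_FRAME_NO_FCS = 60    # 64-byte Ethernet minimum minus the 4-byte FCS


def build_arp_frame(src_mac: bytes, src_ip: bytes, op: int = 2) -> bytes:
    """Build an unpadded broadcast ARP frame (hypothetical test input)."""
    eth = b"\xff" * 6 + src_mac + struct.pack("!H", 0x0806)
    arp = struct.pack("!HHBBH6s4s6s4s", 1, 0x0800, 6, 4, op,
                      src_mac, src_ip, b"\xff" * 6, src_ip)
    return eth + arp


def ethernet_fcs(frame_without_fcs: bytes) -> bytes:
    """CRC-32 of the frame, in the byte order it appears on the wire."""
    return struct.pack("<I", zlib.crc32(frame_without_fcs) & 0xFFFFFFFF)


def classify_trailer(captured: bytes) -> str:
    """Describe what follows the 42 bytes of Ethernet header plus ARP body."""
    trailer = captured[ETH_HEADER_LEN + ARP_IPV4_LEN:]
    if not trailer:
        return "no trailer"
    if (len(captured) >= MIN_FRAME_NO_FCS + 4
            and captured[-4:] == ethernet_fcs(captured[:-4])):
        return "padding plus valid FCS"
    if not any(trailer):
        return "zero padding only"
    return "unrecognised trailer"


# Made-up addresses: a locally administered MAC and a documentation-range IP.
frame = build_arp_frame(bytes.fromhex("02000000000b"), bytes([192, 0, 2, 10]))
padded = frame + bytes(MIN_FRAME_NO_FCS - len(frame))   # 18 zero bytes
print(classify_trailer(padded))
print(classify_trailer(padded + ethernet_fcs(padded)))
```

If the suspicious frames come out as "zero padding only", that would support the idea that Wireshark is guessing at an FCS that was never captured.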

Edit to further add: I don't want people to think I'm ignoring questions about port states or process states. The issue happens very rarely (on average maybe once per 24h) and only on one (remote) installation that we can't easily get access to. We're trying hard to replicate it in the lab so we can do more detailed diagnostics, but the system watchdog resets within ~3 minutes of the problem occurring, so by the time the news reaches us the device has already rebooted and started working OK.
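Since the watchdog reboots the device before anyone can inspect it, one option is to have the sending code itself notice a stall and record it somewhere that survives the reboot. The sketch below rests on an unconfirmed assumption: that the stuck thread is blocked inside a send call. `send_with_deadline` and the 0.5 s timeout are hypothetical. It is Python for illustration, not the Live555 C++ code, where the equivalent would be a send timeout or a non-blocking socket.

```python
import socket
import time


def send_with_deadline(sock, payload, addr, timeout_s=0.5):
    """Send one datagram.

    Returns the elapsed time in seconds, or None if the send did not
    complete within timeout_s (the case worth logging persistently).
    """
    sock.settimeout(timeout_s)          # bound how long sendto() may block
    start = time.monotonic()
    try:
        sock.sendto(payload, addr)
    except socket.timeout:
        return None
    return time.monotonic() - start
```

A `None` result, written to flash along with a timestamp, could later be lined up against the ARP events in the capture.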

Edit to add Wireshark info:
I'm not sure of the best way to summarise Wireshark captures here (very hard to upload ~1 TB of captured packets!) but I'll try. Cam:X and Cam:Y are two streams of RTSP video streamed by two identical instances of Live555 WIS Streamer from different ports. Server 'A' and 'B' are the MACs of the two NICs on the server.

The sequence of packets goes like this:

UDP Packet from Cam:X -> Server 'A'
UDP Packet from Cam:Y -> Server 'A'
UDP Packet from Cam:X -> Server 'A'
UDP Packet from Cam:Y -> Server 'A'
UDP Packet from Cam:X -> Server 'A'
UDP Packet from Cam:Y -> Server 'A'
ARP Packet to Cam from Server 'B' "<my IP> is now on 'B'"
Intel ANS Probe broadcast from Server 'B', Sender ID '1' team ID 'B'
Intel ANS Probe broadcast from Server 'A', Sender ID '2' team ID 'B'
<silence> from Cam:X
UDP Packet from Cam:Y -> Server 'B'
UDP Packet from Cam:Y -> Server 'B'
UDP Packet from Cam:Y -> Server 'B'

There are no other packets in the stream at or around this time. The Intel ANS packets do not always coincide with the ARPs from the NIC but I thought I'd include them for the sake of completeness.

The issue seems to be VERY sensitive to timing. We see these "team" ARPs regularly from the server and only once in a blue moon do they cause us an issue, as if there's a particular point in the network stack code that's sensitive to the ARP table changing. It's not always the same stream instance that falls over, and notably the other instance (as well as all other net traffic, HTTP etc.) continues to work fine.
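To correlate stalls with the ARPs, it may help to log every time the server's IP moves from one MAC to the other rather than eyeballing the capture. The following is a minimal sketch in Python, assuming the ARP payloads (the 28 bytes after the Ethernet header) have already been extracted from a capture along with their timestamps. `ArpFlipLog` and `parse_arp` are hypothetical names, not part of any library.

```python
import socket
import struct

ARP_FMT = "!HHBBH6s4s6s4s"   # IPv4-over-Ethernet ARP body, 28 bytes


def parse_arp(payload: bytes):
    """Return (sender_ip, sender_mac) as strings, or None if not IPv4/Ethernet ARP."""
    if len(payload) < 28:
        return None
    htype, ptype, hlen, plen, _op, sha, spa, _tha, _tpa = struct.unpack(
        ARP_FMT, payload[:28])
    if (htype, ptype, hlen, plen) != (1, 0x0800, 6, 4):
        return None
    mac = ":".join(f"{b:02x}" for b in sha)
    return socket.inet_ntoa(spa), mac


class ArpFlipLog:
    """Track which MAC last claimed each IP and record every change."""

    def __init__(self):
        self.owner = {}    # ip -> last MAC seen claiming it
        self.flips = []    # (timestamp, ip, old_mac, new_mac)

    def observe(self, timestamp: float, payload: bytes) -> None:
        parsed = parse_arp(payload)
        if parsed is None:
            return
        ip, mac = parsed
        old = self.owner.get(ip)
        if old is not None and old != mac:
            self.flips.append((timestamp, ip, old, mac))
        self.owner[ip] = mac
```

Feeding it the ARPs from a long capture gives a list of flip times that can be compared against the moments a stream went silent.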

It sounds like teamed NICs "should not" ARP like this mid-session, but of course they won't be aware of any session when the traffic is all UDP.

Best Answer

Well, if only to give some closure to this: the customer reconfigured their dodgy network card and everything worked. Unfortunately for the curious, that means no-one is going to pay anyone to look too closely at what could've been done to fix that case.
