Linux – Forcing Ping to Egress When Destination Interface is Local (Debian)

Tags: linux, linux-kernel, lxc, networking, tcpip

I am running a Debian-based Linux container under Proxmox 4.4. This host has five network interfaces (though only two come into play in the problem I'm having).

While I am shelled into this host, I ping the IP address associated with eth1. What is happening and what I believe should happen are two very different things.

What I want to happen is for the ping packet to egress eth3, where it will be routed to eth1.

What is happening is that the IP stack sees I'm pinging a local interface and it then sends the reply right back up the stack. I know the packet is not going out and coming back for two reasons:

  1. A packet capture (see the example capture invocation just after this list) shows nothing hitting either eth1 or eth3.
  2. The ping latency averages 0.013 ms. If the packet were going out and back as intended, the latency would be about 60 ms.
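
For reference, the capture in point 1 is just something along these lines (invocation shown for illustration), run in two separate shells while pinging; neither shows any ICMP traffic:

tcpdump -ni eth1 icmp
tcpdump -ni eth3 icmp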

Of course, I want the corresponding behavior when I ping the IP address associated with eth3: in that case, the packet should egress eth1 and be routed back to eth3. Unfortunately, the same short-circuiting behavior described above occurs.

Below, I show the static routes I've set up to try to induce the desired behavior. Such routes work as intended on a Windows machine, but they do not work under the Linux setup I am using.

How may I configure this host to forward as intended?

root@my-host:~# uname -a
Linux my-host 4.4.35-1-pve #1 SMP Fri Dec 9 11:09:55 CET 2016 x86_64 GNU/Linux
root@my-host:~#
root@my-host:~# cat /etc/debian_version
8.9
root@my-host:~#
root@my-host:~# ifconfig
eth0      Link encap:Ethernet  HWaddr xx:xx:xx:xx:xx:xx
          inet addr:192.0.2.65  Bcast:192.0.2.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:195028 errors:0 dropped:0 overruns:0 frame:0
          TX packets:12891 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:92353608 (88.0 MiB)  TX bytes:11164530 (10.6 MiB)

eth1      Link encap:Ethernet  HWaddr xx:xx:xx:xx:xx:xx
          inet addr:128.66.100.10  Bcast:128.66.100.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:816 errors:0 dropped:0 overruns:0 frame:0
          TX packets:486 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:149517 (146.0 KiB)  TX bytes:34107 (33.3 KiB)

eth2      Link encap:Ethernet  HWaddr xx:xx:xx:xx:xx:xx
          inet addr:203.0.113.1  Bcast:203.0.113.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:738 errors:0 dropped:0 overruns:0 frame:0
          TX packets:880 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:423603 (413.6 KiB)  TX bytes:94555 (92.3 KiB)

eth3      Link encap:Ethernet  HWaddr xx:xx:xx:xx:xx:xx
          inet addr:128.66.200.10  Bcast:128.66.200.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:611 errors:0 dropped:0 overruns:0 frame:0
          TX packets:182 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:43921 (42.8 KiB)  TX bytes:13614 (13.2 KiB)

eth4      Link encap:Ethernet  HWaddr xx:xx:xx:xx:xx:xx
          inet addr:198.51.100.206  Bcast:198.51.100.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:183427 errors:0 dropped:0 overruns:0 frame:0
          TX packets:83 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:85706791 (81.7 MiB)  TX bytes:3906 (3.8 KiB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:252 errors:0 dropped:0 overruns:0 frame:0
          TX packets:252 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1
          RX bytes:22869 (22.3 KiB)  TX bytes:22869 (22.3 KiB)
root@my-host:~#
root@my-host:~# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
192.0.2.0       0.0.0.0         255.255.255.0   U     0      0        0 eth0
128.66.100.0    0.0.0.0         255.255.255.0   U     0      0        0 eth1
203.0.113.0     0.0.0.0         255.255.255.0   U     0      0        0 eth2
128.66.200.0    0.0.0.0         255.255.255.0   U     0      0        0 eth3
198.51.100.0    0.0.0.0         255.255.255.0   U     0      0        0 eth4
root@my-host:~#
root@my-host:~# route -v add 128.66.200.10/32 gw 128.66.100.1
root@my-host:~# route -v add 128.66.100.10/32 gw 128.66.200.1
root@my-host:~#
root@my-host:~# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
192.0.2.0       0.0.0.0         255.255.255.0   U     0      0        0 eth0
203.0.113.0     0.0.0.0         255.255.255.0   U     0      0        0 eth2
198.51.100.0    0.0.0.0         255.255.255.0   U     0      0        0 eth4
128.66.100.0    0.0.0.0         255.255.255.0   U     0      0        0 eth1
128.66.100.10   128.66.200.1    255.255.255.255 UGH   0      0        0 eth3
128.66.200.0    0.0.0.0         255.255.255.0   U     0      0        0 eth3
128.66.200.10   128.66.100.1    255.255.255.255 UGH   0      0        0 eth1
root@my-host:~#
root@my-host:~# ping -c 3 128.66.100.10
PING 128.66.100.10 (128.66.100.10) 56(84) bytes of data.
64 bytes from 128.66.100.10: icmp_seq=1 ttl=64 time=0.008 ms
64 bytes from 128.66.100.10: icmp_seq=2 ttl=64 time=0.014 ms
64 bytes from 128.66.100.10: icmp_seq=3 ttl=64 time=0.017 ms

--- 128.66.100.10 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1998ms
rtt min/avg/max/mdev = 0.008/0.013/0.017/0.003 ms
root@my-host:~#

THURSDAY, 8/17/2017 8:12 AM PDT UPDATE

Per the request of dirkt, I am elaborating on our architecture and the reason for my question.

The virtual host that is the subject of this post (i.e. the host with network interfaces eth1, eth3, and three other network interfaces unrelated to my question) is being used to test a physical, wired TCP/IP networking infrastructure we have set up. Specifically, it is the routing functionality of this TCP/IP networking infrastructure that we are testing.

We used to have two virtual hosts, not one as I've described in my original post. A ping between these two hosts would be our smoke test to ensure that the TCP/IP networking infrastructure under test was still working.

For reasons too detailed to get into, having two hosts made it difficult to collect the logs we need. So we switched to one host, gave it two NICs, and set up static routes so that anything destined for NIC 2 would egress NIC 1 and vice versa. The problem, as I've stated, is that the packets are not egressing.

This one host / two NIC setup has worked under Windows for us for years. I don't know if that is because Windows is broken and we were inadvertently taking advantage of a bug, or if Windows is fine (i.e. RFC-compliant) and we just need to get the configuration right on our Linux VMs to get the same behavior.

To summarize and distill down the long block of shell text above:

Two Interfaces:

eth1: 128.66.100.10/24; the router on this interface's network has IP address 128.66.100.1
eth3: 128.66.200.10/24; the router on this interface's network has IP address 128.66.200.1

Relevant Routes:

Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
128.66.100.0    0.0.0.0         255.255.255.0   U     0      0        0 eth1
128.66.100.10   128.66.200.1    255.255.255.255 UGH   0      0        0 eth3
128.66.200.0    0.0.0.0         255.255.255.0   U     0      0        0 eth3
128.66.200.10   128.66.100.1    255.255.255.255 UGH   0      0        0 eth1

Command I'm Executing:

ping -c 3 128.66.100.10

The destination of 128.66.100.10 matches two of the above routes:

Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
128.66.100.0    0.0.0.0         255.255.255.0   U     0      0        0 eth1
128.66.100.10   128.66.200.1    255.255.255.255 UGH   0      0        0 eth3

The route with the longest prefix match is:

Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
128.66.100.10   128.66.200.1    255.255.255.255 UGH   0      0        0 eth3

What I am trying to understand is why, given the existence of this route, the packet won't egress eth3, travel through our TCP/IP networking infrastructure, and come back to hit eth1 from the outside.

The TCP/IP stack is apparently not consulting the forwarding table. It's as if, upon seeing that I'm pinging a locally-configured interface, the TCP/IP stack just says, "Oh, this is a local interface. So I'm not going to consult the forwarding table. Instead, I'll just send an echo reply right back up the stack."
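
For what it's worth, one way to check what the kernel actually decides to do with this destination (commands shown for illustration; exact output will vary) is to inspect the policy rules and the local routing table, which is consulted before the main table:

ip rule show
ip route show table local
ip route get 128.66.100.10

If 128.66.100.10 comes back as a "local" route, the kernel will deliver it internally regardless of what the main table says.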

Is the behavior I desire RFC-compliant? If it is not, I must abandon the attempt. But if it is RFC-compliant, I would like to learn how to configure the Linux TCP/IP stack to allow this behavior.

MONDAY, 8/21/2017 UPDATE

I've discovered the sysctl rp_filter and accept_local kernel parameters. I have set them as follows:

root@my-host:~# cat /proc/sys/net/ipv4/conf/eth1/accept_local
1
root@my-host:~# cat /proc/sys/net/ipv4/conf/eth3/accept_local
1
root@my-host:~# cat /proc/sys/net/ipv4/conf/all/accept_local
1
root@my-host:~# cat /proc/sys/net/ipv4/conf/default/accept_local
1
root@my-host:~# cat /proc/sys/net/ipv4/conf/eth1/rp_filter
0
root@my-host:~# cat /proc/sys/net/ipv4/conf/eth3/rp_filter
0
root@my-host:~# cat /proc/sys/net/ipv4/conf/all/rp_filter
0
root@my-host:~# cat /proc/sys/net/ipv4/conf/default/rp_filter
0

Setting these kernel parameters, rebooting, verifying they survived the reboot, and testing again showed no difference in behavior.
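
For reference, settings like these can be made to survive a reboot with a sysctl drop-in along the following lines (the file name is illustrative, assuming the usual Debian sysctl.d mechanism), applied immediately with sysctl --system:

# /etc/sysctl.d/99-boomerang-ping.conf (illustrative name)
net.ipv4.conf.eth1.rp_filter = 0
net.ipv4.conf.eth3.rp_filter = 0
net.ipv4.conf.all.rp_filter = 0
net.ipv4.conf.default.rp_filter = 0
net.ipv4.conf.eth1.accept_local = 1
net.ipv4.conf.eth3.accept_local = 1
net.ipv4.conf.all.accept_local = 1
net.ipv4.conf.default.accept_local = 1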

Please note that my-host is an LXC Linux container running under Proxmox 4.4. I have also set rp_filter and accept_local as shown above on the hypervisor interfaces that correspond to the eth1 and eth3 interfaces on my-host.

To re-summarize my objective, I have a Linux host with two NICs, eth1 and eth3. I am trying to ping out of eth1, have the ping packet routed through a TCP/IP network infrastructure under test, and have it make its way back to eth3.

Nothing I've tried above has allowed me to do so. How may I do so?

8/27/2017 UPDATE

Per a note by dirkt pointing out that I had failed to mention whether eth1 and eth3 are purely virtual or correspond to a physical interface: eth1 and eth3 both correspond to the same physical interface on the hypervisor. The intent is that a packet egressing eth1 actually, physically leaves the hypervisor box, goes out onto a real TCP/IP network, and gets routed back.

8/27/2017 UPDATE #2

Per dirkt's suggestion, I have investigated network namespaces, as they seemed quite promising. However, it doesn't "just work".

I am using LXC containers, and it seems that some of the isolation mechanisms present in containers are preventing me from creating a network namespace. Were I not running in a container, I think I'd have no problem adding the network namespace.

I am finding some references to making this work in LXC containers, but they are quite obscure and arcane. I'm not there yet, and have to throw in the towel for today… Should anyone have any suggestions in this regard, please advise…

Best Answer

(I'll leave the other answer because of the comments).

Description of the task: Given a single virtual host in an LXC container with two network interfaces eth1 and eth3, which are on different LAN segments and externally connected through routers, how can one implement a "boomerang" ping that leaves on eth3 and returns on eth1 (or vice versa)?

The problem here is that the Linux kernel will detect that the destination address is assigned to eth1, and will try to directly deliver the packets to eth1, even if the routing tables prescribe that the packets should be routed via eth3.

It's not possible to just remove the IP address from eth1, because the ping must be answered. So the only solution is to somehow use two different addresses (or to separate eth1 and eth3 from each other).

One way to do that is to use iptables, as in this answer linked by harrymc in the comments.

Another way, which I have tested on my machine with the following setup, uses one network namespace to simulate the external network and two network namespaces to separate the destination IP addresses:

Routing NS     Main NS      Two NS's

+----------+                   +----------+
|   veth0b |--- veth0a ....... | ipvl0    |
| 10.0.0.1 |    10.0.0.254     | 10.0.0.2 |
|          |                   +----------+
|          |                   +----------+
|   veth1b |--- veth1a ....... | ipvl1    |
| 10.0.1.1 |    10.0.1.254     | 10.0.1.2 |
+----------+                   +----------+
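
For anyone who wants to reproduce this, the simulated topology above can be built roughly as follows (a sketch of one possible way to do it; the namespace name "router" is illustrative):

# routing namespace plus two veth pairs, names as in the diagram
ip netns add router
ip link add veth0a type veth peer name veth0b
ip link add veth1a type veth peer name veth1b
ip link set veth0b netns router
ip link set veth1b netns router

# Main NS side
ip addr add 10.0.0.254/24 dev veth0a
ip addr add 10.0.1.254/24 dev veth1a
ip link set veth0a up
ip link set veth1a up

# Routing NS side, with forwarding enabled
ip netns exec router ip addr add 10.0.0.1/24 dev veth0b
ip netns exec router ip addr add 10.0.1.1/24 dev veth1b
ip netns exec router ip link set veth0b up
ip netns exec router ip link set veth1b up
ip netns exec router sysctl -w net.ipv4.ip_forward=1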

The Routing NS has forwarding enabled. The additional 10.0.*.2 addresses are assigned to an IPVLAN device, which one can think of as an extra IP address assigned to the master interface it is connected to. More details about IPVLAN can be found e.g. here. Create the device like this:

ip link add ipvl0 link veth0a type ipvlan mode l2
ip link set ipvl0 netns nsx

where nsx is the new network namespace, then in that namespace,

ip netns exec nsx ip addr add 10.0.0.2/24 dev ipvl0
ip netns exec nsx ip link set ipvl0 up
ip netns exec nsx ip route add default via 10.0.0.1 dev ipvl0
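
The second namespace is set up the same way (spelled out here for completeness; only the first namespace is shown above, and the name nsy is illustrative):

ip netns add nsy
ip link add ipvl1 link veth1a type ipvlan mode l2
ip link set ipvl1 netns nsy
ip netns exec nsy ip addr add 10.0.1.2/24 dev ipvl1
ip netns exec nsy ip link set ipvl1 up
ip netns exec nsy ip route add default via 10.0.1.1 dev ipvl1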

The Main NS has the following routing rules in addition to the default rules:

ip route add 10.0.0.2/32 via 10.0.1.1 dev veth1a
ip route add 10.0.1.2/32 via 10.0.0.1 dev veth0a

and then ping 10.0.0.2 will do a "boomerang" round trip, as can be seen with tcpdump on both veth0a and veth1a. So with this setup, all logging can be done from the Main NS as far as pinging is concerned, but fancier tests with nc and the like may need the other namespaces, at least to provide a receiver.
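
To watch the round trip, something along these lines in two more shells is enough (illustrative invocations):

tcpdump -ni veth0a icmp
tcpdump -ni veth1a icmp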

The LXC container uses network namespaces (and other namespaces). I am not too familiar with LXC containers, but if making new network namespaces inside the container is blocked, work from outside the container. First identify the container's network namespace with

ip netns list

and then do ip netns exec NAME_OF_LXC_NS ... as above. You can also delay moving eth1 and eth3 into the LXC container: first create the two IPVLANs, then move them into the container. Script as appropriate.

Edit

There's a third variant that works without network namespaces. The trick is to use policy routing: give the local table lookup a higher preference value (i.e. a "worse" priority) than normal, and treat packets from a socket bound to a specific interface differently. This prevents delivery to the local address, which was the main source of the problem.

With the same simulation setup as above minus the IPVLANs,

ip rule add pref 1000 lookup local
ip rule del pref 0
ip rule add pref 100 oif veth0a lookup 100
ip rule add pref 100 oif veth1a lookup 101
ip route add default dev veth0a via 10.0.0.1 table 100
ip route add default dev veth1a via 10.0.1.1 table 101

the commands

ping 10.0.1.254 -I veth0a
ping 10.0.0.254 -I veth1a

correctly egress ping requests. To also get a ping reply, one must disable the tests against source spoofing:

echo "0" > /proc/sys/net/ipv4/conf/veth0a/rp_filter
echo "0" > /proc/sys/net/ipv4/conf/veth1a/rp_filter
echo "1" > /proc/sys/net/ipv4/conf/veth0a/accept_local
echo "1" > /proc/sys/net/ipv4/conf/veth1a/accept_local
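
Translated to the interfaces and routers from the question (I have not tested it in that environment, so treat it as a sketch), the same approach would look roughly like:

ip rule add pref 1000 lookup local
ip rule del pref 0
ip rule add pref 100 oif eth1 lookup 100
ip rule add pref 100 oif eth3 lookup 101
ip route add default dev eth1 via 128.66.100.1 table 100
ip route add default dev eth3 via 128.66.200.1 table 101

echo "0" > /proc/sys/net/ipv4/conf/eth1/rp_filter
echo "0" > /proc/sys/net/ipv4/conf/eth3/rp_filter
echo "1" > /proc/sys/net/ipv4/conf/eth1/accept_local
echo "1" > /proc/sys/net/ipv4/conf/eth3/accept_local

# ping eth3's address so the request leaves on eth1 and the reply comes back in on eth3
ping 128.66.200.10 -I eth1
# and the reverse direction
ping 128.66.100.10 -I eth3

Note that the new rule for the local table has to be added before the old pref 0 rule is deleted, otherwise the host temporarily loses all its local routes.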

I also tried nc or socat, but I couldn't get them to work, because there are no options for nc to force the listener to answer on a specific device, and while there is such an option for socat, it doesn't seem to have an effect.

So network testing beyond pings is somewhat limited with this setup.
