Tc Qdisc Delay – Why Tc Qdisc Delay is Not Seen in Tcpdump Recording

Tags: delay, networking, tc, tcpdump, wireshark

I have two Linux containers connected with a veth pair. On the veth interface of one container I set up a tc qdisc netem delay and send traffic from it to the other container. If I watch the traffic on both sides using tcpdump/wireshark, the timestamps of the same packet at the sender and at the receiver do not differ by the selected delay.
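
For reference, the setup is along these lines (a minimal sketch; the interface name veth0 and the 100ms delay are illustrative):

# add a netem delay on the container's veth interface (name/value illustrative)
tc qdisc add dev veth0 root netem delay 100ms
# capture on the same interface; the timestamps do not reflect the delay
tcpdump -i veth0 -n -tt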

I wanted to understand in more detail at which point libpcap timestamps egress traffic relative to the tc qdisc. I searched for a diagram on the Internet but did not find one. I found this topic (wireshark packet capture point), but it advises introducing indirection by adding one more container/interface. That is not a possible solution in my situation. Is there any way to solve the problem without introducing additional intermediate interfaces (that is, without changing the topology), recording only at the already given veth interface, but in such a way that the delay can be seen?

UPDATE:

I was too quick and so got it wrong. Neither my solution presented below (the same as the first variant of the solution in @A.B's answer), nor @A.B's IFB solution (which I have already checked) solves my problem. The problem is the overflow of the transmit queue of the sender's interface a1-eth0 in the topology:

[a1-br0 ---3Gbps---a1-eth0]---100Mbps---r1---100Mbps---r2

I was too quick and had checked only a 10ms delay on the link between a1-eth0 and router r1. Today I tried higher delays, 100ms and 200ms, and the results (the per-packet delay and rate graphs which I get) start to differ between the topology above and the normal topology:

[a1-eth0]---100Mbps---r1---100Mbps---r2

So, certainly, for accurate testing I cannot have extra links: neither introduced by a Linux bridge, nor by this IFB, nor by any other third system. I am testing congestion control schemes, and I want to do it in a specific topology. I cannot change the topology just for the sake of plotting, that is, not if my rate and delay results/plots change at the same time.

UPDATE 2:

So it looks like a solution has been found, as can be seen below (the NFLOG solution).

UPDATE 3:

Some disadvantages of the NFLOG solution (large link-layer headers and wrong TCP checksums for egress TCP packets with zero payload) are described, and a better solution with NFQUEUE, which has neither of these problems, is proposed here: TCP checksum wrong for zero length egress packets (captured with iptables). However, for my tasks (testing congestion control schemes) neither NFLOG nor NFQUEUE is suitable. As explained at the same link, the sending rate gets throttled when packets are captured via the kernel's iptables (this is how I understand it). So when you record at the sender by capturing from the interface (i.e., regularly) you get a 2 Gigabyte dump, while if you record at the sender by capturing from iptables you get a 1 Gigabyte dump, roughly speaking.

UPDATE 4:

Finally, in my project I use the Linux bridge solution described in my own answer below.

Best Answer

According to the Packet flow in Netfilter and General Networking schematic, tcpdump captures (AF_PACKET) after egress (qdisc). So it's normal that you don't see the delay in tcpdump: the delay had already been applied at the initial, sender-side capture.

You'd have to capture one step earlier, thus involving a 3rd system:

S1: system1, runs tcpdump on outgoing interface
R: router (or bridge, at your convenience, this changes nothing), runs the qdisc netem
S2: system2, runs tcpdump on incoming interface

 __________________     ________________     __________________
|                  |   |                |   |                  |
| (S1) -- tcpdump -+---+- (R) -- netem -+---+- tcpdump -- (S2) |
|__________________|   |________________|   |__________________|

That means 3 network stacks are involved, be they real hosts, VMs, or network namespaces (including ip netns, LXC, ...).
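
For example, such a topology can be built with network namespaces alone (a minimal sketch; namespace names, interface names, the delay value, and the omitted addressing are all illustrative):

# three separate network stacks: S1 and S2 as endpoints, R as the router running netem
ip netns add S1; ip netns add R; ip netns add S2
ip link add v1 type veth peer name v1r        # S1 <-> R link
ip link add v2 type veth peer name v2r        # R <-> S2 link
ip link set v1 netns S1; ip link set v1r netns R
ip link set v2 netns S2; ip link set v2r netns R
# (assign addresses, bring the links up, and enable forwarding in R here)
ip netns exec R tc qdisc add dev v2r root netem delay 100ms
ip netns exec S1 tcpdump -i v1 -n -tt &       # sender-side capture, before netem
ip netns exec S2 tcpdump -i v2 -n -tt         # receiver-side capture, after netem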


Optionally, it's also possible to cheat and move all the special settings onto the router (or bridge) by using an IFB interface with mirred traffic: by a trick (dedicated to this case), it allows inserting netem sort-of-after ingress rather than on egress:

 _______     ______________________________________________     _______
|       |   |                                              |   |       |         
| (S1) -+---+- tcpdump -- ifb0 -- netem -- (R) -- tcpdump -+---+- (S2) |
|_______|   |______________________________________________|   |_______|

There's a basic IFB usage example in the tc-mirred manpage:

Using an ifb interface, it is possible to send ingress traffic through an instance of sfq:

# modprobe ifb
# ip link set ifb0 up
# tc qdisc add dev ifb0 root sfq
# tc qdisc add dev eth0 handle ffff: ingress
# tc filter add dev eth0 parent ffff: u32 \
  match u32 0 0 \
  action mirred egress redirect dev ifb0

Just use netem on ifb0 instead of sfq (and in a non-initial network namespace, ip link add name ifbX type ifb works fine, without modprobe).
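
Adapted to netem, that gives something like this (a sketch, assuming a 100ms delay and eth0 as the router's ingress interface):

ip link add name ifb0 type ifb    # no modprobe needed in a non-initial netns
ip link set ifb0 up
# the delay is applied on ifb0, which receives the mirrored ingress traffic
tc qdisc add dev ifb0 root netem delay 100ms
tc qdisc add dev eth0 handle ffff: ingress
tc filter add dev eth0 parent ffff: u32 \
  match u32 0 0 \
  action mirred egress redirect dev ifb0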

This still requires 3 network stacks to work properly.


Using NFLOG

After a suggestion from JenyaKh, it turns out it's possible to capture a packet with tcpdump before egress (thus before the qdisc) and then on egress (after the qdisc): use iptables (or nftables) to log full packets to the netlink log infrastructure, read them with tcpdump, and then also run tcpdump on the egress interface. This requires only settings on S1 (and doesn't need a router/bridge anymore).

So with iptables on S1, something like:

iptables -A OUTPUT -o eth0 -j NFLOG --nflog-group 1

Specific filters should probably be added to match the test being done, because the tcpdump filter is limited on the nflog interface (Wireshark should handle it better).
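
For example, the filtering can be done in the iptables rule itself (a hypothetical rule, assuming the test traffic is iperf3 on TCP port 5201):

# log only the flow under test instead of all egress traffic (port is illustrative)
iptables -A OUTPUT -o eth0 -p tcp --dport 5201 -j NFLOG --nflog-group 1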

If capturing the replies is needed (here done in a different group, thus requiring an additional tcpdump):

iptables -A INPUT -i eth0 -j NFLOG --nflog-group 2

Depending on needs, it's also possible to move them to raw/OUTPUT and raw/PREROUTING instead.
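
That would look like this (a sketch; the raw table hooks earlier in the stack, before connection tracking):

iptables -t raw -A OUTPUT -o eth0 -j NFLOG --nflog-group 1        # egress, raw/OUTPUT
iptables -t raw -A PREROUTING -i eth0 -j NFLOG --nflog-group 2    # ingress, raw/PREROUTING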

With tcpdump:

# tcpdump -i nflog:1 -n -tt ...

If a different group (= 2) was used for input:

# tcpdump -i nflog:2 -n -tt ...

Then at the same time, as usual:

# tcpdump -i eth0 -n -tt ...