Linux Networking – Link Aggregation for Bandwidth Not Working with LAG on Smart Switch

Tags: bandwidth, bonding, linux, networking

My question is: why does setting Link Aggregation Groups on the smart switch lower the bandwidth between two machines?

I have finally achieved higher throughput (bandwidth) between two machines (servers running Ubuntu 18.04 Server) connected via two bonded 10G CAT7 cables through a TP-LINK T1700X-16TS smart switch. The cables are connected to a single Intel X550-T2 NIC in each machine (each card has two RJ45 ports), which is plugged into a PCIe x8 slot.

The first thing I did was to create, in the switch's configuration, static LAG groups containing the two ports that each machine was connected to. This ended up being my first mistake.

On each box, I created a bond containing the two ports of the Intel X550-T2 card. I am using netplan (and networkd). E.g.:

network:
  version: 2
  renderer: networkd
  ethernets:
    ens11f0:
      dhcp4: no
      optional: true
    ens11f1:
      dhcp4: no
      optional: true
  bonds:
    bond0:
      mtu: 9000 #1500
      dhcp4: no
      interfaces: [ens11f0, ens11f1]
      addresses: [192.168.0.10/24]
      parameters:
        mode: balance-rr
        transmit-hash-policy: layer3+4 #REV: only good for xor?
        mii-monitor-interval: 1
        packets-per-slave: 1

Note the 9000-byte MTU (for jumbo frames) and balance-rr.
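To apply the config and double-check that the bond mode and jumbo MTU actually took effect, a quick sanity check looks something like this (standard tools, nothing specific to my setup):

sudo netplan apply

# Bond mode, MII status, and the two slaves:
cat /proc/net/bonding/bond0

# MTU should show 9000 on the bond and on both slaves:
ip link show bond0
ip link show ens11f0
ip link show ens11f1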

Given these settings, I can now use iperf (iperf3) to test bandwidth between the machines:

iperf3 -s (on machine1)

iperf3 -c machine1 (on machine2)

I get something like 9.9 Gbits per second (very close to theoretical max of single 10G connection)

Something is wrong though. I'm using round-robin and I have two 10G cables between the machines, so theoretically I should be able to get 20G of bandwidth, right?

Wrong.

Weirdly, I next deleted the LAG groups from the smart switch. Now the Linux side still has bonded interfaces, but as far as the switch is concerned there are no bonds (no LAG).

Now I run iperf3 again:

[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec  1.77 GBytes  15.2 Gbits/sec  540    952 KBytes       
[  4]   1.00-2.00   sec  1.79 GBytes  15.4 Gbits/sec  758    865 KBytes       
[  4]   2.00-3.00   sec  1.84 GBytes  15.8 Gbits/sec  736    454 KBytes       
[  4]   3.00-4.00   sec  1.82 GBytes  15.7 Gbits/sec  782    507 KBytes       
[  4]   4.00-5.00   sec  1.82 GBytes  15.6 Gbits/sec  582   1.19 MBytes       
[  4]   5.00-6.00   sec  1.79 GBytes  15.4 Gbits/sec  773    708 KBytes       
[  4]   6.00-7.00   sec  1.84 GBytes  15.8 Gbits/sec  667   1.23 MBytes       
[  4]   7.00-8.00   sec  1.77 GBytes  15.2 Gbits/sec  563    585 KBytes       
[  4]   8.00-9.00   sec  1.75 GBytes  15.0 Gbits/sec  407    839 KBytes       
[  4]   9.00-10.00  sec  1.75 GBytes  15.0 Gbits/sec  438    786 KBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  17.9 GBytes  15.4 Gbits/sec  6246             sender
[  4]   0.00-10.00  sec  17.9 GBytes  15.4 Gbits/sec                  receiver

Huh, now I get 15.4 Gbits/sec (sometimes up to 16.0).

The resends worry me (I was getting zero when I had the LAGs set up), but now I am getting at least some advantage.

Note: if I disable jumbo frames (i.e. set the MTU back to 1500), I get only about 4 Gbps to 5 Gbps.
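(Side note: the MTU comparison doesn't require editing netplan each time; a temporary change on both machines is enough for testing, e.g.:)

# Drop back to standard frames on both ends (propagates to the slaves):
sudo ip link set dev bond0 mtu 1500

# And back to jumbo frames (the switch must also have jumbo frames enabled):
sudo ip link set dev bond0 mtu 9000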

Does anyone know why setting up Link Aggregation Groups on the smart switch (which I thought was supposed to help) instead limits performance, while not setting them (heck, I could have saved my money and bought an unmanaged switch!) lets me send more packets that still get forwarded correctly?

What is the point of the switch's LAG groups? Am I doing something wrong somewhere? I would like to increase bandwidth even more than 16Gbps if possible.

edit

Copying from my comment below (update):

I verified a real-application throughput of 11 Gbps (1.25 GiB/sec) over my bonded connection, using nc (netcat) to copy a 60 GB file from a ramdisk on one system to another. I verified file integrity using a hash; it is the same file on both sides.
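For reference, the copy was along these lines (a sketch only: the port number, paths, and receiver IP are placeholders, it assumes the OpenBSD netcat that ships with Ubuntu 18.04, and sha256sum stands in for whichever hash you prefer):

# On the receiving machine: listen and write straight to its ramdisk
nc -l 5001 > /mnt/ramdisk/bigfile

# On the sending machine: stream the file over the bond (-N closes the socket at EOF;
# 192.168.0.10 is a placeholder for the receiver's bond IP)
nc -N 192.168.0.10 5001 < /mnt/ramdisk/bigfile

# On both machines afterwards, compare the hashes:
sha256sum /mnt/ramdisk/bigfile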

Using only one of the 10G ports at a time (or bonding with balance-xor etc.), I get 1.15 GiB/sec (about 9.9 Gbps). Both iperf and nc use a TCP connection by default. Copying to the local machine via loopback gets a speed of 1.5 GiB/sec. Looking at port usage on the switch, I see roughly equal usage on the sender's Tx side (70% in the case of iperf, ~55% in the case of the nc file copy), and equal usage between the two bonded ports on the Rx side.

So, in the current setup (balance-rr, MTU 9000, no LAG groups defined on the switch), I can achieve more than 10Gbps, but only barely.

Oddly enough, defining LAG groups on the switch now breaks everything (iperf and file transfers send 0 bytes). It probably just takes time for the switch to figure out the new switching situation, but I re-ran the tests many times and rebooted/reset the switch several times, so I'm not sure why that is.

edit 2

I actually found mention of striping, and of balance-rr allowing higher-than-single-port bandwidth, in the kernel.org docs.

https://www.kernel.org/doc/Documentation/networking/bonding.txt

Specifically

12.1.1 MT Bonding Mode Selection for Single Switch Topology

This configuration is the easiest to set up and to understand,
although you will have to decide which bonding mode best suits your
needs. The trade offs for each mode are detailed below:

balance-rr: This mode is the only mode that will permit a single
TCP/IP connection to stripe traffic across multiple interfaces. It
is therefore the only mode that will allow a single TCP/IP stream to
utilize more than one interface's worth of throughput. This comes at
a cost, however: the striping generally results in peer systems
receiving packets out of order, causing TCP/IP's congestion control
system to kick in, often by retransmitting segments.

It is possible to adjust TCP/IP's congestion limits by altering the
net.ipv4.tcp_reordering sysctl parameter. The usual default value is
3. But keep in mind TCP stack is able to automatically increase this when it detects reorders.

Note that the fraction of packets that will be delivered out of
order is highly variable, and is unlikely to be zero. The level of
reordering depends upon a variety of factors, including the
networking interfaces, the switch, and the topology of the
configuration. Speaking in general terms, higher speed network
cards produce more reordering (due to factors such as packet
coalescing), and a "many to many" topology will reorder at a higher
rate than a "many slow to one fast" configuration.

Many switches do not support any modes that stripe traffic (instead
choosing a port based upon IP or MAC level addresses); for those
devices, traffic for a particular connection flowing through the
switch to a balance-rr bond will not utilize greater than one
interface's worth of bandwidth.

If you are utilizing protocols other than TCP/IP, UDP for example,
and your application can tolerate out of order delivery, then this
mode can allow for single stream datagram performance that scales
near linearly as interfaces are added to the bond.

This mode requires the switch to have the appropriate ports
configured for "etherchannel" or "trunking."

So, theoretically, balance-rr will allow me to stripe a single TCP connection's packets across both interfaces. But they may arrive out of order, etc.

However, it mentions that most switches do not support striping, which seems to be the case with my switch. Watching traffic during a real file transfer, Rx packets (i.e. sending_machine->switch) arrive evenly distributed over both bonded ports. However, Tx packets (switch->receiving_machine) only go out over one of the ports (which reaches 90% or more saturation).
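(I was watching this on the switch's port statistics; the same split can be cross-checked from the Linux side with the per-slave counters, e.g.:)

# Per-slave RX/TX byte counters on either host:
ip -s link show ens11f0
ip -s link show ens11f1

# The bonding driver's own per-slave view:
cat /proc/net/bonding/bond0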

By not explicitly setting up the Link Aggregation groups on the switch, I'm able to achieve higher throughput, but I'm not sure how the receiving machine is telling the switch to send one packet down one port, the next down another, and so on.

Conclusion:

The switch's Link Aggregation Groups do not support round-robin (i.e. port striping) when sending packets. So, ignoring them allows me to get high throughput, but the actual writing to memory (ramdisk) seems to hit a memory, CPU-processing, or packet-reordering saturation point.

I tried increasing/decreasing the reordering thresholds, as well as the TCP read and write memory buffers, using sysctl, with no change in performance. E.g.

sudo sysctl -w net.ipv4.tcp_reordering=50
sudo sysctl -w net.ipv4.tcp_max_reordering=1000

sudo sysctl -w net.core.rmem_default=800000000
sudo sysctl -w net.core.wmem_default=800000000
sudo sysctl -w net.core.rmem_max=800000000
sudo sysctl -w net.core.wmem_max=800000000

# tcp_rmem/tcp_wmem take three values (min default max); a single value only sets the min
sudo sysctl -w net.ipv4.tcp_rmem="4096 87380 800000000"
sudo sysctl -w net.ipv4.tcp_wmem="4096 16384 800000000"
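(Reading the values back confirms what actually got applied:)

sysctl net.core.rmem_max net.core.wmem_max
sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem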

The only change in performance I notice is between machines with:
1) a stronger processor (slightly higher single-core clock; L3 cache doesn't seem to matter)
2) faster memory? (or fewer DIMMs for the same amount of memory)

This seems to imply that I am hitting a bus, CPU, or memory read/write limit. A simple "copy" locally within a ramdisk (e.g. dd if=file1 of=file2 bs=1M) results in an optimal speed of roughly 2.3 GiB/sec at 2.6 GHz, 2.2 GiB/sec at 2.4 GHz, and 2.0 GiB/sec at 2.2 GHz. The second one furthermore has slower memory, but it doesn't seem to matter.
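(For anyone repeating the local baseline, a tmpfs mount works as the ramdisk; this is a sketch with placeholder sizes and paths, not necessarily my exact setup:)

# Create a ramdisk big enough for the test file:
sudo mkdir -p /mnt/ramdisk
sudo mount -t tmpfs -o size=70G tmpfs /mnt/ramdisk

# 60 GB test file, then a pure memory-to-memory copy:
dd if=/dev/zero of=/mnt/ramdisk/file1 bs=1M count=61440
dd if=/mnt/ramdisk/file1 of=/mnt/ramdisk/file2 bs=1M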

All TCP copies TO the 2.6 GHz machine's ramdisk from the slower machines go at 1.15 GiB/s, and from the 2.4 GHz machine at 1.30 GiB/s; copies from the fastest machine to the middle machine go at 1.02 GiB/s, and to the slower machine (with faster memory) at 1.03 GiB/s, etc.

The biggest effect seems to be the single-core CPU speed and the memory clock on the receiving end. I have not compared BIOS settings, but all machines run the same BIOS version and use the same motherboards, Ethernet cards, etc. Rearranging the CAT7 cables or switch ports does not seem to have an effect.

I did find

http://louwrentius.com/achieving-340-mbs-network-file-transfers-using-linux-bonding.html

who does this with four 1GbE connections. I tried setting up separate VLANs, but it did not work (it did not increase speed).

Finally, sending to self using the same method seems to incur a 0.3-0.45 GiB/sec penalty. So, my observed values are not that much lower than the "theoretical" max for this method.

edit 3
(adding more info for posterity)

Even with balance-rr and the LAG set on the switch, I just realized that despite seeing 9.9 Gbps, the retries with balance-rr are actually higher than in the case without the LAG: roughly 2500 per second on average with the groups versus roughly 1000 without!

However, with the groups set, I get an average real memory-to-memory file transfer speed of 1.15 GiB/s (9.9 Gbps). If I only plug in a single port per machine, I see the same speed (1.15 GiB/s) and very few retries. If I switch the mode to balance-xor, I get 1.15 GiB/s (9.9 Gbps) and no resends. So balance-rr is trying to stripe on the host-to-switch side of things, and I guess that is causing a lot of out-of-order packets.
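(Switching the mode is just a change in the netplan parameters block, roughly like the sketch below; note that a bonding mode change may need the bond torn down or a reboot to actually take effect:)

# In the netplan file, change the bond parameters from the config above:
#     parameters:
#         mode: balance-xor
#         transmit-hash-policy: layer3+4
sudo netplan apply

# Confirm which mode the driver is actually running:
grep -i "bonding mode" /proc/net/bonding/bond0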

Since my maximum (real-world) performance for memory-to-memory transfers is similar or higher using switch LAG with balance-xor, while producing fewer resends (less congestion), I am using that. However, since the eventual goal is NFS and MPI sends, I will need to somehow find a way to saturate and measure network speed in those situations, which may depend upon how the MPI connections are implemented…

Final Edit

I moved back to using balance-rr (with no LAG set on the switch side), since XOR will always hash the same pair of peers to the same port, so it will only ever use one of the ports. Using balance-rr, if I run two or more (RAM-to-RAM) file transfers simultaneously, I can get a combined 18-19 Gbps, quite close to the theoretical max of 20 Gbps.
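(To reproduce that aggregate number with iperf3 instead of file copies, multiple parallel connections are needed; a sketch, with the second port number as a placeholder. Note that iperf3 runs its -P streams in a single thread, so two independent server/client pairs avoid a CPU bottleneck:)

# Option 1: two parallel streams in one iperf3 process:
iperf3 -c machine1 -P 2

# Option 2: two independent iperf3 pairs on different ports
# on machine1:
iperf3 -s -p 5201 &
iperf3 -s -p 5202 &
# on machine2:
iperf3 -c machine1 -p 5201 &
iperf3 -c machine1 -p 5202 &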

Final Final Edit (after using for a few months)

I had to set the LAG groups on the switch after all, because I was getting errors where I could no longer SSH into the machines; I assume the switch's MAC address table was getting confused about which port each machine was behind (with balance-rr and no LAG, the same MAC shows up on two ports). Now I only get a maximum of 10 Gbps per connection, but it is stable.

Best Answer

As I mentioned in my final edit, the reason I am not able to get higher bandwidth using round-robin bonding when the switch has Link Aggregation Groups set is that the switch's Link Aggregation Groups do not do round-robin striping of packets within a single TCP connection, whereas the Linux bonding driver does. This is mentioned in the kernel.org docs:

https://www.kernel.org/doc/Documentation/networking/bonding.txt

12.1.1 MT Bonding Mode Selection for Single Switch Topology

This configuration is the easiest to set up and to understand, although you will have to decide which bonding mode best suits your needs. The trade offs for each mode are detailed below:

balance-rr: This mode is the only mode that will permit a single TCP/IP connection to stripe traffic across multiple interfaces. It is therefore the only mode that will allow a single TCP/IP stream to utilize more than one interface's worth of throughput. This comes at a cost, however: the striping generally results in peer systems receiving packets out of order, causing TCP/IP's congestion control system to kick in, often by retransmitting segments.

It is possible to adjust TCP/IP's congestion limits by altering the net.ipv4.tcp_reordering sysctl parameter. The usual default value is 3. But keep in mind TCP stack is able to automatically increase this when it detects reorders.

Note that the fraction of packets that will be delivered out of order is highly variable, and is unlikely to be zero. The level of reordering depends upon a variety of factors, including the networking interfaces, the switch, and the topology of the configuration. Speaking in general terms, higher speed network cards produce more reordering (due to factors such as packet coalescing), and a "many to many" topology will reorder at a higher rate than a "many slow to one fast" configuration.

Many switches do not support any modes that stripe traffic (instead choosing a port based upon IP or MAC level addresses); for those devices, traffic for a particular connection flowing through the switch to a balance-rr bond will not utilize greater than one interface's worth of bandwidth.

If you are utilizing protocols other than TCP/IP, UDP for example, and your application can tolerate out of order delivery, then this mode can allow for single stream datagram performance that scales near linearly as interfaces are added to the bond.

This mode requires the switch to have the appropriate ports configured for "etherchannel" or "trunking."

The last note about having ports configured for "etherchannel" or "trunking" is odd, since when I put the ports into a LAG, all outgoing Tx from the switch goes down a single port. Removing the LAG makes it send and receive half-and-half on each port, but that results in many resends, I assume due to out-of-order packets. However, I still get an increase in bandwidth.
