As I mentioned in my final edit, the reason I am not able to get higher bandwidth using round-robin bonding when the switch has Link Aggregation Groups configured is that switch LAGs do not stripe the packets of a single TCP connection round-robin across ports, whereas Linux bonding does. This is mentioned in the kernel.org bonding documentation:
12.1.1 MT Bonding Mode Selection for Single Switch Topology
This configuration is the easiest to set up and to understand, although you will have to decide which bonding mode best suits your
needs. The trade offs for each mode are detailed below:
balance-rr: This mode is the only mode that will permit a single TCP/IP connection to stripe traffic across multiple interfaces. It is
therefore the only mode that will allow a single TCP/IP stream to
utilize more than one interface's worth of throughput. This comes at a
cost, however: the striping generally results in peer systems
receiving packets out of order, causing TCP/IP's congestion control
system to kick in, often by retransmitting segments.
It is possible to adjust TCP/IP's congestion limits by altering the net.ipv4.tcp_reordering sysctl parameter. The usual default value
is 3. But keep in mind TCP stack is able to automatically increase
this when it detects reorders.
Note that the fraction of packets that will be delivered out of order is highly variable, and is unlikely to be zero. The level of
reordering depends upon a variety of factors, including the networking
interfaces, the switch, and the topology of the configuration.
Speaking in general terms, higher speed network cards produce more
reordering (due to factors such as packet coalescing), and a "many to
many" topology will reorder at a higher rate than a "many slow to one
fast" configuration.
Many switches do not support any modes that stripe traffic (instead choosing a port based upon IP or MAC level addresses); for
those devices, traffic for a particular connection flowing through the
switch to a balance-rr bond will not utilize greater than one
interface's worth of bandwidth.
If you are utilizing protocols other than TCP/IP, UDP for example, and your application can tolerate out of order delivery, then this
mode can allow for single stream datagram performance that scales near
linearly as interfaces are added to the bond.
This mode requires the switch to have the appropriate ports configured for "etherchannel" or "trunking."
The last note about having ports configured for "trunking" is odd, since when I put the ports in a LAG, all outgoing Tx from the switch goes down a single port. Removing the LAG makes it send and receive half and half on each port, but results in many resends, I assume due to out-of-order packets. However, I still get an increase in bandwidth.
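For reference, the net.ipv4.tcp_reordering threshold mentioned in the docs can be inspected and raised with sysctl; a small sketch (the value 10 is only an example, not a recommendation):

```
# show the current reordering threshold (the usual default is 3)
sysctl net.ipv4.tcp_reordering

# allow more out-of-order segments before TCP treats them as loss
sysctl -w net.ipv4.tcp_reordering=10
```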
Best Answer
This is doable in Linux with `iptables` and `tc`. You configure iptables to `MARK` packets on a connection once some number of bytes have been transferred. You then use `tc` to put those marked packets into a class in a queuing discipline that rate-limits the bandwidth.

One somewhat tricky part is limiting the connection for both uploads and downloads. `tc` doesn't support traffic shaping of ingress. You can get around this by shaping egress on your webserver-facing interface (which will shape downloads to your webserver), and shaping egress on your upstream-provider-facing interface (which will shape uploads from your webserver). You aren't really shaping the ingress (download) traffic, as you can't control how quickly your upstream provider sends data. But shaping your webserver-facing interface will result in packets being dropped and the uploader shrinking their TCP window to accommodate the bandwidth limit.

Example: (assumes this is on a Linux-based router, where the webserver-facing interface is `eth0` and the upstream-facing interface is `eth1`)
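A minimal sketch of what this could look like, assuming a 50 MB connbytes threshold, a 2mbit cap, and fwmark 1 (all placeholder values):

```
# mark forwarded packets belonging to connections that have already
# transferred more than ~50 MB in both directions combined
iptables -t mangle -A FORWARD -p tcp -m connbytes --connbytes 52428800: \
    --connbytes-dir both --connbytes-mode bytes -j MARK --set-mark 1

# shape downloads toward the webserver (egress on eth0)
# rates are placeholders; 1:10 should match your actual link speed
tc qdisc add dev eth0 root handle 1: htb default 10
tc class add dev eth0 parent 1: classid 1:10 htb rate 1gbit
tc class add dev eth0 parent 1: classid 1:20 htb rate 2mbit ceil 2mbit
tc filter add dev eth0 parent 1: protocol ip handle 1 fw flowid 1:20

# shape uploads from the webserver (egress on eth1)
tc qdisc add dev eth1 root handle 1: htb default 10
tc class add dev eth1 parent 1: classid 1:10 htb rate 1gbit
tc class add dev eth1 parent 1: classid 1:20 htb rate 2mbit ceil 2mbit
tc filter add dev eth1 parent 1: protocol ip handle 1 fw flowid 1:20
```

Unmarked traffic falls into the 1:10 default class; only packets carrying the mark land in the 2mbit class.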
If you want to do this on the webserver itself instead of on a Linux router, you can still use the upload portions of the above. One notable change is you'd replace `FORWARD` with `OUTPUT`. For download you'd need to set up a queuing discipline using an "Intermediate Functional Block" device, or `ifb`. In short, it uses a virtual interface so that you can treat ingress traffic as egress, and shape it from there using `tc`. More info on how to set up an `ifb` can be found here: https://serverfault.com/questions/350023/tc-ingress-policing-and-ifb-mirroring
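A rough sketch of the ifb mechanics, assuming eth0 is the interface whose ingress you want to shape. Note that iptables marks are not yet applied when the ingress qdisc redirects the packet, so classifying the marked connection on ifb0 takes extra steps (e.g. connection-mark tricks) that the linked answer covers:

```
# sketch only: create and bring up a virtual ifb device
modprobe ifb numifbs=1
ip link set dev ifb0 up

# redirect everything arriving on eth0 to ifb0
tc qdisc add dev eth0 handle ffff: ingress
tc filter add dev eth0 parent ffff: protocol ip u32 match u32 0 0 \
    action mirred egress redirect dev ifb0

# ifb0 now sees that traffic as egress, so a normal htb hierarchy applies;
# filters to pick out the throttled connection would be added here
tc qdisc add dev ifb0 root handle 1: htb default 10
tc class add dev ifb0 parent 1: classid 1:10 htb rate 1gbit
tc class add dev ifb0 parent 1: classid 1:20 htb rate 2mbit ceil 2mbit
```

For the upload side on the webserver itself, the marking rule from the router sketch simply moves from the FORWARD chain to OUTPUT.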
can be found here: https://serverfault.com/questions/350023/tc-ingress-policing-and-ifb-mirroringNote that this type of stuff tends to require a lot of tuning to scale. One immediate concern is that
connbytes
relies upon theconntrack
module, which tends to hit scaling walls with large numbers of connections. I'd recommend heavy load testing.Another caveat is that this doesn't work at all for UDP, since it is stateless. There are other techniques to tackle that, but it looks like your requirements are for TCP only.
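If conntrack scaling becomes a concern, current usage versus the table limit can be checked with the standard sysctls, for example:

```
# tracked connections right now vs. the configured maximum
sysctl net.netfilter.nf_conntrack_count
sysctl net.netfilter.nf_conntrack_max
```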
Also, to undo all of the above, do the following:
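A sketch of the cleanup, matching the placeholder rules above (substitute whatever thresholds, marks, and devices you actually used):

```
# delete the marking rule (same arguments it was added with)
iptables -t mangle -D FORWARD -p tcp -m connbytes --connbytes 52428800: \
    --connbytes-dir both --connbytes-mode bytes -j MARK --set-mark 1

# tear down the shaping on both interfaces
tc qdisc del dev eth0 root
tc qdisc del dev eth1 root

# and, if the ifb variant was used:
tc qdisc del dev eth0 ingress
tc qdisc del dev ifb0 root
ip link set dev ifb0 down
```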