As I mentioned in my final edit, the reason I am not able to get higher bandwidth using round-robin bonding when the switch has Link Aggregation Groups configured is that switch LAGs do not stripe the packets of a single TCP connection round-robin across ports, whereas Linux bonding does. This is mentioned in the kernel.org bonding documentation:
12.1.1 MT Bonding Mode Selection for Single Switch Topology
This configuration is the easiest to set up and to understand, although you will have to decide which bonding mode best suits your
needs. The trade offs for each mode are detailed below:
balance-rr: This mode is the only mode that will permit a single TCP/IP connection to stripe traffic across multiple interfaces. It is
therefore the only mode that will allow a single TCP/IP stream to
utilize more than one interface's worth of throughput. This comes at a
cost, however: the striping generally results in peer systems
receiving packets out of order, causing TCP/IP's congestion control
system to kick in, often by retransmitting segments.
It is possible to adjust TCP/IP's congestion limits by altering the net.ipv4.tcp_reordering sysctl parameter. The usual default value
is 3. But keep in mind TCP stack is able to automatically increase
this when it detects reorders.
Note that the fraction of packets that will be delivered out of order is highly variable, and is unlikely to be zero. The level of
reordering depends upon a variety of factors, including the networking
interfaces, the switch, and the topology of the configuration.
Speaking in general terms, higher speed network cards produce more
reordering (due to factors such as packet coalescing), and a "many to
many" topology will reorder at a higher rate than a "many slow to one
fast" configuration.
Many switches do not support any modes that stripe traffic (instead choosing a port based upon IP or MAC level addresses); for
those devices, traffic for a particular connection flowing through the
switch to a balance-rr bond will not utilize greater than one
interface's worth of bandwidth.
If you are utilizing protocols other than TCP/IP, UDP for example, and your application can tolerate out of order delivery, then this
mode can allow for single stream datagram performance that scales near
linearly as interfaces are added to the bond.
This mode requires the switch to have the appropriate ports configured for "etherchannel" or "trunking."
The last note about having ports configured for "trunking" is odd, since when I put the ports in a LAG, all outgoing Tx from the switch goes down a single port. Removing the LAG makes it send and receive half and half on each port, but results in many resends, I assume due to out-of-order packets. However, I still get an increase in bandwidth.
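For reference, the net.ipv4.tcp_reordering threshold mentioned in the docs can be inspected and raised with sysctl; a small sketch (the value 10 is only an example, not a recommendation):

```
# show the current reordering threshold (the usual default is 3)
sysctl net.ipv4.tcp_reordering

# allow more out-of-order segments before TCP treats them as loss
sysctl -w net.ipv4.tcp_reordering=10
```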
Best Answer
This is doable in Linux with `iptables` and `tc`. You configure iptables to `MARK` packets on a connection once some number of bytes have been transferred. You then use `tc` to put those marked packets into a class in a queuing discipline that rate-limits the bandwidth.

One somewhat tricky part is limiting the connection for both uploads and downloads. `tc` doesn't support traffic shaping of ingress. You can get around this by shaping egress on your webserver-facing interface (which will shape downloads to your webserver), and shaping egress on your upstream-provider-facing interface (which will shape uploads from your webserver). You aren't really shaping the ingress (download) traffic, as you can't control how quickly your upstream provider sends data. But shaping your webserver-facing interface will result in packets being dropped and the uploader shrinking their TCP window to accommodate the bandwidth limit.

Example: (assumes this is on a Linux-based router, where the webserver-facing interface is `eth0` and the upstream-facing interface is `eth1`)
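A minimal sketch of what this could look like, assuming a 50 MB connbytes threshold, a 2mbit cap, and fwmark 1 (all placeholder values):

```
# mark forwarded packets belonging to connections that have already
# transferred more than ~50 MB in both directions combined
iptables -t mangle -A FORWARD -p tcp -m connbytes --connbytes 52428800: \
    --connbytes-dir both --connbytes-mode bytes -j MARK --set-mark 1

# shape downloads toward the webserver (egress on eth0)
# rates are placeholders; 1:10 should match your actual link speed
tc qdisc add dev eth0 root handle 1: htb default 10
tc class add dev eth0 parent 1: classid 1:10 htb rate 1gbit
tc class add dev eth0 parent 1: classid 1:20 htb rate 2mbit ceil 2mbit
tc filter add dev eth0 parent 1: protocol ip handle 1 fw flowid 1:20

# shape uploads from the webserver (egress on eth1)
tc qdisc add dev eth1 root handle 1: htb default 10
tc class add dev eth1 parent 1: classid 1:10 htb rate 1gbit
tc class add dev eth1 parent 1: classid 1:20 htb rate 2mbit ceil 2mbit
tc filter add dev eth1 parent 1: protocol ip handle 1 fw flowid 1:20
```

Unmarked traffic falls into the 1:10 default class; only packets carrying the mark land in the 2mbit class.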
If you want to do this on the webserver itself instead of on a Linux router, you can still use the upload portions of the above. One notable change is you'd replace `FORWARD` with `OUTPUT`. For download you'd need to set up a queuing discipline using an "Intermediate Functional Block" device, or `ifb`. In short, it uses a virtual interface so that you can treat ingress traffic as egress, and shape it from there using `tc`. More info on how to set up an `ifb` can be found here: https://serverfault.com/questions/350023/tc-ingress-policing-and-ifb-mirroring
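A rough sketch of the ifb mechanics, assuming eth0 is the interface whose ingress you want to shape. Note that iptables marks are not yet applied when the ingress qdisc redirects the packet, so classifying the marked connection on ifb0 takes extra steps (e.g. connection-mark tricks) that the linked answer covers:

```
# sketch only: create and bring up a virtual ifb device
modprobe ifb numifbs=1
ip link set dev ifb0 up

# redirect everything arriving on eth0 to ifb0
tc qdisc add dev eth0 handle ffff: ingress
tc filter add dev eth0 parent ffff: protocol ip u32 match u32 0 0 \
    action mirred egress redirect dev ifb0

# ifb0 now sees that traffic as egress, so a normal htb hierarchy applies;
# filters to pick out the throttled connection would be added here
tc qdisc add dev ifb0 root handle 1: htb default 10
tc class add dev ifb0 parent 1: classid 1:10 htb rate 1gbit
tc class add dev ifb0 parent 1: classid 1:20 htb rate 2mbit ceil 2mbit
```

For the upload side on the webserver itself, the marking rule from the router sketch simply moves from the FORWARD chain to OUTPUT.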
can be found here: https://serverfault.com/questions/350023/tc-ingress-policing-and-ifb-mirroringNote that this type of stuff tends to require a lot of tuning to scale. One immediate concern is that
connbytes
relies upon theconntrack
module, which tends to hit scaling walls with large numbers of connections. I'd recommend heavy load testing.Another caveat is that this doesn't work at all for UDP, since it is stateless. There are other techniques to tackle that, but it looks like your requirements are for TCP only.
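If conntrack scaling becomes a concern, current usage versus the table limit can be checked with the standard sysctls, for example:

```
# tracked connections right now vs. the configured maximum
sysctl net.netfilter.nf_conntrack_count
sysctl net.netfilter.nf_conntrack_max
```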
Also, to undo all of the above, do the following:
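A sketch of the cleanup, matching the placeholder rules above (substitute whatever thresholds, marks, and devices you actually used):

```
# delete the marking rule (same arguments it was added with)
iptables -t mangle -D FORWARD -p tcp -m connbytes --connbytes 52428800: \
    --connbytes-dir both --connbytes-mode bytes -j MARK --set-mark 1

# tear down the shaping on both interfaces
tc qdisc del dev eth0 root
tc qdisc del dev eth1 root

# and, if the ifb variant was used:
tc qdisc del dev eth0 ingress
tc qdisc del dev ifb0 root
ip link set dev ifb0 down
```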