Linux – minimal TCP MSS in Linux


The TCP MSS in Linux must be at least 88 (include/net/tcp.h):

/* Minimal accepted MSS. It is (60+60+8) - (20+20). */
#define TCP_MIN_MSS             88U

My question is: where did they come up with "60 + 60 + 8" and why? I get that 20 + 20 comes from the IP header + TCP header.

EDIT: After taking a closer look at the headers, the formula looks to me like this:

(MAX_IP_HDR + MAX_TCP_HDR + MIN_IP_FRAG) - (MIN_IP_HDR + MIN_TCP_HDR)

The question still stands: why? Why does the Linux kernel use this formula, thereby prohibiting (a forced flow of) TCP segments of, say, 20 bytes? Think iperf here.
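The lower bound can be probed from user space. The following is a minimal sketch, assuming a Linux host and Python's standard `socket` module; the helper name `mss_accepted` is hypothetical. It asks the kernel to accept a given value for the `TCP_MAXSEG` socket option on a fresh, unconnected socket and reports whether the call succeeded.

```python
import socket


def mss_accepted(mss: int) -> bool:
    """Return True if the kernel accepts `mss` as TCP_MAXSEG on a fresh socket.

    Hypothetical helper for illustration; it only reports whether
    setsockopt() raised an error, not which MSS ends up on the wire.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_MAXSEG, mss)
        except OSError:
            return False
        return True


if __name__ == "__main__":
    # Values around the TCP_MIN_MSS boundary quoted above.
    for candidate in (20, 87, 88, 536):
        print(candidate, mss_accepted(candidate))
```

On a kernel that enforces TCP_MIN_MSS for this option, one would expect values below 88 to be rejected; other systems may behave differently, so the sketch prints the result rather than assuming it.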

EDIT2: Here's my use case. By forcing a low MSS on a socket/connection, all the packets sent by the stack will have a small size. I want to set a low MSS when working with iperf for packets-per-second testing. I can't get IP packets smaller than 128 bytes (Ethernet frames of 142 bytes) on the wire because of this lower limit for the MSS! I would like to get as close as possible to an Ethernet frame size of 64 bytes, as per RFC 2544. Theoretically this should be possible: 18 + 20 + 20 < 64.
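The sizes quoted above can be reproduced with a little arithmetic. This sketch assumes that the 142-byte figure counts the 14-byte Ethernet header without the 4-byte FCS, while the 18 in "18 + 20 + 20" counts header plus FCS; that reading is inferred from the numbers, and the function names are illustrative only.

```python
ETH_HEADER = 14     # destination MAC + source MAC + EtherType
ETH_FCS = 4         # frame check sequence (assumed to be the extra 4 in "18")
MIN_IP_HDR = 20     # IPv4 header without options
MIN_TCP_HDR = 20    # TCP header without options
ETH_MIN_FRAME = 64  # smallest frame size targeted in the question


def ip_packet_len(mss: int) -> int:
    """IP packet length when a full MSS of data and options is sent."""
    return MIN_IP_HDR + MIN_TCP_HDR + mss


def frame_len(mss: int, with_fcs: bool = False) -> int:
    """Ethernet frame length for that packet, optionally counting the FCS."""
    return ETH_HEADER + (ETH_FCS if with_fcs else 0) + ip_packet_len(mss)


def payload_for_min_frame() -> int:
    """TCP bytes that fit in a 64-byte frame with minimal headers."""
    return ETH_MIN_FRAME - (ETH_HEADER + ETH_FCS + MIN_IP_HDR + MIN_TCP_HDR)


print(ip_packet_len(88))        # 128
print(frame_len(88))            # 142
print(payload_for_min_frame())  # 6
```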

Best Answer

An implementation is required to support the maximum-sized TCP and IP headers, which are 60 bytes each.

An implementation must support 576-byte datagrams, which even with maximum-sized headers leaves more than 8 bytes of data in the datagram. To send datagrams with more than 8 bytes of data, IP fragmentation must put at least 8 bytes of data in at least one of the packets that represent the fragments of the datagram. Thus an implementation must support at least 8 bytes of data in a packet.

Putting this together, an implementation must support 60+60+8 byte packets.

When we send packets that are part of a TCP stream, they have a 20-byte IP header (plus options) and a 20-byte TCP header (plus options). That leaves (60+60+8)-(20+20) = 88 bytes for data and options. Hence this is the largest TCP MSS we can safely assume every implementation supports.
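The derivation can be written out as plain arithmetic. The constant names below follow the formula in the question's first EDIT; they are labels for this sketch, not identifiers taken from the kernel source (only `TCP_MIN_MSS` is).

```python
MAX_IP_HDR = 60    # IPv4 header with the maximum amount of options
MAX_TCP_HDR = 60   # TCP header with the maximum amount of options
MIN_IP_FRAG = 8    # smallest data unit IP fragmentation works with
MIN_IP_HDR = 20    # IPv4 header without options
MIN_TCP_HDR = 20   # TCP header without options
MIN_DATAGRAM = 576 # datagram size every implementation must support

# Smallest packet every implementation must be able to handle.
min_supported_packet = MAX_IP_HDR + MAX_TCP_HDR + MIN_IP_FRAG

# What remains for TCP options and data once the minimal headers are removed.
TCP_MIN_MSS = min_supported_packet - (MIN_IP_HDR + MIN_TCP_HDR)

# Data left in a 576-byte datagram even with maximum-sized headers.
data_in_min_datagram = MIN_DATAGRAM - (MAX_IP_HDR + MAX_TCP_HDR)

print(min_supported_packet)   # 128
print(TCP_MIN_MSS)            # 88
print(data_in_min_datagram)   # 456
```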

Related Solutions

TCP – Troubleshooting TCP Issues on a Linux Laptop

In the capture you provided, the Time Stamp Echo Reply in the SYN-ACK in the second packet doesn't match the TSVal in the SYN in the first packet and is a few seconds behind.

Note, too, that the TSecr values sent by both 173.194.70.108 and 209.85.148.100 are all identical and bear no relation to the TSval you send.
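The check described above can be sketched as a small function. It assumes the TSval and TSecr fields have already been extracted from a capture into plain integers (for example by hand from the packet dissection); the function name and the sample numbers are hypothetical and not taken from the capture in question.

```python
def unmatched_tsecr(sent_tsvals, received_tsecrs):
    """Return the received TSecr values that echo no TSval we ever sent.

    A well-behaved peer echoes a TSval it previously received from us,
    so every TSecr should appear in the set of TSvals we sent.
    """
    sent = set(sent_tsvals)
    return [tsecr for tsecr in received_tsecrs if tsecr not in sent]


# Hypothetical values: the peer keeps echoing 1000, which we never sent.
ours = [5000, 5010, 5020]
theirs = [1000, 1000, 1000]
print(unmatched_tsecr(ours, theirs))  # [1000, 1000, 1000]
```

An empty result would mean every echo corresponds to something that was actually sent; a result like the one above is the pattern described in the answer.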

It looks like something is meddling with the TCP timestamps. I have no idea what may be causing that, but it appears to be outside your machine. Does rebooting the router help in this instance?

I don't know if it's what's causing your machine to send a RST (on the 3rd packet). But it definitely doesn't like that SYN-ACK, and it's the only thing wrong I can find about it. The only other explanation I can think of is if it's not your machine that is sending the RST but given the time difference between the SYN-ACK and RST I would doubt so. But just in case, do you use virtual machines or containers or network namespaces on this machine?

You could try disabling TCP timestamps altogether to see if that helps:

sudo sysctl -w net.ipv4.tcp_timestamps=0

So, either those sites send bogus TSecr or there's something on the way there (any router on the way, or transparent proxy) that mangles either the outgoing TSVal or the incoming TSecr, or a proxy with a bogus TCP stack. Why one would mangle the tcp timestamps I can only speculate: bug, intrusion detection evasion, a too-smart/bogus traffic shaping algorithm. That's not something I've heard of before (but then I'm no expert on the subject).

How to investigate further:

See whether the TPLink router is to blame, either by resetting it to see if that helps or, if possible, by capturing the traffic on its outside as well to see whether it mangles the timestamps.
Check whether there's a transparent proxy on the way by playing with TTLs, looking at the request headers received by web servers, or observing the behaviour when requesting dead websites.
Capture traffic on a remote web server to see whether it's the TSval or the TSecr that is mangled.

Linux – Changing the TCP RTO value in Linux

The reason you can't alter the RTO specifically is because it is not a static value. Instead (except for the initial SYN, naturally) it is based on the RTT (Round Trip Time) for each connection. Actually, it is based on a smoothed version of RTT and the RTT variance with some constants thrown into the mix. Hence, it is a dynamic, calculated value for each TCP connection, and I highly recommend this article which goes into more detail on the calculation and RTO in general.
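To make the "smoothed RTT plus variance" description concrete, here is a minimal sketch of the estimator as specified in RFC 6298 (alpha = 1/8, beta = 1/4, K = 4). The class name, the 1 ms clock granularity, and the configurable bounds are assumptions of this sketch; it is not the Linux implementation, which uses its own fixed-point variant and different bounds.

```python
class RtoEstimator:
    """RTO estimator following the update rules of RFC 6298 (times in seconds)."""

    ALPHA = 1 / 8  # gain for the smoothed RTT
    BETA = 1 / 4   # gain for the RTT variance
    K = 4          # variance multiplier

    def __init__(self, granularity=0.001, min_rto=1.0, max_rto=60.0):
        self.granularity = granularity  # clock granularity G (assumed 1 ms)
        self.min_rto = min_rto
        self.max_rto = max_rto
        self.srtt = None
        self.rttvar = None
        self.rto = 1.0  # value used before any RTT sample exists

    def update(self, sample):
        """Feed one RTT measurement and return the new RTO."""
        if self.srtt is None:
            # First measurement.
            self.srtt = sample
            self.rttvar = sample / 2
        else:
            # RTTVAR must be updated with the old SRTT, then SRTT itself.
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - sample)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * sample
        rto = self.srtt + max(self.granularity, self.K * self.rttvar)
        self.rto = min(max(rto, self.min_rto), self.max_rto)
        return self.rto


est = RtoEstimator()
print(est.update(0.004))  # 1.0, because the 12 ms result is rounded up to the 1 s floor
```

Passing a smaller `min_rto` shows the unclamped value, which is how the same rules can yield RTOs far below one second when an implementation chooses a lower floor.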

Also relevant is RFC 6298 which states (among a lot of other things):

Whenever RTO is computed, if it is less than 1 second, then the RTO SHOULD be rounded up to 1 second.

Does the kernel always set RTO to 1 second then? Well, with Linux you can show the current RTO values for your open connections by running the ss -i command:

State       Recv-Q Send-Q                                                  Local Address:Port     Peer Address:Port
ESTAB       0      0                                                           10.0.2.15:52861   216.58.219.46:http
     cubic rto:204 rtt:4/2 cwnd:10 send 29.2Mbps rcv_space:14600
ESTAB       0      0                                                           10.0.2.15:ssh          10.0.2.2:52586
     cubic rto:201 rtt:1.5/0.75 ato:40 cwnd:10 send 77.9Mbps rcv_space:14600
ESTAB       0      0                                                           10.0.2.15:52864   216.58.219.46:http
     cubic rto:204 rtt:4.5/4.5 cwnd:10 send 26.0Mbps rcv_space:14600

The above is the output from a VM that I am logged into over SSH and that has a couple of connections open to google.com. As you can see, the RTO is in fact set to 200-ish (milliseconds). You will note that it is not rounded up to the 1 second value from the RFC, and you may also think that it's a little high. That's because there are min (200 milliseconds) and max (120 seconds) bounds in play when it comes to RTO for Linux (there is a great explanation of this in the article I linked above).
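If you want to track these values over time rather than read them by eye, the detail lines can be parsed. This is a rough sketch that assumes the `rto:` and `rtt:<srtt>/<rttvar>` fields appear as in the output above, with values in milliseconds; the function name is hypothetical and other `ss` versions may print additional fields.

```python
import re

_RTO = re.compile(r"\brto:(\d+(?:\.\d+)?)")
_RTT = re.compile(r"\brtt:(\d+(?:\.\d+)?)/(\d+(?:\.\d+)?)")


def parse_ss_info(line):
    """Extract rto, rtt and rttvar (ms) from one `ss -i` detail line.

    Returns None for lines that do not carry both fields, such as the
    header line or the address line of each connection.
    """
    rto = _RTO.search(line)
    rtt = _RTT.search(line)
    if rto is None or rtt is None:
        return None
    return {
        "rto_ms": float(rto.group(1)),
        "rtt_ms": float(rtt.group(1)),
        "rttvar_ms": float(rtt.group(2)),
    }


sample = "     cubic rto:204 rtt:4/2 cwnd:10 send 29.2Mbps rcv_space:14600"
print(parse_ss_info(sample))
```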

So, you can't alter the RTO value directly, but for lossy networks (like wireless) you can try tweaking F-RTO (this may already be enabled depending on your distro). There are actually two options related to F-RTO that you can tweak (good summary here):

net.ipv4.tcp_frto
net.ipv4.tcp_frto_response

Depending on what you are trying to optimize for, these may or may not be useful.
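Before changing either setting it can help to see what the running kernel currently reports. The sketch below assumes the usual mapping of a sysctl name to a file under /proc/sys (dots become path separators); the helper names are hypothetical. Because a given kernel may not expose every key, a missing key yields `None` instead of an error.

```python
from pathlib import Path


def sysctl_path(name, root="/proc/sys"):
    """Map a dotted sysctl name to its file under /proc/sys."""
    return Path(root, *name.split("."))


def read_sysctl(name, root="/proc/sys"):
    """Return the current value as a string, or None if it cannot be read."""
    try:
        return sysctl_path(name, root).read_text().strip()
    except OSError:
        return None


for key in ("net.ipv4.tcp_frto", "net.ipv4.tcp_frto_response"):
    print(key, "=", read_sysctl(key))
```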

EDIT: following up on the ability to tweak the rto_min/max values for TCP from the comments.

You can't change the global minimum RTO for TCP (as an aside, you can do it for SCTP - those are exposed in sysctl), but the good news is that you can tweak the minimum value of the RTO on a per-route basis. Here's my routing table on my CentOS VM:

ip route
10.0.2.0/24 dev eth0  proto kernel  scope link  src 10.0.2.15 
169.254.0.0/16 dev eth0  scope link  metric 1002 
default via 10.0.2.2 dev eth0

I can change the rto_min value on the default route as follows:

ip route change default via 10.0.2.2 dev eth0 rto_min 5ms

And now, my routing table looks like this:

ip route
10.0.2.0/24 dev eth0  proto kernel  scope link  src 10.0.2.15 
169.254.0.0/16 dev eth0  scope link  metric 1002 
default via 10.0.2.2 dev eth0  rto_min lock 5ms

Finally, let's initiate a connection and check out ss -i to see if this has been respected:

ss -i
State       Recv-Q Send-Q                                               Local Address:Port                                                   Peer Address:Port   
ESTAB       0      0                                                        10.0.2.15:ssh                                                        10.0.2.2:50714   
     cubic rto:201 rtt:1.5/0.75 ato:40 cwnd:10 send 77.9Mbps rcv_space:14600
ESTAB       0      0                                                        10.0.2.15:39042                                                 216.58.216.14:http    
     cubic rto:15 rtt:5/2.5 cwnd:10 send 23.4Mbps rcv_space:14600

Success! The rto on the HTTP connection (after the change) is 15ms, whereas the SSH connection (before the change) is 200+ as before.
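The numbers in both `ss -i` listings are consistent with a simple approximation: the RTO is the smoothed RTT plus the larger of four times the RTT variance and the minimum RTO. The sketch below is only that approximation, checked against the values shown in this answer; it is not the kernel's code, which works in fixed point and tracks the deviation differently, so small differences are expected.

```python
def approx_linux_rto(srtt_ms, rttvar_ms, rto_min_ms=200.0):
    """Approximate RTO in ms: srtt + max(4 * rttvar, rto_min)."""
    return srtt_ms + max(4 * rttvar_ms, rto_min_ms)


# (rto reported by ss, srtt, rttvar, rto_min in effect), all in milliseconds,
# copied from the outputs above.
SAMPLES = [
    (204, 4.0, 2.0, 200.0),
    (201, 1.5, 0.75, 200.0),
    (204, 4.5, 4.5, 200.0),
    (15, 5.0, 2.5, 5.0),  # HTTP connection after setting rto_min to 5ms
]

for reported, srtt, rttvar, rto_min in SAMPLES:
    estimate = approx_linux_rto(srtt, rttvar, rto_min)
    print(f"reported={reported} estimate={estimate}")
```

Under this approximation the 200 ms floor dominates the first three connections, while on the modified route the variance term (4 × 2.5 ms) is what sets the 15 ms value.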

I actually like this approach - it allows you to set the lower value on appropriate routes rather than globally where it might screw up other traffic. Similarly (see the ip man page) you can tweak the initial rtt estimate and the initial rttvar for the route (used when calculating the dynamic RTO). While it's not a complete solution in terms of tweaking, I think most of the important pieces are there. You can't tweak the max setting, but I think that is not going to be as useful generally in any case.
