SSH: Fix Connection Freezes on 4G Hotspot

Tags: openssh, ssh

Short Description

I have been seeing a strange behaviour with my SSH connection for years but never thought of asking about it until today. I have searched a lot about this but couldn't find any explanation.

Environment

  1. Basically, I have various AWS EC2 instances running in different regions (Ireland, Mumbai, etc.).
  2. I have a Mac.
  3. And I'm located in India (in case that suggests a reason to someone).

Problem Statement 1

When my Mac is connected to a personal hotspot (from a Samsung device or an iPhone) over a 4G network, my SSH connection freezes after a few minutes (no more than 3) if I do not work in the SSH session (basically, when the connection goes idle). So I have to keep pressing an arrow key just to keep it alive.

Problem Statement 2 (which is not a problem)

But when my Mac is connected to a Wi-Fi broadband connection, this problem never occurs. My SSH connection stays connected for hours, even after I wake my Mac from sleep (open the lid).

Based on my Googling again today, I found various articles which suggest using options like TCPKeepAlive or ServerAliveInterval:

  1. What options `ServerAliveInterval` and `ClientAliveInterval` in sshd_config exactly do?
  2. How does tcp-keepalive work in ssh?
  3. https://raspberrypi.stackexchange.com/questions/8311/ssh-connection-timeout-when-connecting
  4. https://patrickmn.com/aside/how-to-keep-alive-ssh-sessions/

But I couldn't find any post that describes this problem. Do any of you have an idea about this behaviour? I'll be happy to provide any details about my 4G hotspot connection.

Best Answer

I would surmise that a system tracking (and forgetting) connections statefully is causing this. When NAT is in use (and it very often is when you're not on IPv6), the system doing NAT needs to keep state in memory to know where to send back replies. For your Wi-Fi broadband, the system doing NAT might have a longer memory for active connections (for example, Linux netfilter's conntrack by default remembers established TCP connections for 5 days, while it remembers UDP flows for only 2 or 3 minutes). The equivalent system doing NAT on your 4G path probably has a shorter memory, a bit less than 3 minutes.
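This is the same class of problem that OS-level TCP keepalive (mentioned in the question) is designed to defeat: periodic empty probes refresh the NAT device's tracking entry. A minimal Python sketch of enabling it on a raw socket; the 115-second idle value is an assumption chosen to sit under the suspected ~2 minute tracking window, and the option name is guarded because Linux calls it `TCP_KEEPIDLE` while macOS calls it `TCP_KEEPALIVE`:

```python
import socket

# Create a TCP socket and enable OS-level keepalive probes on it.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

# Platform-specific tuning: start probing after 115 s of idleness.
# The constant's name differs per platform, so probe for it.
for name in ("TCP_KEEPIDLE", "TCP_KEEPALIVE"):
    if hasattr(socket, name):
        s.setsockopt(socket.IPPROTO_TCP, getattr(socket, name), 115)
        break

# Confirm keepalive is now enabled on the socket.
assert s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
s.close()
```

Note that ssh's own TCPKeepAlive option does exactly this, but the kernel's default idle time (2 hours on Linux) is far too long for a NAT entry that expires in minutes, which is why the SSH-protocol-level option below is the practical fix.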

To work around this, as you found and linked in your question, you can set the client-side ssh option ServerAliveInterval, which periodically sends empty data (inside the SSH protocol) when there is no activity, in a way similar to TCP keepalive. This makes the connection always appear active to the system doing NAT, so it won't forget it. So in your ~/.ssh/config file you could add:

ServerAliveInterval 115

with 115 chosen to be slightly less than 2 minutes, to stay conservative: a value lower than the estimated tracking duration of active connections on the invisible NAT device in the path, but not too low either (see below). That way, at worst, when the tracking state is 5 s from being deleted, it gets refreshed back to its supposed 120 s lifespan.

The drawback is that (on your Wi-Fi broadband access, anyway) if you lose connectivity for some time and then recover it, these keepalives might make the client think the remote server is down, and it will close the connection. You can also tweak ServerAliveCountMax for this, but with its default value of 3, that would require something like 3 × 115 = 345 s of connectivity loss (more than 5 minutes) before this problem could occur.
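Putting the two client-side options together, a sketch of what the ~/.ssh/config stanza could look like (the `Host *` scope is just one possible choice, and the CountMax value of 3 simply restates the default):

```
# ~/.ssh/config
Host *
    # Send an SSH-level keepalive after 115 s of inactivity,
    # just under the suspected ~2 min NAT tracking window.
    ServerAliveInterval 115
    # Close the connection only after 3 consecutive unanswered
    # keepalives, i.e. roughly 345 s without connectivity.
    ServerAliveCountMax 3
```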

The server side has an equivalent option, ClientAliveInterval, that you can set in its sshd_config file instead, for the same purpose. This has the added benefit of not keeping around ghost ssh client connections that are seen as still connected for some time after the client side has lost connectivity.
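For the server-side variant, a sketch of the corresponding sshd_config lines (the 115/3 values mirror the client-side choice above and are assumptions, not the sshd defaults; sshd must be reloaded for them to take effect):

```
# /etc/ssh/sshd_config
# Probe idle clients every 115 s via the SSH protocol.
ClientAliveInterval 115
# Drop a client after 3 consecutive unanswered probes.
ClientAliveCountMax 3
```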
