Many thanks to the people who submitted ideas in the comments. I went through them all:
Recording packets with tcpdump and comparing the contents in Wireshark
# tcpdump -i wlan0 -w good.ssh & \
cat signature | ssh -o "ProxyCommand nc %h %p" \
root@192.168.1.150 'cat | md5sum' ; \
killall tcpdump
# tcpdump -i wlan0 -w bad.ssh & \
cat signature | ssh root@192.168.1.150 'cat | md5sum' ; \
killall tcpdump
There was no difference of any importance in the recorded packets.
Checking for traffic shaping
I had no idea about this - but after reading the "tc" manpage, I was able to verify that:
tc filter show
returns nothing
tc class show
returns nothing
tc qdisc show
...returns these:
qdisc noqueue 0: dev lo root refcnt 2
qdisc noqueue 0: dev docker0 root refcnt 2
qdisc fq_codel 0: dev wlan0 root refcnt 2 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 32Mb ecn
...which don't seem to differentiate between "ssh" and "nc" - in fact, I am not even sure if traffic shaping can operate on the process level (I'd expect it to work on addresses/ports/Differentiated Services field in IP Header).
Debian Chroot, to avoid potential "cleverness" in Arch Linux SSH client
Nope, same results.
Finally - Nagle
Performing an strace in the sender...
pv data | strace -T -ttt -f ssh 192.168.1.150 'cat | md5sum' 2>bad.log
...and looking at what exactly happens on the socket that transmits the data across, I noticed this "setup" before the actual transmitting starts:
1522665534.007805 getsockopt(3, SOL_TCP, TCP_NODELAY, [0], [4]) = 0 <0.000025>
1522665534.007899 setsockopt(3, SOL_TCP, TCP_NODELAY, [1], 4) = 0 <0.000021>
This sets the TCP_NODELAY option on the SSH socket, which disables Nagle's algorithm. You can Google and read all about it - but what it means is that SSH is giving priority to responsiveness over bandwidth: it instructs the kernel to transmit anything written on this socket immediately, instead of holding back small writes until the remote has acknowledged the data already in flight.
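As a minimal sketch of what those two calls in the strace do, the helpers below read and set the same option on an arbitrary TCP socket. The names get_nodelay and set_nodelay are made up for this illustration and are not part of SSH; the sketch also assumes Linux, where SOL_TCP (as printed by strace) and IPPROTO_TCP are the same value.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: returns the current TCP_NODELAY setting of fd
 * (0 = Nagle active, 1 = Nagle disabled), or -1 on error. */
int get_nodelay(int fd)
{
    int value = 0;
    socklen_t len = sizeof(value);

    if (getsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &value, &len) != 0)
        return -1;
    return value != 0;
}

/* Hypothetical helper: enabled = 1 disables Nagle, enabled = 0 restores it.
 * Returns 0 on success, -1 on error (same convention as setsockopt). */
int set_nodelay(int fd, int enabled)
{
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &enabled, sizeof(enabled));
}
```

Calling get_nodelay on a freshly created TCP socket should report 0, matching the `[0]` in the getsockopt line of the strace; after set_nodelay(fd, 1) it should report 1.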
What this means, in plain terms, is that in its default configuration, SSH is NOT a good way to transport data across - not when the link used is a slow one (which is the case for many WiFi links). If we are sending packets over the air that are "mostly headers", the bandwidth is wasted!
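To put a rough number on "mostly headers", here is a back-of-the-envelope sketch. It assumes the minimum header sizes only (20 bytes of IPv4 plus 20 bytes of TCP, no options) and ignores link-layer and WiFi framing overhead, so real efficiency will be lower still; payload_fraction is a name invented for this example.

```c
#include <stddef.h>

/* Assumption: minimum IPv4 header (20 bytes) + minimum TCP header (20 bytes),
 * no options, no link-layer framing. */
#define ASSUMED_HEADER_BYTES 40

/* Fraction of the bytes in one packet that is actual payload. */
double payload_fraction(size_t payload_bytes)
{
    if (payload_bytes == 0)
        return 0.0;
    return (double)payload_bytes /
           (double)(payload_bytes + ASSUMED_HEADER_BYTES);
}
```

Under these assumptions a packet carrying only 40 bytes of payload is half headers, while a packet carrying 1460 bytes is more than 97% payload.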
To prove that this was indeed the culprit, I used LD_PRELOAD to "drop" this specific syscall:
$ cat force_nagle.c
#include <stdio.h>
#include <dlfcn.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Pointer to the real setsockopt, looked up lazily on first use. */
int (*osetsockopt) (int socket, int level, int option_name,
    const void *option_value, socklen_t option_len) = NULL;

int setsockopt(int socket, int level, int option_name,
    const void *option_value, socklen_t option_len)
{
    if (!osetsockopt) {
        osetsockopt = dlsym(RTLD_NEXT, "setsockopt");
    }
    /* Option numbers are only unique per level, so check the level too;
       otherwise other options with the same number would be swallowed. */
    if (level == IPPROTO_TCP && option_name == TCP_NODELAY) {
        puts("No, Mr Nagle stays.");
        return 0;
    }
    return osetsockopt(socket, level, option_name, option_value, option_len);
}
$ gcc -fPIC -D_GNU_SOURCE -shared -o force_nagle.so force_nagle.c -ldl
$ pv /dev/shm/data | LD_PRELOAD=./force_nagle.so ssh root@192.168.1.150 'cat >/dev/null'
No, Mr Nagle stays.
No, Mr Nagle stays.
100MiB 0:00:29 [3.38MiB/s] [3.38MiB/s] [================================>] 100%
There - perfect speed (well, just as fast as iperf3).
Moral of the story
Never give up :-)
And if you do use tools like rsync or borgbackup that transport their data over SSH, and your link is a slow one, try stopping SSH from disabling Nagle (as shown above) - or using ProxyCommand to switch SSH to connect via nc. This can be automated in your $HOME/.ssh/config:
$ cat .ssh/config
...
Host orangepi
Hostname 192.168.1.150
User root
Port 22
# Compression no
# Cipher None
ProxyCommand nc %h %p
...
...so that all future uses of "orangepi" as a target host in ssh/rsync/borgbackup will henceforth use nc to connect (and therefore leave Nagle alone).
Best Answer
ss uses the AF_NETLINK socket layer to talk to the kernel. This is a lower level protocol but allows for data to be transferred very quickly and in large chunks. A quick strace on CentOS 7 shows it sets the transfer window to be 1Mb.
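For illustration, here is a minimal sketch of opening a netlink socket and asking for a large receive buffer. It assumes that the "transfer window" mentioned above refers to the socket receive buffer (SO_RCVBUF) and that 1 MiB is the value requested; both are my reading of the answer, not something taken from the ss source. The helper names are made up, and the kernel may cap or adjust the size it actually grants.

```c
#include <sys/socket.h>
#include <linux/netlink.h>
#include <unistd.h>

/* Hypothetical helper: opens a raw netlink socket for the given netlink
 * protocol (e.g. NETLINK_ROUTE or NETLINK_SOCK_DIAG) and requests a receive
 * buffer of rcvbuf_bytes. Returns the fd, or -1 on error. */
int open_netlink(int protocol, int rcvbuf_bytes)
{
    int fd = socket(AF_NETLINK, SOCK_RAW, protocol);
    if (fd < 0)
        return -1;

    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
                   &rcvbuf_bytes, sizeof(rcvbuf_bytes)) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Hypothetical helper: returns the receive buffer size the kernel actually
 * reports for fd, or -1 on error. */
int get_rcvbuf(int fd)
{
    int value = 0;
    socklen_t len = sizeof(value);

    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &value, &len) != 0)
        return -1;
    return value;
}
```

A caller would use something like open_netlink(NETLINK_SOCK_DIAG, 1024 * 1024) and then exchange request and reply messages over the returned descriptor; the message format itself is out of scope for this sketch.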