Debian SSH – Fix Jumphost Resetting First SSH MUX Connection Attempts

debiansshvmware

I have been using a Debian 9 SSH jumpbox host to run my scripts/ansible playbooks for a while. The jumbox talks with Debian 9 and some Debian 8 servers, mostly. Most of the servers are VMs running under VMWare Enterprise 5.5.

The SSH client in the jumbox is configured for doing SSH MUX, and the authentication is done by an RSA certificate file.

SSH has been working well for years now, however suddenly SSH connections started giving the error ssh_exchange_identification: read: Connection reset by peer at first try, several times a day, which obviously creates havoc with my scripts and scripts of our development team.

However, after the first try they are ok for a while. The servers misbehaving appear to be random at first, but there are some patterns/timeouts. If I do send a command to all of the servers, for instance, running in a command before the intended script/playbook, a few will fail, but the next script will run in all of them.

There havent been recent significant changes on the servers, except for security updates. The transition for Debian 9 has already some (significant) time.

I already found a MTU (mis)configuration or other that was once applied to several servers in a malfunction and forgotten, however that was not the case. I also diminished both on the client and server side the ControlPersist and ClientAliveInterval both to 1h, and that did not improve the situation.

So at the moment, I am at loss of why this is happening. I am however more inclined to a layer 7 issue than a network problem.

The SSH configuration on the client side /etc/ssh_config, Debian 9 is:

Host *
SendEnv LANG LC_*
HashKnownHosts yes
GSSAPIAuthentication yes
GSSAPIDelegateCredentials no
ControlMaster auto
ControlPath /tmp/ssh_mux_%h_%p_%r
ControlPersist 1h
Compression no
UseRoaming no

On SSH on the server side of several Debian servers:

Protocol 2
HostKey /etc/ssh/ssh_host_rsa_key
HostKey /etc/ssh/ssh_host_dsa_key
UsePrivilegeSeparation yes

SyslogFacility AUTH
LogLevel INFO
LoginGraceTime 120
PermitRootLogin forced-commands-only 
StrictModes yes
PubkeyAuthentication yes
IgnoreRhosts yes
HostbasedAuthentication no
PermitEmptyPasswords no
ChallengeResponseAuthentication no
PasswordAuthentication no

X11Forwarding no
X11DisplayOffset 10
PrintMotd no
PrintLastLog yes
TCPKeepAlive yes

AcceptEnv LANG LC_*
Subsystem sftp /usr/lib/openssh/sftp-server -l INFO
UsePAM yes
ClientAliveInterval 3600
ClientAliveCountMax 0
AddressFamily inet

SSH versions:

client –

$ssh -V 
OpenSSH_7.4p1 Debian-10+deb9u1, OpenSSL 1.0.2l  25 May 2017

server(s)

SSH-2.0-OpenSSH_7.4p1 Debian-10+deb9u1 (Debian 9)
SSH-2.0-OpenSSH_6.7p1 Debian-5+deb8u3  (Debian 8)

I have seen that error at least in situations with both servers with the 4.9.0-0.bpo.1-amd64 version.

Here follows the tcpdump of a server misbehaving, both machines being in the same network without any firewalls in the middle. I also monitor MAC addresses and there is not log of a new machine/MAC with the same MAC addresses in the last few years.

#tcpdump port 22
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
19:42:25.462896 IP jumbox.40270 > server.ssh: Flags [S], seq 3882361678, win 23200, options [mss 1160,sackOK,TS val 354223428 ecr 0,nop,wscale 7], length 0
19:42:25.463289 IP server.ssh > jumbox.40270: Flags [S.], seq 405921081, ack 3882361679, win 23200, options [mss 1160,nop,nop,sackOK,nop,wscale 7], length 0
19:42:25.463306 IP jumbox.40270 > server.ssh: Flags [.], ack 1, win 182, length 0
19:42:25.481470 IP server.ssh > jumbox.40270: Flags [S.], seq 4195986320, ack 3882361679, win 23200, options [mss 1160,nop,nop,sackOK,nop,wscale 7], length 0
19:42:25.481477 IP jumbox.40270 > server.ssh: Flags [.], ack 504902058, win 182, length 0
19:42:25.481490 IP server.ssh > jumbox.40270: Flags [R], seq 405921082, win 0, length 0
19:42:25.481494 IP server.ssh > jumbox.40270: Flags [P.], seq 504902058:504902097, ack 1, win 182, length 39
19:42:26.491536 IP server.ssh > jumbox.40270: Flags [S.], seq 4195986320, ack 3882361679, win 23200, options [mss 1160,nop,nop,sackOK,nop,wscale 7], length 0
19:42:26.491551 IP jumbox.40270 > server.ssh: Flags [R], seq 3882361679, win 0, length 0
19:42:28.507528 IP server.ssh > jumbox.40270: Flags [S.], seq 4195986320, ack 3882361679, win 23200, options [mss 1160,nop,nop,sackOK,nop,wscale 7], length 0
19:42:28.507552 IP jumbox.40270 > server.ssh: Flags [R], seq 3882361679, win 0, length 0
19:42:32.699540 IP server.ssh > jumbox.40270: Flags [S.], seq 4195986320, ack 3882361679, win 23200, options [mss 1160,nop,nop,sackOK,nop,wscale 7], length 0
19:42:32.699556 IP jumbox.40270 > server.ssh: Flags [R], seq 3882361679, win 0, length 0
19:42:40.891490 IP server.ssh > jumbox.40270: Flags [S.], seq 4195986320, ack 3882361679, win 23200, options [mss 1160,nop,nop,sackOK,nop,wscale 7], length 0
19:42:40.891514 IP jumbox.40270 > server.ssh: Flags [R], seq 3882361679, win 0, length 0
19:42:57.019511 IP server.ssh > jumbox.40270: Flags [S.], seq 4195986320, ack 3882361679, win 23200, options [mss 1160,nop,nop,sackOK,nop,wscale 7], length 0
19:42:57.019534 IP jumbox.40270 > server.ssh: Flags [R], seq 3882361679, win 0, length 0

An ssh -v server log of a failed connection, with the reset error:

OpenSSH_7.4p1 Debian-10+deb9u1, OpenSSL 1.0.2l  25 May 2017
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 19: Applying options for *
debug1: /etc/ssh/ssh_config line 59: Deprecated option "useroaming"
debug1: auto-mux: Trying existing master
debug1: Control socket "/tmp/ssh_mux_fenix-storage_22_rui" does not exist
debug1: Connecting to fenix-storage [10.10.32.156] port 22.
debug1: Connection established.
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_rsa type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_rsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_dsa type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_dsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_ecdsa type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_ecdsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_ed25519 type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_ed25519-cert type -1
debug1: Enabling compatibility mode for protocol 2.0
write: Connection reset by peer

An ssh -v server of a successful connection:

OpenSSH_7.4p1 Debian-10+deb9u1, OpenSSL 1.0.2l  25 May 2017
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 19: Applying options for *
debug1: /etc/ssh/ssh_config line 59: Deprecated option "useroaming"
debug1: auto-mux: Trying existing master
debug1: Control socket "/tmp/ssh_mux_sql01_22_rui" does not exist
debug1: Connecting to sql01 [10.20.10.88] port 22.
debug1: Connection established.
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_rsa type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_rsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_dsa type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_dsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_ecdsa type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_ecdsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_ed25519 type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_ed25519-cert type -1
debug1: Enabling compatibility mode for protocol 2.0
debug1: Local version string SSH-2.0-OpenSSH_7.4p1 Debian-10+deb9u1
debug1: Remote protocol version 2.0, remote software version OpenSSH_7.4p1 Debian-10+deb9u1
debug1: match: OpenSSH_7.4p1 Debian-10+deb9u1 pat OpenSSH* compat 0x04000000
debug1: Authenticating to sql01:22 as 'rui'
debug1: SSH2_MSG_KEXINIT sent
debug1: SSH2_MSG_KEXINIT received
debug1: kex: algorithm: curve25519-sha256
debug1: kex: host key algorithm: rsa-sha2-512
debug1: kex: server->client cipher: chacha20-poly1305@openssh.com MAC: <implicit> compression: none
debug1: kex: client->server cipher: chacha20-poly1305@openssh.com MAC: <implicit> compression: none
debug1: expecting SSH2_MSG_KEX_ECDH_REPLY
debug1: Server host key: ssh-rsa SHA256:6aJ+ipXRZJfbei5YbYtvqKXB01t1YO34O2ChdT/vk/4
debug1: Host 'sql01' is known and matches the RSA host key.
debug1: Found key in /home/rui/.ssh/known_hosts:315
debug1: rekey after 134217728 blocks
debug1: SSH2_MSG_NEWKEYS sent
debug1: expecting SSH2_MSG_NEWKEYS
debug1: SSH2_MSG_NEWKEYS received
debug1: rekey after 134217728 blocks
debug1: SSH2_MSG_EXT_INFO received
debug1: kex_input_ext_info: server-sig-algs=<ssh-ed25519,ssh-rsa,ssh-dss,ecdsa-sha2-nistp256,ecdsa-sha2-nistp384,ecdsa-sha2-nistp521>
debug1: SSH2_MSG_SERVICE_ACCEPT received
debug1: Authentications that can continue: publickey
debug1: Next authentication method: publickey
debug1: Offering RSA public key: /home/rui/.ssh/id_rsa
debug1: Server accepts key: pkalg ssh-rsa blen 277
debug1: Authentication succeeded (publickey).
Authenticated to sql01 ([10.20.10.88]:22).
debug1: setting up multiplex master socket
debug1: channel 0: new [/tmp/ssh_mux_sql01_22_rui]
debug1: control_persist_detach: backgrounding master process
debug1: forking to background
debug1: Entering interactive session.
debug1: pledge: id
debug1: multiplexing control connection
debug1: channel 1: new [mux-control]
debug1: channel 2: new [client-session]
debug1: client_input_global_request: rtype hostkeys-00@openssh.com want_reply 0
debug1: Sending environment.
debug1: Sending env LC_ALL = en_US.utf8
debug1: Sending env LANG = en_US.UTF-8
debug1: mux_client_request_session: master session id: 2

Interestingly enough, the behaviour can be reproduced with a telnet command:

$ telnet remote-server 22
Trying x.x.x.x...
Connected to remote-server
Escape character is '^]'.
Connection closed by foreign host.
$ telnet remote-server 22
Trying x.x.x.x...
Connected to remote-server
Escape character is '^]'.
SSH-2.0-OpenSSH_7.4p1 Debian-10+deb9u1

Protocol mismatch.
Connection closed by foreign host.

UPDATE:

Forced Protocol 2 in the /etc/ssh_client client configuration in the jumpbox. No change.

UPDATE2:

Changed the old key encrypted with DES-EDE3-CBC for a new key encrypted with AES-128-CBC. Again no visible change.

UPDATE3:

Interestingly enough, while the mux is active, the situation does not presents itself.

UPDATE4:

I also have found a similar question at serverfault, however without a chosen answer: https://serverfault.com/questions/445045/ssh-connection-error-ssh-exchange-identification-read-connection-reset-by-pe

Tried regenerating the ssh host keys, and the suggestion of sshd: ALL without success.

UPDATE 5

Opened a console on the VM on the destination and saw something 'strange'.
tcpdump whereas 1.1.1.1 is the jumpbox.

# tcpdump -n -vvv "host 1.1.1.1"
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
11:47:45.808273 IP (tos 0x0, ttl 64, id 38171, offset 0, flags [DF], proto TCP (6), length 60)
    1.1.1.1.37924 > 1.1.1.2.22: Flags [S], cksum 0xfc1f (correct), seq 3260568985, win 29200, options [mss 1460,sackOK,TS val 407355522 ecr 0,nop,wscale 7], length 0
11:47:45.808318 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    1.1.1.2.22 > 1.1.1.1.37924: Flags [S.], cksum 0x5508 (incorrect -> 0x68a8), seq 2881609759, ack 3260568986, win 28960, options [mss 1460,sackOK,TS val 561702650 ecr 407355522,nop,wscale 7], length 0
11:47:45.808525 IP (tos 0x0, ttl 64, id 38172, offset 0, flags [DF], proto TCP (6), length 52)
    1.1.1.1.37924 > 1.1.1.2.22: Flags [.], cksum 0x07b0 (correct), seq 1, ack 1, win 229, options [nop,nop,TS val 407355522 ecr 561702650], length 0
11:47:45.808917 IP (tos 0x0, ttl 64, id 38173, offset 0, flags [DF], proto TCP (6), length 92)
    1.1.1.1.37924 > 1.1.1.2.22: Flags [P.], cksum 0x6de0 (correct), seq 1:41, ack 1, win 229, options [nop,nop,TS val 407355522 ecr 561702650], length 40
11:47:45.808930 IP (tos 0x0, ttl 64, id 1754, offset 0, flags [DF], proto TCP (6), length 52)
    1.1.1.2.22 > 1.1.1.1.37924: Flags [.], cksum 0x5500 (incorrect -> 0x0789), seq 1, ack 41, win 227, options [nop,nop,TS val 561702651 ecr 407355522], length 0
11:47:45.822178 IP (tos 0x0, ttl 64, id 1755, offset 0, flags [DF], proto TCP (6), length 91)
    1.1.1.2.22 > 1.1.1.1.37924: Flags [P.], cksum 0x5527 (incorrect -> 0x70c1), seq 1:40, ack 41, win 227, options [nop,nop,TS val 561702654 ecr 407355522], length 39
11:47:45.822645 IP (tos 0x0, ttl 64, id 21666, offset 0, flags [DF], proto TCP (6), length 40)
    1.1.1.1.37924 > 1.1.1.2.22: Flags [R], cksum 0xaeb1 (correct), seq 3260569026, win 0, length 0
11:47:50.919752 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 1.1.1.2 tell 1.1.1.1, length 46
11:47:50.919773 ARP, Ethernet (len 6), IPv4 (len 4), Reply 1.1.1.2 is-at 00:50:56:b9:3d:2b, length 28
11:47:50.948732 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 1.1.1.1 tell 1.1.1.2, length 28
11:47:50.948916 ARP, Ethernet (len 6), IPv4 (len 4), Reply 1.1.1.1 is-at 00:50:56:80:57:1a, length 46
^C
11 packets captured
11 packets received by filter
0 packets dropped by kernel

UPDATE 6

Due to the checkum error, I disabled the TCP/UDP checksum offloading to the NIC in the VM, however it did not improve the situation.

$sudo ethtool -K eth0 rx off
$sudo ethtool -K eth0 tx off

iface eth0 inet static
   address 1.1.1.2
   netmask 255.255.255.0
   network 1.1.1.0
   broadcast 1.11.1.255
   gateway 1.1.1.254
   post-up /sbin/ethtool -K $IFACE rx off
   post-up /sbin/ethtool -K $IFACE tx off

Understanding TCP Checksum Offloading (TCO) in a VMware Environment (2052904)

UPDATE 7

Disabled GSSAPIAuthentication in the ssh client in the jumpbox. Tested Enable Compression yes No change.

UPDATE 8

Testing filling up the checksum with iptables.

/sbin/iptables -A POSTROUTING -t mangle -p tcp -j CHECKSUM --checksum-fill

It did not improve the situation.

UPDATE 9:

Found an interesting test about limiting cyphers, will try it out. MTU problems does not seem the culprit as I am having problems in some cases with server and client in the same network.

For now tested in the client side "ssh -c aes256-ctr", and the symptoms do not improve.

The mysterious case of broken SSH client (“connection reset by peer”)

UPDATE 10

Added this to /etc/ssh/ssh_config. No changes.

Ciphers aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,aes128-cbc,3des-cbc

SSH issues: Read from socket failed: Connection reset by peer

UPDATE 11

Defined the ssh service in port 22 and port 2222. It did not work.

UPDATE 12

I suspect it being a regression bug present in OpenSSH 7.4 that was corrected with OpenSSH 7.5

Release notes from OpenSSH 7.5

  • sshd(8): Fix regression in OpenSSH 7.4 support for the
    server-sig-algs extension, where SHA2 RSA signature methods were
    not being correctly advertised. bz#2680

For using openSSH 7.5 in Debian 9/Stretch, I installed openssh-client and openssh-server from Debian testing/Buster.

No improvements on the situation.

UPDATE 13

Defined

Ciphers aes256-ctr
MACs hmac-sha1

Both at the client(s) and server side. No improvements.

UPDATE 14

Setup

UseDNS no
GSSAPIAuthentication no
GSSAPIKeyExchange no

No change.

UPDATE 15

/etc/ssh/sshd_config

Changed it to /etc/ssh/sshd_config:

TCPKeepAlive no

From How does tcp-keepalive work in ssh?

TCPKeepAlive operates on the TCP layer. It sends an empty TCP ACK
packet [from the SSH server to the client – Rui]. Firewalls can be configured to ignore these packets, so if you
go through a firewall that drops idle connections, these may not keep
the connection alive.

My guess is that TCPKeepAlive was configuring the server sending a packet that is being optimised/ignored in some layer down the stack bellow, and somewhat the remote SSH server believed it was still connected to the TCP mux client, while in fact the session was already teared down; thus the TCP reset(s) at first try.

So whilst some say that if you're using ClientAliveInterval, you can disable TCPKeepAlive, it seems to be more it you are using ClientAliveInterval you ought to disable TCPKeepAlive.

  • It is clearly this option; as for the explanation, they are mainly conjectures and will have to double check them/the source when and if I have got time.

TCPKeepAlive apparently also has spoofing issues, so it is recommended that it should be turned off.

Nevertheless, still with the problem.

Best Answer

Are you going through any FW or device attempting TCP optimisation? I've got the same experience over a network and it turned out to be a device doing TCP optimisation.

Related Question