Bash – Predicting the PID of a previously started SSH command

bash, shell-script, ssh, ssh-tunneling

This is the weirdest thing.

In a script, I start an SSH tunnel like so:

ssh -o StrictHostkeyChecking=no -fND 8080 foo@bar

This starts an ssh instance that goes into the background, and script execution continues. Next, I save its PID (so I can kill it later) using bash's $! variable. For this to work, I append & to the ssh command even though it already goes into the background by itself (otherwise $! is empty). For example, the following script:

#!/bin/bash

ssh -o StrictHostkeyChecking=no -fND 8080 foo@bar &
echo $!
pgrep -f "ssh -o StrictHostkeyChecking=no -fND 8080 foo@bar"

outputs

(some ssh output)
28062
28062

…the same PID twice, as expected. But when I execute this exact sequence of commands from the terminal, the PID reported by $! is wrong (in the sense that it is not the PID of the ssh instance). From the terminal:

$ ssh -o StrictHostkeyChecking=no -fND 8080 foo@bar &
[1] 28178
(some ssh output)
$ echo $!
28178
$ pgrep -f "ssh -o StrictHostkeyChecking=no -fND 8080 foo@bar"
28181

It's not always 3 numbers apart, either. I've also observed a difference of 1 or 2. But it is never the same PID, as I would have expected and as is indeed the case when this sequence of commands is run within a script.

  1. Can someone explain why this is happening? I thought it might be due to the initial ssh call actually forking another process, but then why does it work from within a script?

  2. This also made me doubt whether using $! in my script to get the ssh PID as described above will always work (though it has so far). Is this reliable? I felt it was "cleaner" than using pgrep.

Best Answer

The shell's $! variable only knows the PID of the process started by the shell itself. As you suspected, ssh with -f forks its own process so it can go to the background, so the overall process tree looks like this [1]:

shell
|
+--ssh<1> (pid is $!)
   |
   +--ssh<2> (pid is different)

ssh<1> exits very shortly after invocation; therefore, the value in $! is unlikely to be useful. It's ssh<2> which is carrying on the remote communication and doing your tunneling for you, and the only way to reliably get its PID is by examining the process table, as you're doing with pgrep [2]. The pgrep method is likely to be the correct one here.
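The launcher/worker split can be reproduced without a remote host. The following is a minimal sketch, not what ssh does internally: `fake_ssh_f` is a hypothetical stand-in for `ssh -f`, the background subshell plays the role of ssh<1>, and a plain `sleep` plays the role of ssh<2>. The temporary file exists only so the sketch can compare both PIDs.

```shell
#!/bin/bash
# Hypothetical stand-in for `ssh -f` (not part of ssh): do some startup
# work, start the long-lived worker, then let the launcher exit.
pidfile=$(mktemp)

fake_ssh_f() {
    sleep 0.2                       # pretend to authenticate
    sleep 30 >/dev/null 2>&1 &      # plays the role of ssh<2>
    echo "$!" > "$pidfile"          # record the worker's PID for comparison
}

fake_ssh_f &                        # launcher subshell plays the role of ssh<1>
launcher_pid=$!                     # this is all the calling shell ever learns
wait "$launcher_pid"                # returns once the launcher has exited

worker_pid=$(cat "$pidfile")
echo "launcher (\$!): $launcher_pid"
echo "worker:         $worker_pid"

# Check which of the two processes still exists.
if kill -0 "$launcher_pid" 2>/dev/null; then launcher_alive=yes; else launcher_alive=no; fi
if kill -0 "$worker_pid" 2>/dev/null; then worker_alive=yes; else worker_alive=no; fi
echo "launcher alive: $launcher_alive, worker alive: $worker_alive"

kill "$worker_pid"                  # clean up the stand-in worker
rm -f "$pidfile"
```

When run, the two PIDs should differ, and only the worker should still be alive after the launcher has been waited for, which mirrors why the value in $! stops being useful once ssh<1> has exited.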

As to why it works in the script but not interactively, this is probably a race condition. Because you put the first ssh in the background, the shell and ssh execute concurrently, and ssh does some moderately CPU-heavy cryptographic authentication and some network round trips before it forks. It's likely that the pgrep in your script simply runs before ssh<1> has forked itself into the background, so it finds ssh<1>, whose PID is the same as $!. Typed interactively, pgrep runs long after the fork and finds ssh<2>. To get around this, run pgrep later, either after a sleep call or only at the point where you actually need the PID.
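Instead of a fixed sleep, the lookup can be retried until a process other than the launcher shows up. This is a sketch under assumptions: `find_worker_pid` is a hypothetical helper name, the retry count and interval are arbitrary, and `sleep 123` stands in for the tunnel so the example runs without a remote host. It uses `pgrep -f -x` so that only a process whose full command line equals the given string is matched.

```shell
#!/bin/bash
# Hypothetical helper: poll the process table until a process whose full
# command line equals $1 appears with a PID other than $2 (the launcher).
# Gives up after $3 attempts (default 50, 0.1 s apart).
find_worker_pid() {
    local cmdline=$1 launcher_pid=$2 tries=${3:-50} pid
    while (( tries-- > 0 )); do
        for pid in $(pgrep -f -x "$cmdline"); do
            if [[ $pid != "$launcher_pid" ]]; then
                echo "$pid"
                return 0
            fi
        done
        sleep 0.1
    done
    return 1
}

# Stand-in for the tunnel: the launcher subshell waits a moment,
# starts the long-lived worker, and then exits.
( sleep 0.3; sleep 123 >/dev/null 2>&1 & ) &
launcher_pid=$!

worker_pid=$(find_worker_pid "sleep 123" "$launcher_pid") || exit 1
echo "worker PID: $worker_pid"

kill "$worker_pid"                  # clean up the stand-in worker
```

For the real tunnel, the first argument would be the exact ssh command line from the question and the second would be the $! captured right after starting it. Note that the argument is still treated as a regular expression by pgrep, so characters such as `.` match loosely.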

[1]: Technically, it might be more complicated than this, if ssh is using a classic double-fork to background. In that case, there would be another, ephemeral ssh process between the two.

[2]: Unless you're systemd and you're using cgroups or something to keep track of all your children. Which you aren't.
