Bash – Predicting the PID of previously started SSH command

bash, shell-script, ssh, ssh-tunneling

This is the weirdest thing.

In a script, I start an SSH tunnel like so:

ssh -o StrictHostkeyChecking=no -fND 8080 foo@bar

This starts an ssh instance that goes into the background, and script execution continues. Next, I save its PID (for killing it later) by using bash's $! variable. For this to work, I append & to the ssh command even though it already goes into the background by itself (otherwise $! doesn't contain anything). Thus, for example the following script:

#!/bin/bash

ssh -o StrictHostkeyChecking=no -fND 8080 foo@bar &
echo $!
pgrep -f "ssh -o StrictHostkeyChecking=no -fND 8080 foo@bar"

outputs

(some ssh output)
28062
28062

…two times the same PID, as expected. But, now, when I execute this exact sequence of commands from the terminal, the PID output by $! is wrong (in the sense that it is not the PID of the ssh instance). From the terminal:

$ ssh -o StrictHostkeyChecking=no -fND 8080 foo@bar &
[1] 28178
(some ssh output)
$ echo $!
28178
$ pgrep -f "ssh -o StrictHostkeyChecking=no -fND 8080 foo@bar"
28181

It's not always 3 numbers apart, either. I've also observed a difference of 1 or 2. But it is never the same PID, as I would have expected and as is indeed the case when this sequence of commands is run within a script.

Can someone explain why this is happening? I thought it might be due to the initial ssh call actually forking another process, but then why does it work from within a script?
This also made me doubt whether using $! in my script to get the ssh PID as described above will indeed always work (though it has so far). Is this indeed reliable? I felt it was "cleaner" than using pgrep…

Best Answer

The shell's $! variable only knows the pid of the process started by the shell. As you suspected, the ssh call using -f forks its own process so it can go to background, so the overall process tree looks like [1]:

shell
|
+--ssh<1> (pid is $!)
   |
   +--ssh<2> (pid is different)

ssh<1> exits very shortly after invocation; therefore, the value in $! is unlikely to be useful. It's ssh<2> which is carrying on the remote communication and doing your tunneling for you, and the only way to reliably get its PID is by examining the process table, as you're doing with pgrep [2]. The pgrep method is likely to be the correct one here.
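
The tree can be reproduced without SSH. The following is a minimal sketch in which a shell function stands in for ssh<1> and a sleep process stands in for ssh<2>; the names launcher and pidfile are illustrative and have nothing to do with ssh itself. It shows that $! names the short-lived parent, while the long-lived process has a different PID:

```bash
#!/bin/bash
# Stand-in for "ssh -f": the direct child forks a long-lived worker and exits at once.
pidfile=$(mktemp)

launcher() {
    sleep 30 >/dev/null 2>&1 &      # plays the role of ssh<2>
    echo "$!" > "$pidfile"           # record the worker's PID for the caller
}

launcher &                # runs in a subshell: plays the role of ssh<1>
launcher_pid=$!           # the only PID the calling shell learns about
wait "$launcher_pid"      # the launcher has exited once this returns

worker_pid=$(cat "$pidfile")
echo "launcher (\$!): $launcher_pid"
echo "worker:        $worker_pid"

# kill -0 sends no signal; it only checks whether the PID still exists.
if kill -0 "$launcher_pid" 2>/dev/null; then launcher_alive=yes; else launcher_alive=no; fi
if kill -0 "$worker_pid" 2>/dev/null; then worker_alive=yes; else worker_alive=no; fi
echo "launcher alive: $launcher_alive, worker alive: $worker_alive"

kill "$worker_pid"        # clean up the stand-in tunnel
rm -f "$pidfile"
```

The two PIDs differ, and by the time they are printed only the worker should still be running, which is the situation $! leaves you in after ssh -f.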

As to why it works in the script but not interactively, this is probably a race condition. Because you put the first ssh in the background, the shell and ssh are executing concurrently, and ssh does some moderately CPU-heavy cryptographic authentication and some network roundtrips. It's likely that the pgrep you run in the script is simply running before ssh<1> forks itself to go to background. To get around this, run pgrep later, either by a sleep call, or just by calling it only when you actually need the PID later on.
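
One way to act on this in a script, sketched under the assumption that pgrep is available and that no other process has the same command line: drop the trailing &, since with -f the ssh command itself returns only after it has forked into the background, and query the process table afterwards. The function names start_tunnel and stop_tunnel are hypothetical; foo@bar and port 8080 are the placeholders from the question.

```bash
#!/bin/bash
# Placeholders taken from the question; adjust to the real tunnel.
tunnel_args=(-o StrictHostkeyChecking=no -fND 8080 foo@bar)

start_tunnel() {
    # No trailing "&": ssh -f returns once it has put itself in the background.
    ssh "${tunnel_args[@]}" || return 1
    # Look up the surviving process by its full command line; -n picks the newest match.
    tunnel_pid=$(pgrep -n -f "ssh ${tunnel_args[*]}")
}

stop_tunnel() {
    # Only signal if a PID was actually found.
    [ -n "$tunnel_pid" ] && kill "$tunnel_pid"
}
```

A script would call start_tunnel, do its work, and call stop_tunnel at the end.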

[1]: Technically, it might be more complicated than this, if ssh is using a classic double-fork to background. In that case, there would be another, ephemeral ssh process between the two.

[2]: Unless you're systemd and you're using cgroups or something to keep track of all your children. Which you aren't.

Foreground processes and terminal access control

To understand what is going on, you need to know a little about sharing terminals. What happens when two programs try to read from the same terminal at the same time? Each input byte goes randomly to one of the programs. (Not random as in the kernel uses an RNG to decide, just random as in unpredictable in practice.) The same thing happens when two programs read from a pipe, or any other file type which is a stream of bytes being moved from one place to another (socket, character device, …), rather than a byte array where any byte can be read multiple times (regular file, block device). For example, run a shell in a terminal, figure out the name of the terminal and run cat.

$ tty
/dev/pts/18
$ cat

Then from another terminal, run cat /dev/pts/18. Now type in the terminal, and watch as lines sometimes go to one of the cat processes and sometimes to the other. Lines are dispatched as a whole when the terminal is in cooked mode. If you put the terminal in raw mode then each byte would be dispatched independently.

That's messy. Surely there should be a mechanism to decide that one program gets the terminal, and the others don't. Well, there is! It triggers in typical cases, but not in the scenario I set up above. That scenario is unusual because cat /dev/pts/18 wasn't started from /dev/pts/18. It's unusual to access a terminal from a program that wasn't started inside this terminal. In the usual case, you run a shell in a terminal, and you run programs from that shell. Then the rule is that the program in the foreground gets the terminal, and programs in the background don't. This is known as terminal access control. The way it works is:

Each process has a controlling terminal (or doesn't have one, typically because it doesn't have any open file descriptor that's a terminal).
When a process tries to access its controlling terminal, if the process is not in the foreground, then the kernel blocks it. (Conditions apply. Access to other terminals is not regulated.)
The shell decides who is the foreground process. (Foreground process group, actually.) It calls tcsetpgrp to let the kernel know who should be in the foreground.

This works in typical cases. Run a program in a shell, and that program gets to be the foreground process. Run a program in the background (with &), and the program doesn't get to be in the foreground. When the shell is displaying a prompt, the shell puts itself in the foreground. When you resume a suspended job with fg, the job gets to be in the foreground. With bg, it doesn't.

If a background process tries to read from the terminal, the kernel sends it a SIGTTIN signal. The default action of the signal is to suspend the process (like SIGSTOP). The parent of the process can know about this by calling waitpid with the WUNTRACED flag; when a child process receives a signal that suspends it, the waitpid call in the parent returns and lets the parent know what the signal was. This is how the shell knows to print “Stopped (tty input)”. What it's telling you is that this job is suspended due to a SIGTTIN.

Since the process is suspended, nothing will happen to it until it's resumed or killed (with a signal that the process doesn't catch, because if the process has set a signal handler, it won't run since the process is suspended). You can resume the process by sending it a SIGCONT, but that won't achieve anything if the process is reading from the terminal: it'll receive another SIGTTIN immediately. If you resume the process with fg, it goes to the foreground and so the read succeeds.

Now you understand what happens when you run cat in the background:

$ cat &
$ 
[1] + Stopped (tty input)        cat
$

The case of SSH

Now let's do the same thing with SSH.

$ ssh localhost sleep 999999 &
$ 
$ 
$ 
[1] + Stopped (tty input)        ssh localhost sleep 999999
$

Pressing Enter sometimes goes to the shell (which is in the foreground), and sometimes to the SSH process (at which point it gets stopped by SIGTTIN). Why? If ssh was reading from the terminal, it should receive SIGTTIN immediately, and if it wasn't then why does it receive SIGTTIN?

What's happening is that the SSH process calls the select system call to know when input is available on any of the files it's interested in (or if an output file is ready to receive more data). The input sources include at least the terminal and the network socket. Unlike read, select is not forbidden to background processes, and ssh doesn't receive a SIGTTIN when it calls select. The intent of select is to find out whether data is available, without disrupting anything. Ideally select would not change the system state at all, but in fact this isn't completely true. When select tells the SSH process that input is available on the terminal file descriptor, the kernel has to commit to sending input if the process calls read afterwards. (If it didn't, and the process called read, then there might be no input available at this point, so the return value from select would have been a lie.) So if the kernel decides to route some input to the SSH process, it decides by the time the select system call returns. Then SSH calls read, and at that point the kernel sees that a background process tried to read from the terminal and suspends it with SIGTTIN.

Note that you don't need to launch multiple connections to the same server. One is enough. Multiple connections merely increase the probability that the problem arises.

The solution: don't read from the terminal

If you need the SSH session to read from the terminal, run it in the foreground.

If you don't need the SSH session to read from the terminal, make sure that its input is not coming from the terminal. There are two ways to do this:

You can redirect the input:
```
ssh … </dev/null
```
You can instruct SSH not to forward a terminal connection with -n or -f. (-n is equivalent to </dev/null; -f allows SSH itself to read from the terminal, e.g. to read a password, but the command itself won't have the terminal open.)
```
ssh -n …
```

Note that the disconnection between the terminal and SSH has to happen on the client. The sleep process running on the server will never read from the terminal, but SSH has no way to know that. If the client receives input on standard input, it must forward it to the server, which will make the data available in a buffer in case the application ever decides to read it (and if the application calls select, it'll be informed that data is available).
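
The effect of the redirection can be checked with stand-ins for ssh. This is a small sketch, with report_stdin as a hypothetical helper: it tests whether standard input is a terminal, then runs cat in the background with its input taken from /dev/null, where it sees end-of-file and exits instead of being stopped.

```bash
#!/bin/bash
# Report whether file descriptor 0 (standard input) is attached to a terminal.
report_stdin() {
    if [ -t 0 ]; then echo "terminal"; else echo "not a terminal"; fi
}

stdin_kind=$(report_stdin </dev/null)
echo "stdin with </dev/null: $stdin_kind"

# A background reader whose input is /dev/null never touches the terminal,
# so it cannot be stopped by SIGTTIN; it reads end-of-file and exits.
cat </dev/null &
wait "$!"
cat_status=$?
echo "background cat exit status: $cat_status"
```

Applied to the earlier example, the same idea would be ssh -n localhost sleep 999999 &, which keeps the background client away from the terminal.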
