...in shell (I am only interested to Bash) scripts I cannot be 100% sure that a PID I stored in a variable still refers to the background process I started, since it might be reused by the kernel for any other process...
Correct.
The way shells are programmed, as soon as a child process dies, the shell will call wait() on it right away (storing the termination status as part of its internal state), which will free the PID for reuse by another process.
is it true in shell scripts that "If you're still in the parent process that started the child process [...] You're guaranteed the PID is your child process (dead or alive)"?
No, it's not true.
Because, as mentioned earlier (and in the quote), the shell itself will reap child processes right away, which basically destroys that guarantee.
the naive way of starting in background a process in a script, storing its PID in a variable, do some stuff and then use the PID in combination with wait to get its exit code or with kill to send signals might fail due to reusage by the kernel of the PID.
Using a shell, that's the best you can do.
Note that using wait is not really a problem, only using kill is, since it's possible that your child process has already died and its PID has been reused, in which case you'd be killing a different process.
wait itself is implemented in the shell. When the shell reaps child processes, it stores their termination status in memory, so it can implement its wait built-in using that information (as well as waiting for child processes which are still running).
Also note that the kernel will typically try hard to avoid reusing PIDs, or at least to delay reusing them, exactly because in some cases there is no guarantee that a PID hasn't been reused. That way the kernel minimizes the chance of a signal being delivered to the wrong process.
Is there a general recipe?
For reliability?
Yes, implement the code launching a background process in C. Or Python, Perl, Ruby, etc. Not in shell.
Those will not have this problem, since unlike the shell they won't reap children by default. You'll have to do it explicitly there.
Or consider launching background processes using a system manager (such as systemd).
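To illustrate the explicit reaping mentioned above, here's a minimal Python sketch (POSIX only). The function name hold_then_reap and the exit code 7 are made up for the example. It forks a child that exits at once, checks that the PID is still held by the dead but unreaped child, and only then reaps it.

```python
import os
import time


def hold_then_reap(exit_code=7):
    """Fork a child, let it die, and reap it only when we choose to."""
    pid = os.fork()
    if pid == 0:
        os._exit(exit_code)  # child: exit immediately

    time.sleep(0.2)  # the child has almost certainly exited by now

    # Signal 0 only checks that the PID exists. It succeeds here because the
    # unreaped child (a zombie) still holds the PID, so it cannot be reused.
    os.kill(pid, 0)

    # Only this explicit call releases the PID back to the kernel.
    reaped_pid, status = os.waitpid(pid, 0)
    return reaped_pid == pid, os.WEXITSTATUS(status)


if __name__ == "__main__":
    print(hold_then_reap())
```

Between the fork and the os.waitpid() call, sending a real signal to that PID could only ever reach your own child, which is the guarantee the shell gives up by reaping eagerly.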
"to have background (process) store the return code in a file and have parent fetch it from file". Is this the reliable way to go?
Maybe.
You have fewer guarantees that there hasn't been interference there. It's hard to have a location where only that single process can write and no others.
The same isn't true with the wait call, the kernel ensures it can't be faked by a different process.
Furthermore, the wait call can also tell you about the process being killed or even crashing, in which case you would probably get incomplete information if you depended on the process itself to record its return status in a file...
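As a rough illustration of that point, here's a short Python sketch (POSIX only, the function name run_and_kill is made up for the example). The child is killed with SIGKILL, so it never gets the chance to write anything anywhere, yet the parent still learns what happened from the wait call.

```python
import signal
import subprocess
import sys


def run_and_kill():
    """Start a child, kill it, and return what waiting on it reports."""
    # The child just sleeps, so it would never record a status by itself.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    child.send_signal(signal.SIGKILL)  # SIGKILL cannot be caught or handled
    child.wait()                       # reap the child explicitly
    # subprocess reports death by signal N as a return code of -N.
    return child.returncode


if __name__ == "__main__":
    print(run_and_kill())
```

A status file written by the child would simply be missing in this case, and the parent would have to guess why.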
Also, the main issue with PID reuse is killing that PID, there's really no issue with getting the return code through wait, and the problem with kill isn't really addressed by using a file to store the return code.
Are there caveats about the usage of wait -n?
Not really. wait is reliable and AFAICT isn't affected by PID reuse, since when the shell reaps a child it will keep that information, including the PID that was being used and the return code, as part of its internal state.
When you call wait, you'll get information from that table.
I think there might be one potential issue if the PID is reused by a new background child of that same shell before wait has been called on the first instance, since then there will be a clash in that table and you'll end up with two separate background processes recorded under the same PID. That's a corner case and I imagine it's very rare, but potentially real. Not really sure what the shell would do in those cases... It also probably depends on the implementation of the shell and might vary between versions.
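To make the idea of that table concrete, here's a toy model in Python (POSIX only). This is not how Bash is actually written. The class FinishedChildren and its methods are invented for the example, and it assumes the children exit normally rather than being killed by a signal.

```python
import os


class FinishedChildren:
    """Toy model of a shell's record of reaped children (not Bash's real code)."""

    def __init__(self):
        self.running = set()  # PIDs forked but not yet reaped
        self.finished = {}    # PID -> exit status, kept after reaping

    def spawn(self, exit_code):
        pid = os.fork()
        if pid == 0:
            os._exit(exit_code)  # child: exit immediately
        self.running.add(pid)
        return pid

    def reap(self):
        """Reap every tracked child and remember its status, keyed by PID."""
        for pid in list(self.running):
            _, status = os.waitpid(pid, 0)
            self.running.discard(pid)
            # In this model, a later child that reused the PID would
            # overwrite this entry. A real shell may behave differently.
            self.finished[pid] = os.WEXITSTATUS(status)

    def wait(self, pid):
        """Model of the wait built-in: answer from the table, not the kernel."""
        return self.finished.pop(pid)
```

For example, after a = table.spawn(3) and table.reap(), the PID in a is free for the kernel to reuse, but table.wait(a) still returns 3 because the answer comes from the stored entry.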
As said, though, the real solution for this issue is to maintain the guarantees about PIDs being around, by using something other than a shell when those guarantees are important to you.
What is a direct child? Is it different from a child?
It's the same as a child.
It's a child you forked yourself.
For instance, if your child process forks a process and passes you the PID of that process, you no longer have the guarantee that it will stay around.
Since it's your child's job to reap that process, it's your child that has the guarantee that the PID won't be reused until it has reaped it. Not you.
Of course, a parent process could coordinate with a child to extend that guarantee, for instance by preventing it from reaping any children while it queries the child whether that PID is still the one it expects and then sending it a signal, or perhaps by asking the child (which has the guarantee) to send the signal on the parent's behalf.
Hopefully this will help clear it up.
Best Answer
Try: