Bash – When is a PID Exactly Freed Up?


Disclaimer: This question turned out much longer than expected, so I split it into 5 sub-questions. I really tried to sort out my thoughts before posting it, but too many aspects are confusing me at the moment.


While trying to understand how to handle processes in Bash robustly, I stumbled on this Greg's Wiki article. Near the beginning, it makes this statement:

If you're still in the parent process that started the child process
you want to do something with, that's perfect. You're guaranteed the
PID is your child process (dead or alive), for the reasons explained
below. You can use kill to signal it, terminate it, or just check
whether it's still running. You can use wait to wait for it to end or
to get its exit code if it has ended.

Towards the end of the page, the "reasons explained below" are given:

Each UNIX process also has a parent process. This parent process is
the process that started it, but can change to the init process if the
parent process ends before the new process does. (That is, init will
pick up orphaned processes.) Understanding this parent/child
relationship is vital because it is the key to reliable process
management in UNIX. A process's PID will NEVER be freed up for use
after the process dies UNTIL the parent process waits for the PID to
see whether it ended and retrieve its exit code. If the parent ends,
the process is returned to init, which does this for you.

This is important for one major reason: if the parent process manages
its child process, it can be absolutely certain that, even if the
child process dies, no other new process can accidentally recycle the
child process's PID until the parent process has waited for that PID
and noticed the child died. This gives the parent process the
guarantee that the PID it has for the child process will ALWAYS point
to that child process, whether it is alive or a "zombie". Nobody else
has that guarantee.

Unfortunately, this guarantee doesn't apply to shell scripts. Shells
aggressively reap their child processes and store the exit status in
memory, where it becomes available to your script upon calling wait.
But because the child has already been reaped before you call wait,
there is no zombie to hold the PID. The kernel is free to reuse that
PID, and your guarantee has been violated.


I have read the above paragraphs several times by now, but I am still not sure I correctly grasp the message behind them.

Question 1: From the second long quote, and in particular from its last paragraph, I would conclude that in shell scripts (I am only interested in Bash) I cannot be 100% sure that a PID I stored in a variable still refers to the background process I started, since it might be reused by the kernel for any other process (not even a child). Is this correct? Where in the system does the above-mentioned guarantee apply?

Question 2: The last paragraph of the second quote seems to contradict the first quote. In general, is it true in shell scripts that "If you're still in the parent process that started the child process […] You're guaranteed the PID is your child process (dead or alive)"?

Question 3: I tried to find other sources on the web about this topic and, as always, it is hard to distinguish truth from inaccurate statements. I got some confirmation but also some more doubts. Referring to this and this question, it seems that the naive way of starting a process in the background in a script, storing its PID in a variable, doing some stuff and then using the PID in combination with wait to get its exit code or with kill to send signals might fail because the kernel may reuse the PID. Is there a general recipe?

Question 4: I also found this comment that suggests "to have background (process) store the return code in a file and have parent fetch it from file". Is this the reliable way to go?

Question 5: Are there caveats about the usage of wait -n? I would think that if I do not explicitly give a (potentially reused) PID to wait, nothing wrong should happen. However, it seems that in Bash v4.4 the -n option of wait is useful with job control enabled (set -m). Is this still the case in Bash v5.0?

Bonus question: This answer says something similar to Greg's Wiki.

There is only one case in which you can safely use the pid to send
signals: when the target process is a direct child of the process that
will be sending the signal, and the parent has not yet waited on it.

What is a direct child? Is it different from a child?

Best Answer

...in shell scripts (I am only interested in Bash) I cannot be 100% sure that a PID I stored in a variable still refers to the background process I started, since it might be reused by the kernel for any other process...

Correct.

The way shells are programmed, as soon as a child process dies, the shell will call wait() on it right away (storing the termination status as part of its internal state), which will free the PID for reuse by another process.

is it true in shell scripts that "If you're still in the parent process that started the child process [...] You're guaranteed the PID is your child process (dead or alive)"?

No, it's not true.

Because, as mentioned earlier (and in the quote), the shell itself reaps its child processes right away, which destroys that guarantee.
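A small experiment can make this visible. The following is only a sketch: it assumes the child exits within one second and that its PID is not reused in the meantime. The script never calls wait, yet once the child has exited the PID is typically no longer signallable, which suggests there is no zombie left holding it.

```bash
#!/usr/bin/env bash
# Start a short-lived background child and remember its PID.
sh -c 'exit 3' &
pid=$!

# Give the child time to exit. Note that the script never calls wait here.
sleep 1

# kill -0 sends no signal; it only checks whether the PID can be signalled.
if kill -0 "$pid" 2>/dev/null; then
    pid_state="still present"
else
    pid_state="gone"
fi
echo "PID $pid is $pid_state"
```

In a C program that had not yet called wait(), the same check would be expected to succeed, because the zombie would still own the PID.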

the naive way of starting a process in the background in a script, storing its PID in a variable, doing some stuff and then using the PID in combination with wait to get its exit code or with kill to send signals might fail because the kernel may reuse the PID.

Using a shell, that's the best you can do.

Note that using wait is not really a problem, only using kill: your child process may already have died and its PID may have been reused, in which case you would be signalling a different process.

wait itself is implemented in the shell. When the shell reaps a child process, it stores the termination status in memory, so its wait built-in can answer from that information (as well as wait for child processes which are still running).
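As a minimal illustration of that stored status (a sketch, assuming the child finishes within the one-second sleep), wait still reports the exit code of a child that ended long before wait was called:

```bash
#!/usr/bin/env bash
# Background child that exits immediately with a recognizable status.
sh -c 'exit 7' &
pid=$!

# By the time this sleep returns, the child has exited and been reaped.
sleep 1

# wait answers from the shell's own record of the child's termination status.
wait "$pid"
status=$?
echo "exit status: $status"   # prints "exit status: 7"
```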

Also note that the kernel will typically try hard to avoid reusing PIDs, or at least to delay reuse, precisely because in some cases there is no guarantee that a PID hasn't been reused. On Linux, for example, PIDs are normally handed out sequentially and only wrap around after reaching the configured maximum (pid_max). This minimizes the chance that a signal is delivered to the wrong process.
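If you stay in Bash, one way to narrow (not close) the window for kill is to ask the shell whether it still considers the job running before signalling it. The helper name kill_if_running below is hypothetical, and the sketch assumes the job is a single command, so that $! matches the PID listed by jobs -p. A race between the check and the kill remains.

```bash
#!/usr/bin/env bash
# Hypothetical helper: signal PID $1 (signal $2, default TERM) only if this
# shell still lists it as a running job. This narrows the race window but
# does not remove it.
kill_if_running() {
    local pid=$1 sig=${2:-TERM} p
    for p in $(jobs -pr); do        # PIDs of jobs the shell believes are running
        if [ "$p" = "$pid" ]; then
            kill -s "$sig" "$pid"
            return
        fi
    done
    return 1                         # not a running job of this shell
}

sleep 30 &
pid=$!
kill_if_running "$pid" && echo "signalled $pid"
wait "$pid"
status=$?
echo "wait status: $status"
```

Because the shell marks a job as finished as soon as it reaps the child, a PID that has already been freed should normally be rejected by the helper rather than signalled.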

Is there a general recipe?

For reliability?

Yes, implement the code launching a background process in C. Or Python, Perl, Ruby, etc. Not in shell.

Those do not have this problem: unlike the shell, they do not reap children by default, so you have to do it explicitly.

Or consider launching background processes using a system manager (such as systemd.)

"to have background (process) store the return code in a file and have parent fetch it from file". Is this the reliable way to go?

Maybe.

You have fewer guarantees that nothing has interfered with the file. It's hard to have a location where only that single process can write and no others.

That is not the case with the wait call: the kernel ensures it can't be faked by a different process.

Furthermore, the wait call can also tell you about the process being killed or even crashing, in which case you would probably get incomplete information if you depended on the process itself to record its return status in a file...
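To illustrate the difference, here is a sketch in which the background process is supposed to record its own status in a file but is killed first (it sends SIGKILL to itself to simulate a crash). The file name and layout are assumptions made for this example only.

```bash
#!/usr/bin/env bash
status_dir=$(mktemp -d)
status_file=$status_dir/status

# The child is meant to write its status to the file, but it dies from
# SIGKILL before it gets there.
sh -c 'kill -KILL $$; echo 0 >"$1"' sh "$status_file" &
pid=$!

wait "$pid"
status=$?                             # 128 + signal number, here 137

if [ -e "$status_file" ]; then
    file_written=yes
else
    file_written=no
fi
signal_name=$(kill -l "$status")      # translate the status back to a signal name

echo "wait status: $status ($signal_name), file written: $file_written"
rm -rf "$status_dir"
```

The file never appears, while wait still reports that the process was terminated by a signal.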

Also, the main issue with PID reuse is killing that PID. There's really no issue with getting the return code through wait, and the problem with kill isn't addressed by using a file to store the return code.

Are there caveats about the usage of wait -n?

Not really. wait is reliable and AFAICT isn't affected by PID reuse, since when the shell reaps a child it will keep that information, including the PID that was being used and the return code, as part of its internal state.

When you call wait, you'll get information from that table.

I think there might be one potential issue if the PID is reused by a new background child of that same shell, before wait has been called on the first instance, since then there will be a clash in that table and you'll end up with two separate background processes with the same PID. That's a corner case and I imagine it's very very rare, but potentially real. Not really sure what the shell would do in those cases... It also probably depends on the implementation of the shell and might vary between versions.
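For reference, a basic use of wait -n looks like the sketch below. It assumes Bash 4.3 or later, and both children are still running when wait -n is first called. No PID is passed to wait at all, so a reused PID cannot be handed to it by mistake.

```bash
#!/usr/bin/env bash
# Two background children that finish at different times with different codes.
sh -c 'sleep 2; exit 2' &
sh -c 'sleep 1; exit 1' &

# Each wait -n returns when the next child finishes, with that child's status.
wait -n
first=$?
wait -n
second=$?

echo "first finished with $first, second finished with $second"
```

As far as I know, Bash 5.1 also added wait -p VARNAME to report which PID was collected; the sketch does not rely on it.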

As said, though, the real solution for this issue is to maintain the guarantees about PIDs being around, by using something other than a shell when those guarantees are important to you.

What is a direct child? Is it different from a child?

It's the same as a child.

It's a child you forked yourself.

For instance, if your child process forks a process and passes you the PID of that process, you no longer have the guarantee that it will stay around.

Since it's your child's job to reap that process, it's your child that has the guarantee that the PID won't be reused until it has reaped it. Not you.

Of course, a parent process could coordinate with a child to extend that guarantee, for instance by preventing it from reaping any children while it asks the child whether that PID is still the one it expects and then sends it a signal, or perhaps by asking the child (which has the guarantee) to send the signal on the parent's behalf.

Hopefully this will help clear it up.
