Suppose, for example, you have a shell script similar to:
```sh
longrunningthing &
p=$!
echo Killing longrunningthing on PID $p in 24 hours
sleep 86400
echo Time up!
kill $p
```
Should do the trick, shouldn't it? Except that the process may have terminated early and its PID may have been recycled, meaning some innocent job gets a bomb in its signal queue instead. In practice this possibly doesn't matter, but it's worrying me nonetheless. Hacking longrunningthing to drop dead by itself, or keeping/removing its PID on the filesystem, would do, but I'm thinking of the generic situation here.
Best Answer
Best would be to use the `timeout` command if you have it, which is meant for that.

The current (8.23) GNU implementation at least works by using `alarm()` or equivalent while waiting for the child process. It does not seem to guard against the SIGALRM being delivered in between `waitpid()` returning and `timeout` exiting (effectively cancelling that alarm). During that small window, `timeout` may even write messages on stderr (for instance if the child dumped a core), which would further enlarge that race window (indefinitely if stderr is a full pipe, for instance). I personally can live with that limitation (which probably will be fixed in a future version).
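For the script in the question, that collapses to a single line. A quick demo (with `sleep 100` standing in for `longrunningthing`, and a 2-second limit instead of the 86400 seconds):

```sh
# Real use: timeout 86400 longrunningthing
timeout 2 sleep 100
echo "exit status: $?"   # GNU timeout exits with 124 when the limit is hit
```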
`timeout` will also take extra care to report the correct exit status and handle other corner cases (like SIGALRM blocked/ignored on startup, handling other signals...) better than you'd probably manage to do by hand. As an approximation, you could write it in `perl`.

There's a `timelimit` command at http://devel.ringlet.net/sysutils/timelimit/ (it predates GNU `timeout` by a few months). That one uses an
`alarm()`-like mechanism, but installs a handler on SIGCHLD (ignoring stopped children) to detect the child dying. It also cancels the alarm before running `waitpid()` (that doesn't cancel the delivery of a SIGALRM if it was pending, but the way it's written, I can't see it being a problem), and it kills before calling `waitpid()` (so it can't kill a reused pid).

netpipes also has a
`timelimit` command. That one predates all the other ones by decades and takes yet another approach, but it doesn't work properly for stopped commands and returns a `1` exit status upon timeout.

As a more direct answer to your question, you could do something like the following.
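A sketch of that check (the answer's original snippet isn't reproduced here; `sleep 100` stands in for `longrunningthing`, 2 seconds for the 86400):

```sh
sleep 100 & p=$!
sleep 2                  # would be: sleep 86400
# ps -o ppid= prints the parent PID of $p; if $p had died and the pid
# been recycled, its parent would no longer be this shell ($$), so we
# refrain from killing it.
if [ "$(ps -o ppid= -p "$p")" -eq "$$" ] 2> /dev/null; then
  kill "$p"
fi
```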
That is, check that the process is still a child of ours. Again, there's a small race window (in between `ps` retrieving the status of that process and `kill` killing it) during which the process could die and its pid be reused by another process.

With some shells (`zsh`, `bash`, `mksh`), you can pass job specs instead of pids. That only works if you spawn only one background job (otherwise getting the right jobspec is not always reliably possible).
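Sketched with the same stand-ins as above (assuming `bash` here):

```sh
sleep 100 &    # the only background job, so its jobspec is %1
sleep 2        # would be: sleep 86400
# Kill by jobspec rather than by pid: once the shell has reaped the
# job, %1 no longer resolves, so a recycled pid can't be signalled.
kill %1
```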
If that's an issue, just start a new shell instance.
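For instance (a sketch using `bash`; in real use the inner script would be `longrunningthing & sleep 86400; kill %1`):

```sh
# The dedicated shell runs exactly one background job, so %1 inside
# it is unambiguous no matter what the outer script has spawned.
bash -c 'sleep 100 & sleep 2; kill %1'
```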
That works because the shell removes the job from the job table upon the child dying. Here, there should not be any race window, since by the time the shell calls `kill()`, either the SIGCHLD signal has not been handled and the pid can't be reused (since the child has not been waited for), or it has been handled and the job has been removed from the job table (and `kill` would report an error). `bash`'s `kill` at least blocks SIGCHLD before it accesses its job table to expand the `%`, and unblocks it after the `kill()`.
Another option to avoid having that `sleep` process hanging around even after `cmd` has died, with `bash` or `ksh93`, is to use a pipe with `read -t` instead of `sleep`. That one still has race conditions, and you lose the command's exit status. It also assumes `cmd` doesn't close its fd 4.

You could try implementing a race-free solution in
`perl` (though it would need to be improved to handle other types of corner cases).
Another race-free method could be using process groups.
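A sketch of that idea (assuming `bash`; `sleep 100` again stands in for the long-running command, 2 seconds for the 24 hours):

```sh
# Under set -m, the ( ... ) subshell gets its own process group; bash
# turns job control back off inside a subshell, so the timer stays in
# that same group. "kill 0" signals the caller's own process group:
# no pid is ever named, so a recycled pid can never be hit.
bash -c 'set -m; ( (sleep 2; kill 0) & exec sleep 100 )'
```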
However, note that using process groups can have side effects if there's I/O to a terminal device involved. It does have the additional benefit of killing all the other extra processes spawned by `cmd`.