Process – How to Kill a Process and Ensure the PID Hasn’t Been Reused


Suppose, for example, you have a shell script similar to:

longrunningthing &
p=$!
echo Killing longrunningthing on PID $p in 24 hours
sleep 86400
echo Time up!
kill $p

Should do the trick, shouldn't it? Except that the process may have terminated early and its PID may have been recycled, meaning some innocent job gets a bomb in its signal queue instead. In practice this probably doesn't matter, but it's worrying me nonetheless. Hacking longrunningthing to drop dead by itself, or to keep/remove its PID on the filesystem, would do, but I'm thinking of the generic situation here.

Best Answer

The best option is the timeout command, if you have it, since it is meant for exactly that:

timeout 86400 cmd

The current (8.23) GNU implementation at least works by using alarm() or equivalent while waiting for the child process. It does not seem to be guarding against the SIGALRM being delivered in between waitpid() returning and timeout exiting (effectively cancelling that alarm). During that small window, timeout may even write messages on stderr (for instance if the child dumped a core) which would further enlarge that race window (indefinitely if stderr is a full pipe for instance).

I personally can live with that limitation (which probably will be fixed in a future version). timeout also takes extra care to report the correct exit status, and it handles other corner cases (SIGALRM blocked or ignored on startup, other signals...) better than you'd probably manage to do by hand.
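To illustrate the exit status handling, here is a small sketch. It assumes GNU coreutils timeout, and the durations are shortened from 86400 seconds purely so it finishes quickly; the variable names are made up for the example.

```shell
# Assumption: GNU coreutils timeout. Short durations are for illustration only.
timeout 1 sleep 5
timed_out_status=$?    # 124 when timeout had to kill the command

timeout 5 sh -c 'exit 3'
early_status=$?        # the command's own exit status when it finishes in time

case $timed_out_status in
  124) echo "time limit reached" ;;
  *)   echo "command finished with status $timed_out_status" ;;
esac
```

Checking for 124 is how a caller can tell "killed at the deadline" apart from "finished on its own", provided the command itself never exits with 124.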

As an approximation, you could write it in perl like:

perl -MPOSIX -e '
  $p = fork();
  die "fork: $!\n" unless defined($p);
  if ($p) {
    $SIG{ALRM} = sub {
      kill "TERM", $p;
      exit 124;
    };
    alarm(86400);
    wait;
    exit (WIFSIGNALED($?) ? WTERMSIG($?)+128 : WEXITSTATUS($?))
  } else {exec @ARGV}' cmd

There's also a timelimit command at http://devel.ringlet.net/sysutils/timelimit/ (it predates GNU timeout by a few months).

 timelimit -t 86400 cmd

That one uses an alarm()-like mechanism but installs a handler on SIGCHLD (ignoring stopped children) to detect the child dying. It also cancels the alarm before running waitpid() (that doesn't cancel the delivery of SIGALRM if it was already pending, but the way it's written, I can't see it being a problem), and it kills before calling waitpid(), so it can't kill a reused pid.

netpipes also has a timelimit command. That one predates all the other ones by decades, takes yet another approach, but doesn't work properly for stopped commands and returns a 1 exit status upon timeout.

As a more direct answer to your question, you could do something like:

if [ "$(ps -o ppid= -p "$p")" -eq "$$" ]; then
  kill "$p"
fi

That is, check that the process is still a child of ours. Again, there's a small race window (in between ps retrieving the status of that process and kill killing it) during which the process could die and its pid be reused by another process.
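If you need that check in several places, it could be wrapped in a function. This is only a sketch: kill_if_still_child is a hypothetical name, it assumes a ps that supports the POSIX -o and -p options, and it has exactly the same race window as the snippet above. It also has to be called from the shell that started the process, since $$ does not change in subshells.

```shell
# Hypothetical helper: send SIGTERM to PID $1 only if ps still reports it
# as a child of this shell. Returns non-zero without signalling otherwise.
kill_if_still_child() {
  # ps pads the value with blanks on some systems, so strip them.
  ppid=$(ps -o ppid= -p "$1" 2>/dev/null | tr -d ' ')
  [ -n "$ppid" ] && [ "$ppid" -eq "$$" ] && kill "$1"
}
```

In the script from the question, the last line would then become kill_if_still_child "$p" instead of kill $p.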

With some shells (zsh, bash, mksh), you can pass job specs instead of pids.

cmd &
sleep 86400
kill %
wait "$!" # to retrieve the exit status

That only works if you spawn only one background job (otherwise there is not always a reliable way to get the right job spec).

If that's an issue, just start a new shell instance:

bash -c '"$@" & sleep 86400; kill %; wait "$!"' sh cmd

That works because the shell removes the job from the job table upon the child dying. Here, there should not be any race window since by the time the shell calls kill(), either the SIGCHLD signal has not been handled and the pid can't be reused (since it has not been waited for), or it has been handled and the job has been removed from the job table (and kill would report an error). bash's kill at least blocks SIGCHLD before it accesses its job table to expand the % and unblocks it after the kill().
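The one-liner above can be turned into a reusable function with the time limit as a parameter. This is a sketch under assumptions: run_with_limit is a hypothetical name, bash must be installed, and the error from kill is silenced for the case where the job has already finished.

```shell
# Hypothetical wrapper around the bash one-liner; $1 is the limit in seconds,
# the remaining arguments are the command to run.
run_with_limit() {
  limit=$1
  shift
  bash -c 'limit=$1; shift
           "$@" &                 # the only job in this fresh shell
           sleep "$limit"
           kill % 2>/dev/null     # job spec, not a pid
           wait "$!"              # retrieve the exit status
          ' sh "$limit" "$@"
}

run_with_limit 1 sleep 10
echo "status: $?"
```

With the short limit used here the status should be 143 (128 plus SIGTERM). Note that the function always takes the full limit to return, even if the command exits early, because the sleep is not interrupted.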

Another option to avoid having that sleep process hanging around even after cmd has died, with bash or ksh93 is to use a pipe with read -t instead of sleep:

{
  {
    cmd 4>&1 >&3 3>&- &
    printf '%d\n.' "$!"
  } | {
    read p
    read -t 86400 || kill "$p"
  }
} 3>&1

That one still has race conditions, and you lose the command's exit status. It also assumes cmd doesn't close its fd 4.

You could try implementing a race-free solution in perl like:

perl -MPOSIX -e '
   $p = fork();
   die "fork: $!\n" unless defined($p);
   if ($p) {
     $SIG{CHLD} = sub {
       $ss = POSIX::SigSet->new(SIGALRM); $oss = POSIX::SigSet->new;
       sigprocmask(SIG_BLOCK, $ss, $oss);
       waitpid($p,WNOHANG);
       exit (WIFSIGNALED($?) ? WTERMSIG($?)+128 : WEXITSTATUS($?))
           unless $? == -1;
       sigprocmask(SIG_SETMASK, $oss); # restore the mask saved above
     };
     $SIG{ALRM} = sub {
       kill "TERM", $p;
       exit 124;
     };
     alarm(86400);
     pause while 1;
   } else {exec @ARGV}' cmd args...

(though it would need to be improved to handle other types of corner cases).

Another race-free method could be using process groups:

set -m
( (sleep 86400; kill 0) & exec cmd)

However, note that using process groups can have side effects if I/O to a terminal device is involved. It has the additional benefit, though, of killing all the other processes spawned by cmd.
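As a sketch of how the process group variant might be packaged, the function below runs it in a separate bash so that set -m does not affect the calling shell. run_in_group is a hypothetical name, and bash is assumed because job control without a terminal behaves differently in other shells. Be careful when adapting it: if the subshell does not end up in its own process group, kill 0 signals the caller's group instead.

```shell
# Hypothetical wrapper; $1 is the limit in seconds, the rest is the command.
run_in_group() {
  bash -c 'limit=$1; shift
           set -m     # job control: the subshell below gets its own process group
           ( (sleep "$limit"; kill 0) & exec "$@" )
           exit "$?"  # pass on the status of the command
          ' sh "$@"
}

run_in_group 1 sleep 10
echo "status: $?"
```

Unlike the job spec version, this returns as soon as the command exits; the background sleep lingers until the limit and then signals only what is left of its own group.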