Process – How to Kill a Process and Ensure the PID Hasn’t Been Reused

Suppose, for example, you have a shell script similar to:

longrunningthing &
p=$!
echo Killing longrunningthing on PID $p in 24 hours
sleep 86400
echo Time up!
kill $p

Should do the trick, shouldn't it? Except that the process may have terminated early and its PID may have been recycled, meaning some innocent job gets a bomb in its signal queue instead. In practice this possibly doesn't matter, but it's worrying me nonetheless. Hacking longrunningthing to drop dead by itself, or to keep/remove its PID on the filesystem, would do, but I'm thinking of the generic situation here.

Best Answer

Best would be to use the timeout command, if you have it, since it is meant for exactly that:

timeout 86400 cmd

The current (8.23) GNU implementation at least works by using alarm() or equivalent while waiting for the child process. It does not seem to be guarding against the SIGALRM being delivered in between waitpid() returning and timeout exiting (effectively cancelling that alarm). During that small window, timeout may even write messages on stderr (for instance if the child dumped a core) which would further enlarge that race window (indefinitely if stderr is a full pipe for instance).

I personally can live with that limitation (which will probably be fixed in a future version). timeout also takes extra care to report the correct exit status, and handles other corner cases (SIGALRM blocked or ignored on startup, other signals...) better than you'd probably manage to do by hand.
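To illustrate the exit status reporting, here is a small sketch, assuming GNU timeout from coreutils. The 1-second limit and the sleep 5 stand-in for cmd are only there to keep the demo short; GNU timeout exits with status 124 when the limit is hit.

```shell
# Assumes GNU coreutils timeout; "sleep 5" is a stand-in for the real command.
status=0
timeout 1 sleep 5 || status=$?

if [ "$status" -eq 124 ]; then
  echo "timed out"
else
  echo "finished on its own with status $status"
fi
```

Any other status is the command's own exit status, so the caller can tell the two cases apart.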

As an approximation, you could write it in perl like:

perl -MPOSIX -e '
  $p = fork();
  die "fork: $!\n" unless defined($p);
  if ($p) {
    $SIG{ALRM} = sub {
      kill "TERM", $p;
      exit 124;
    };
    alarm(86400);
    wait;
    exit (WIFSIGNALED($?) ? WTERMSIG($?)+128 : WEXITSTATUS($?))
  } else {exec @ARGV}' cmd

There's a timelimit command at http://devel.ringlet.net/sysutils/timelimit/ (predates GNU timeout by a few months).

 timelimit -t 86400 cmd

That one uses an alarm()-like mechanism but installs a handler on SIGCHLD (ignoring stopped children) to detect the child dying. It also cancels the alarm before running waitpid() (that doesn't cancel the delivery of SIGALRM if it was pending, but the way it's written, I can't see it being a problem) and kills before calling waitpid() (so can't kill a reused pid).

netpipes also has a timelimit command. That one predates all the other ones by decades, takes yet another approach, but doesn't work properly for stopped commands and returns a 1 exit status upon timeout.

As a more direct answer to your question, you could do something like:

if [ "$(ps -o ppid= -p "$p")" -eq "$$" ]; then
  kill "$p"
fi

That is, check that the process is still a child of ours. Again, there's a small race window (in between ps retrieving the status of that process and kill killing it) during which the process could die and its pid be reused by another process.
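Wrapped in a function, that check could look like the sketch below. kill_if_child is a made-up name, sleep 30 stands in for longrunningthing, and it assumes a ps that supports -o ppid= -p (procps and the BSDs do). The race window between ps and kill is still there.

```shell
# Hypothetical helper: send SIGTERM to PID $1 only if it is still a child
# of this shell. The window between ps and kill remains.
kill_if_child() {
  ppid=$(ps -o ppid= -p "$1" | tr -d ' ')   # empty if the process is gone
  [ -n "$ppid" ] && [ "$ppid" -eq "$$" ] && kill "$1"
}

sleep 30 &            # stand-in for longrunningthing
p=$!
kill_if_child "$p"

status=0
wait "$p" || status=$?   # 128 + 15 (SIGTERM) if the kill went through
echo "child exited with status $status"
```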

With some shells (zsh, bash, mksh), you can pass job specs instead of pids.

cmd &
sleep 86400
kill %
wait "$!" # to retrieve the exit status

That only works if you spawn only one background job (otherwise getting the right jobspec is not always possible reliably).

If that's an issue, just start a new shell instance:

bash -c '"$@" & sleep 86400; kill %; wait "$!"' sh cmd

That works because the shell removes the job from the job table upon the child dying. Here, there should not be any race window since by the time the shell calls kill(), either the SIGCHLD signal has not been handled and the pid can't be reused (since it has not been waited for), or it has been handled and the job has been removed from the job table (and kill would report an error). bash's kill at least blocks SIGCHLD before it accesses its job table to expand the % and unblocks it after the kill().
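The same one-liner with the limit passed as a parameter might look like this. It's only a sketch: run_limited is a made-up name, it assumes bash is installed, and sleep 30 stands in for cmd.

```shell
# Hypothetical helper: run "$@" for at most $1 seconds in a private bash
# instance, so that "%" can only ever refer to that one background job.
run_limited() {
  limit=$1
  shift
  bash -c '
    limit=$1; shift
    "$@" &
    sleep "$limit"
    kill % 2>/dev/null     # fails harmlessly if the job is already gone
    wait "$!"              # exit status of the command
  ' sh "$limit" "$@"
}

status=0
run_limited 1 sleep 30 || status=$?   # "sleep 30" stands in for cmd
echo "status: $status"
```

Note that the inner sleep still runs for the full limit even if the command finishes early.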

Another option, which avoids having that sleep process hanging around even after cmd has died, is to use a pipe with read -t instead of sleep (with bash or ksh93):

{
  {
    cmd 4>&1 >&3 3>&- &
    printf '%d\n.' "$!"
  } | {
    read p
    read -t 86400 || kill "$p"
  }
} 3>&1

That one still has race conditions, and you lose the command's exit status. It also assumes cmd doesn't close its fd 4.

You could try implementing a race-free solution in perl like:

perl -MPOSIX -e '
   $p = fork();
   die "fork: $!\n" unless defined($p);
   if ($p) {
     $SIG{CHLD} = sub {
       $ss = POSIX::SigSet->new(SIGALRM); $oss = POSIX::SigSet->new;
       sigprocmask(SIG_BLOCK, $ss, $oss);
       waitpid($p,WNOHANG);
       exit (WIFSIGNALED($?) ? WTERMSIG($?)+128 : WEXITSTATUS($?))
           unless $? == -1;
       sigprocmask(SIG_SETMASK, $oss);
     };
     $SIG{ALRM} = sub {
       kill "TERM", $p;
       exit 124;
     };
     alarm(86400);
     pause while 1;
   } else {exec @ARGV}' cmd args...

(though it would need to be improved to handle other types of corner cases).

Another race-free method could be using process groups:

set -m
( (sleep 86400; kill 0) & exec cmd)

However, note that using process groups can have side effects if there's I/O to a terminal device involved. It has the additional benefit, though, of killing all the other processes spawned by cmd.

Related Solutions

Process – How to Kill Both Process and Subprocess

There is a standard method, if the programs cooperate. Run kill -- -42 where 42 is the pid of the parent process. This sends a signal to all the processes in the process group led by process 42 (the minus sign before the pid means process group).

Normally, if you run your python script from a shell prompt and it simply forks gnuchess, the two processes should remain in the same process group. But this doesn't seem to be the case here, since Ctrl+C sends SIGINT to the whole foreground process group and so would have reached gnuchess as well.

Gnuchess might be in its own process group because it made itself a session leader (but I don't know why it would do this), or because you've double-forked it (python forks a shell which forks gnuchess). A double fork is probably avoidable, but I can't tell you how without seeing your code.

A reasonably reliable and POSIX-compliant way of finding the pid of the gnuchess process is

gnuchess_pids=$(ps -A -o pid= -o comm= | awk '$2 ~ /(^|\/)gnuchess$/ {print $1}')

Specific unix variants may have better ways of achieving this, such as pgrep.
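With pgrep, a sketch could look like this. It assumes the procps or BSD pgrep, whose -x option asks for an exact name match and -P restricts the search to children of a given PID. Here sleep stands in for gnuchess and is started by the shell itself, which is why -P "$$" makes sense in the demo.

```shell
sleep 30 &                        # stand-in for gnuchess
bg=$!
sleep 1                           # give the child time to exec
pids=$(pgrep -x -P "$$" sleep)    # -x: exact name, -P: only children of this shell
echo "found: $pids"
kill $pids                        # unquoted on purpose: one argument per PID
```

Without -P, every process with that name on the system that you are allowed to signal would match, so restrict the search as much as you can.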

How to monitor or kill a process which has been started by cron

You can kill processes by name. For example, on Linux, *BSD and Solaris, pkill myprogram kills all the processes whose name contains myprogram (use pkill '^myprogram$' for an exact match). If you run it as a non-root user, only that user's processes will be killed, and there are further options to control matching (see the manual on your system for details).

If you want to specifically target processes started by the scheduler, and you're killing the processes manually, you can run ps f (Linux only) or pstree (Linux only) or ptree to display the processes in a tree, and see which processes were started by cron.

If you want to be able to kill these processes automatically with a homemade method, make them store their process ID in a file. This kind of file is called a pidfile when it's used to ensure that only a single instance of the process is running (which may or may not be something you want). If you want multiple instances, store the PIDs in separate files in a common directory; here's a shell snippet that does this:

pid_dir=/var/run/myprogram # must have been created e.g. at boot time
myprogram &
pid_file=$pid_dir/$!.pid
touch "$pid_file"
wait
rm "$pid_file"
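The killing side could then walk that directory. This is only a sketch: a temporary directory stands in for /var/run/myprogram and sleep 30 for myprogram. A stale file left behind by a crashed wrapper could name a PID that has since been reused, so it relies on the wrapper removing its file when the program exits.

```shell
pid_dir=$(mktemp -d)        # stand-in for /var/run/myprogram
sleep 30 &                  # stand-in for myprogram
p=$!
touch "$pid_dir/$p.pid"

# Kill every instance recorded in the directory.
for f in "$pid_dir"/*.pid; do
  [ -e "$f" ] || continue   # no pid files: the glob stayed unexpanded
  pid=${f##*/}              # strip the directory part
  pid=${pid%.pid}           # strip the .pid suffix
  kill "$pid" 2>/dev/null || echo "process $pid already gone"
  rm -f "$f"
done

status=0
wait "$p" || status=$?      # 128 + 15 (SIGTERM) if the kill went through
rmdir "$pid_dir"
```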

A better solution, if you have hard criteria to detect runaway processes, is to use a general monitoring program, or in simple cases just put a limit on how long the process is allowed to run. You may find these links helpful:

“Monit can start a process if it does not run, restart a process if it does not respond and stop a process if it uses too much resources.”
I need help with a cronjob to watch for runaway processes and kill them
How to limit resource usage to save CPU+RAM for a certain process?
Is there a way to limit the amount of memory a particular process can use in Unix?
