Get the exit code of processes forked from the master process

background-process job-control process

I have a master process (run-jobs below) that starts other jobs as its sub-processes. When the master process fails (e.g. because of a database failure), it exits with a non-zero exit status, which is good and can be verified by inspecting the $? variable (echo $?).

However, I'd also like to inspect the exit codes of the sub-processes in case the master job fails. Is there a convenient way to check the exit code of process_1 and process_2 below, once the master process is gone?

This is simplified output of ps auxf:

vagrant 5167 | \_ php app/console run-jobs
vagrant 5461 |     \_ php process_1
vagrant 5517 |     \_ php process_2

Best Answer

Processes report their exit status to their parent. If the parent is dead, they are reparented to the process with PID 1 (init), which collects the status instead. On Linux 3.4 or later, you can designate another ancestor as the child subreaper for that role (using prctl(PR_SET_CHILD_SUBREAPER)).

More precisely, after a process dies it remains a zombie until its parent (or init) retrieves its exit status (with waitpid() or a similar call).
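To make the zombie-then-reap sequence concrete, here is a minimal sketch in Python (chosen only for illustration; the function name reap_child is hypothetical and not part of the question's setup). The child exits at once, and its status stays parked in the zombie until the parent calls waitpid():

```python
import os
import time

def reap_child(exit_code):
    """Fork a child that exits with exit_code, then collect that status."""
    pid = os.fork()
    if pid == 0:
        # Child: terminate immediately with the requested status.
        os._exit(exit_code)

    # Parent: give the child time to exit. Until waitpid() is called,
    # the dead child stays in the process table as a zombie.
    time.sleep(0.1)

    # waitpid() retrieves the status and lets the kernel drop the zombie.
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return None  # the child was killed by a signal instead of exiting

if __name__ == "__main__":
    print(reap_child(12))
```

If the parent never calls waitpid() and then dies itself, the zombie is handed to init (or a subreaper), which is where the status ends up.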

In your case, you're saying the children die after (as a result of?) run-jobs dying. That means they'll report their exit status to init or to the process designated as child subreaper.

If init doesn't log that (and it generally doesn't) and if you don't use auditing or process accounting, that exit status will be lost.

On Linux 3.4 or later, you can create your own subreaper to get the PID and exit status of those orphaned processes. For example, with perl:

$ perl -MPOSIX -le '
  require "syscall.ph";
  # 36 is PR_SET_CHILD_SUBREAPER: orphaned descendants get reparented to us
  syscall(&SYS_prctl,36,1) >= 0 or die "cannot set subreaper: $!";

  # example running 1 child and 2 grand children:
  if (!fork) {
    # There, you would run:
    # exec("php", "run-jobs");
    # (a one-string exec containing ";" is run through sh -c)
    if (!fork) {exec "sleep 1; exit 12"};
    if (!fork) {exec "sleep 2; exit 123"};
    exit(88)
  }
  # now reporting on all children and grand-children:
  while (($pid = wait) > 0) {
   print "$pid: " . WEXITSTATUS($?)
  }'
22425: 88
22426: 12
22427: 123

If you wanted to retrieve information on the dying processes (like command line, user, ppid...), you'd need to do that while they're still in the zombie state, that is before you've done a wait() on them.

To do that you'd need to use the waitid() API with the WNOWAIT option (and then get the information from /proc or the ps command). I don't think perl has an interface to that though, so you'd need to write it in another language like C.
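As a rough illustration of the WNOWAIT approach, here is a sketch in Python rather than C, since Python's os module exposes waitid(). The function name peek_then_reap is hypothetical, and reading /proc/<pid>/status is a Linux-specific assumption. Note that on Linux a zombie's /proc/<pid>/cmdline typically reads as empty, so the sketch reads the status file instead. A real subreaper would wait on any child with os.P_ALL; the sketch waits on one known PID to stay self-contained:

```python
import os

def peek_then_reap(exit_code=7):
    """Inspect a dead child while it is still a zombie, then reap it."""
    pid = os.fork()
    if pid == 0:
        os._exit(exit_code)  # child: exit immediately

    # WNOWAIT: block until the child has exited, but leave it as a zombie.
    info = os.waitid(os.P_PID, pid, os.WEXITED | os.WNOWAIT)
    peeked_status = info.si_status

    # The zombie still has a /proc entry on Linux; collect a few fields.
    details = {}
    try:
        with open(f"/proc/{pid}/status") as f:
            for line in f:
                key, _, value = line.partition(":")
                if key in ("Name", "State", "PPid"):
                    details[key] = value.strip()
    except OSError:
        pass  # no /proc on this system: skip the extra details

    # A normal wait now removes the zombie for good.
    _, status = os.waitpid(pid, 0)
    return peeked_status, os.WEXITSTATUS(status), details

if __name__ == "__main__":
    print(peek_then_reap())
```

Both returned status values should agree, because the WNOWAIT call only looks at the status without consuming it.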

Related Solutions

Apache – Unkillable apache2 processes

First, check your RAM.

A process that doesn't respond to SIGKILL is a symptom of either a kernel bug or a hardware bug. When you haven't just changed your kernel, the most likely reason is that your RAM is failing, so check it. Kernel bugs can have subtle causes (such as using the wrong version of gcc) and manifest themselves subtly (such as working perfectly except that the X server wouldn't start — same true story). It's not very likely that your new kernel is buggy, if you're using the distribution-provided kernel that a lot of other users are using, but it could happen — possibly a rare bug triggered by a combination of drivers and activity patterns. Try another kernel.

There may also be a bug in Apache that causes it to crash, but if SIGKILL doesn't work, it's not Apache's fault.

Shell – What happens to background jobs after exiting the shell

When the shell exits, it might send the HUP signal to background jobs, and this might cause them to exit. The SIGHUP signal is only sent if the shell itself receives a SIGHUP, i.e. only if the terminal goes away (e.g. because the terminal emulator process dies) and not if you exit the shell normally (with the exit builtin or by typing Ctrl+D). See In which cases is SIGHUP not sent to a job when you log out? and Is there any UNIX variant on which a child process dies with its parent? for more details. In bash, you can set the huponexit option to also send SIGHUP to background jobs on a normal exit. In ksh, bash and zsh, calling disown on a job removes it from the list of jobs to send SIGHUP to. A process that receives SIGHUP may ignore or catch the signal, and then it won't die. Using nohup when you run a program makes it immune to SIGHUP.

If the process isn't killed due to a possible SIGHUP then it remains behind. There's nothing left to relate it to job numbers in the shell.

The process may still die if it tries to access the terminal but the terminal no longer exists. That depends how the program reacts to a non-existent terminal.

If the job contains multiple processes (e.g. a pipeline), then all these processes are in one process group. Process groups were invented precisely to capture the notion of a shell job that is made up of multiple related processes. You can see processes grouped by process group by displaying their process group ID (PGID — normally the process ID of the first process in the group), e.g. with ps l under Linux or something like ps -o pid,pgid,tty,etime,comm portably.

You can kill all the processes in a group by passing a negative argument to kill. For example, if you've determined that the PGID for the pipeline you want to kill is 1234, then you can kill it with

kill -TERM -1234
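The same group-wide kill can be done from a program. Below is a minimal Python sketch, assuming a POSIX system where sh and sleep are available. The function name kill_group_demo is hypothetical, and the pipeline is only a stand-in for a real job:

```python
import os
import signal
import subprocess

def kill_group_demo():
    """Start a pipeline in its own process group and signal the whole group."""
    # start_new_session=True makes the shell the leader of a new process
    # group, so the pipeline's processes all share its PGID.
    proc = subprocess.Popen(
        ["sh", "-c", "sleep 30 | sleep 30"],
        start_new_session=True,
    )
    pgid = os.getpgid(proc.pid)

    # Equivalent to: kill -TERM -<pgid>
    os.killpg(pgid, signal.SIGTERM)

    # A negative return code means the shell was terminated by that signal.
    return proc.wait()

if __name__ == "__main__":
    print(kill_group_demo())
```

os.killpg() takes the positive PGID; the negative-number form is specific to the kill command and to os.kill().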
