A few years ago, a coworker came up with an elegant solution for a watchdog program. The program ran on Windows and used Windows Event objects to monitor the process handles (PIDs) of several applications. If any one of the processes terminated unexpectedly, its process handle would no longer exist and his watchdog would immediately be signaled. The watchdog would then take an appropriate action to “heal” the system.
My question is: how would you implement such a watchdog on Linux? Is there a way for a single program to monitor the PIDs of many others?
Best Answer
The traditional, portable, commonly used way is for the parent process to watch over its children.
The basic primitives are the `wait` and `waitpid` system calls. When a child process dies, the parent process receives a `SIGCHLD` signal, telling it that it should call `wait` to find out which child died and what its exit status was. The parent process can instead leave `SIGCHLD` unhandled (by default the signal is simply discarded) and call `waitpid(-1, &status, WNOHANG)` at its convenience. Note that this is not the same as explicitly setting `SIGCHLD` to `SIG_IGN`: in that case the children are reaped automatically and `waitpid` has nothing left to report.
To monitor many processes, you would either spawn them all from the same parent, or invoke each of them through a simple monitoring process that just calls the desired program, waits for it to terminate and reports on the termination (in shell syntax: `myprogram; echo myprogram $? >>/var/run/monitor-collector-pipe`). If you're coming from the Windows world, note that having small programs that each do one specialized task is a common design in the Unix world; the OS is designed to make processes cheap.
There are many process monitoring (also called supervisor) programs that can report when a process dies, optionally restart it, and do far more besides: Monit, Supervise, Upstart, …
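As a rough illustration of the parent-watches-children approach, here is a minimal sketch of the polling variant, where the parent calls `waitpid(-1, ..., WNOHANG)` whenever it likes instead of handling `SIGCHLD`. It is written in Python only for brevity; `os.fork`, `os.execvp` and `os.waitpid` are thin wrappers around the system calls of the same names. The helper names (`spawn`, `decode_status`, `reap_exited`) and the demo commands are made up for this example, and a real watchdog would restart or otherwise "heal" the dead process where this one just prints.

```python
import os
import time


def spawn(argv):
    """Fork a child that execs argv; return the child's PID to the parent."""
    pid = os.fork()
    if pid == 0:
        try:
            os.execvp(argv[0], argv)
        finally:
            os._exit(127)  # only reached if exec itself failed
    return pid


def decode_status(status):
    """Return the exit code, or the negated signal number if killed by a signal."""
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    if os.WIFSIGNALED(status):
        return -os.WTERMSIG(status)
    return None


def reap_exited(children):
    """Collect every child that has already terminated, without blocking.

    `children` maps PID -> name and is updated in place.
    Returns a list of (name, decoded_status) tuples.
    """
    exited = []
    while children:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children left at all
        if pid == 0:
            break  # children exist, but none has terminated yet
        name = children.pop(pid, None)
        if name is not None:
            exited.append((name, decode_status(status)))
    return exited


if __name__ == "__main__":
    # Hypothetical workload: two short-lived commands standing in for
    # the applications being watched.
    watched = {
        spawn(["sleep", "1"]): "sleeper",
        spawn(["sh", "-c", "exit 3"]): "failing-job",
    }
    while watched:
        for name, code in reap_exited(watched):
            print(f"{name} terminated with status {code}")
        time.sleep(0.1)  # the watchdog is free to do other work here
```

The limitation is the one stated above: this only works for processes the watchdog started itself, since `waitpid` reports on the caller's own children and nothing else.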