Linux: Writing a watchdog to monitor multiple processes

linux, monitoring, process

A few years ago, a coworker came up with an elegant solution for a watchdog program. The program ran on Windows and used Windows Event objects to monitor the process handles (PIDs) of several applications. If any one of the processes terminated unexpectedly, its process handle would no longer exist and his watchdog would immediately be signaled. The watchdog would then take an appropriate action to “heal” the system.

My question is: how would you implement such a watchdog on Linux? Is there a way for a single program to monitor the PIDs of many others?

Best Answer

The traditional, portable, commonly used way is for the parent process to watch over its children.

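The basic primitives are the wait and waitpid system calls. When a child process dies, the parent process receives a SIGCHLD signal, which tells it to call wait to find out which child died and what its exit status was. Alternatively, the parent can leave SIGCHLD at its default disposition, in which the signal is simply discarded, and call waitpid(-1, &status, WNOHANG) at its convenience. Do not explicitly set SIGCHLD to SIG_IGN for this purpose: on Linux the kernel then reaps terminated children automatically and their exit statuses are lost.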
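As a rough illustration of the polling variant described above, here is a minimal Python sketch (assuming Python 3.9+ on a POSIX system). The names spawn, reap_exited and watch are hypothetical helpers made up for this example, and the parent simply records exit codes where a real watchdog would restart the process or raise an alert.

```python
import os
import sys
import time


def spawn(argv):
    """Start argv as a child process and return its PID."""
    return os.posix_spawnp(argv[0], argv, os.environ)


def reap_exited(children):
    """Collect every child that has already exited, without blocking.

    `children` maps PID -> name; reaped entries are removed from it.
    Returns a list of (name, exit_code) tuples.
    """
    exited = []
    while children:
        try:
            # -1 means "any child"; WNOHANG makes the call return immediately.
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children left at all
        if pid == 0:
            break  # children exist, but none has exited yet
        name = children.pop(pid, None)
        if name is not None:
            # Negative values mean the child was killed by that signal number.
            exited.append((name, os.waitstatus_to_exitcode(status)))
    return exited


def watch(commands, poll_interval=0.1):
    """Spawn every command, then poll until all of them have terminated.

    `commands` maps a name to an argv list. Returns {name: exit_code}.
    """
    children = {spawn(argv): name for name, argv in commands.items()}
    results = {}
    while children:
        for name, code in reap_exited(children):
            results[name] = code  # a real watchdog would restart or alert here
        if children:
            time.sleep(poll_interval)
    return results
```

Polling keeps the sketch short. A handler installed for SIGCHLD could call reap_exited instead, so the parent reacts as soon as a child dies.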

To monitor many processes, you would either spawn them all from the same parent, or invoke each of them through a simple monitoring process that just runs the desired program, waits for it to terminate and reports on the termination (in shell syntax: myprogram; echo myprogram $? >>/var/run/monitor-collector-pipe). If you're coming from the Windows world, note that having small programs that each do one specialized task is a common design in the Unix world; the OS is designed to make processes cheap.
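The wrapper approach from the shell one-liner might look roughly like this in Python. This is only a sketch: run_and_report is a hypothetical name, and the report destination is passed in by the caller rather than hard-coded.

```python
import subprocess
import sys


def run_and_report(argv, report_path):
    """Run argv, wait for it to finish, then append 'program exit_code'.

    Mirrors the shell idiom `myprogram; echo myprogram $? >>report`.
    Returns the child's exit code (negative if it was killed by a signal).
    """
    completed = subprocess.run(argv)  # blocks until the program terminates
    with open(report_path, "a") as report:
        report.write(f"{argv[0]} {completed.returncode}\n")
    return completed.returncode
```

One such wrapper would be started per monitored program, with a single collector reading the reports they all append to the same destination.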

There are many process monitoring programs (also called supervisors) that can report when a process dies, optionally restart it, and do far more besides: Monit, Supervise, Upstart, …
