Foolproof way to detect if java process is running

background-process java process

I have a Java process that runs continually but sometimes, for reasons I have yet to fully debug, craps out. So I also have a cron job that looks for the process every 5 minutes, and if it finds the process isn't running, it calls a script to restart it.

The problem is that every once in a while the check-up script gets a false negative: it thinks the process isn't running when in fact it is. I haven't seen any pattern in when it does this. I need a completely foolproof way to check whether the process is running.

What I'm doing currently is this:

if ! pgrep -f '/path/to/XML2DB.jar -n' > /dev/null
then
    nice -n 19 java -Xmx2024M -jar /path/to/XML2DB.jar -n > /dev/null 2>/dev/null &
    echo "" | mail -s "$HOST: xml2db was found not running, is being started" support@mycompany.com
fi

Before pgrep, we were using ! ps ax | grep -v grep | grep "XML2DB.jar -n" > /dev/null but this was giving the same false negatives.

Linux version is Scientific Linux SL release 3.0.9 (SL) and LSB Version is 1.3.

Thanks in advance!

Best Answer

There is no way to reliably and usefully check that an unrelated process is running: a race condition is always possible. Even if you find a way to detect whether the process you're interested in is running, it might be killed immediately after you've seen it, or conversely it might get started immediately after you missed it.

If you control the program or the way it runs, you can make it reserve a unique resource such as a file lock. However, if you control the way the program is invoked, there's a simpler way to keep track of it: monitor it from its parent. A process is informed when its child dies.
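The file-lock option could look roughly like the sketch below. It assumes the flock utility from util-linux is installed, which may not be the case on a release as old as the one in the question. The lock path and the function name run_exclusive are made up for illustration.

```shell
#!/bin/sh
# Hypothetical lock file; any path the job can write to will do.
LOCKFILE="${LOCKFILE:-/tmp/xml2db.lock}"

# run_exclusive CMD [ARGS...]
# Run CMD while holding an exclusive lock on $LOCKFILE.
# If another process already holds the lock, return 75 without running CMD.
run_exclusive() {
    (
        # File descriptor 9 is opened on the lock file by the redirection below.
        flock -n 9 || exit 75
        "$@"
    ) 9>"$LOCKFILE"
}
```

A cron job could then call run_exclusive with the real command and treat status 75 as "already running". The kernel drops the lock when the last process holding the descriptor exits, so a crash cannot leave a stale lock behind the way a plain PID file can. One caveat of this sketch: a command that itself exits with status 75 is indistinguishable from a lock conflict.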

The simplest way to ensure that a process is always running is to restart it in a loop.

# sleep 1 avoids a tight loop if the process systematically fails to start
while sleep 1; do
  nice …
  ret=$?
  if [ $ret -le 127 ]; then
    msg="… exited with status $ret"
  else
    msg="… exited on signal $((ret-128))"
  fi
  # give mail an empty body on stdin so it does not wait for input
  echo "" | mail -s "$msg" "$USER"
done
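The status test in the loop relies on the shell convention that a child killed by signal N is reported as exit status 128+N. Here is that decoding pulled out into a small helper so it can be tried in isolation. The function name describe_exit is hypothetical, and the sh -c commands are stand-ins for the real process.

```shell
#!/bin/sh
# describe_exit STATUS
# Turn a shell exit status into a readable message, using the
# convention that statuses above 127 mean "killed by signal STATUS-128".
describe_exit() {
    if [ "$1" -le 127 ]; then
        echo "exited with status $1"
    else
        echo "exited on signal $(( $1 - 128 ))"
    fi
}

# Stand-in child that exits normally with status 3.
sh -c 'exit 3' || describe_exit $?

# Stand-in child that kills itself with SIGKILL (signal 9).
sh -c 'kill -KILL $$' || describe_exit $?
```

The first call reports a normal exit with status 3 and the second reports signal 9. This only works because the shell running the check is the parent of the process, which is why it fits the restart loop and not a separate cron check.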

More robust and more powerful monitoring software exists. See "How to set proper monitoring of my services in a automated way? So that if one crash it auto on the fly restarts?"
