MacOS – Why Does a Shell Script Trapping SIGTERM Work When Run Manually, But Not When Run via launchd

command-line launchd macos terminal

Okay, quite simply I have a shell script that needs to wait for something to happen, but it has a lock-file and some child-processes that I need to ensure are tidied up if the script is interrupted.

I've achieved this without issue by using the trap command to set some appropriate actions, and have come up with a script that looks a bit like this:

#!/bin/sh
LOG="$0.log"

# Create a lock-file to prevent simultaneous access
# Note: the exit must run in the current shell (braces), not in a $(...) subshell
lockfile -l 86400 "$LOG.lock" || { echo 'Locking failed' >&2; exit 3; }

# Create trap for interrupt and cleanup
on_complete() {
    echo $(date +%R)' Ended.' >> "$LOG"
    kill $(jobs -p)
    rm -f "$LOG.lock"
    exit
}
trap 'on_complete 2> /dev/null' SIGTERM SIGINT SIGHUP EXIT

# Do nothing
echo $(date +%R)' Running…' >> "$LOG"
sleep 86400 &
while wait; do sleep 86400 & done

This runs just fine in a terminal via sh Example.sh, and terminating it with Ctrl + C causes it to remove its lock-file without any fuss.

I then tried creating a launchd job for this script like so:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>org.example</string>
    <key>ProgramArguments</key>
    <array>
        <string>sh</string>
        <string>~/Downloads/Example.sh</string>
    </array>
    <key>RunAtLoad</key>
    <true/>
    <key>EnableGlobbing</key>
    <true/>
</dict>
</plist>

Creating Example.sh and Example.plist from the above in the ~/Downloads folder allows me to run the launchd job via launchctl load ~/Downloads/Example.plist and end it via launchctl unload ~/Downloads/Example.plist. However, ending the job doesn't cause a SIGTERM to reach the script, which is instead SIGKILL'd after the 20-second timeout.

So what I'd like to know is: why isn't my script receiving SIGTERM, and how can I ensure it does?

Best Answer

The underlying problem is that Bash does not run a trap while it is waiting on a foreground command, and it does not normally kill its non-builtin children. From the Bash manual:

If bash is waiting for a command to complete and receives a signal for which a
trap has been set, the trap will not be executed until the command completes.
When bash is waiting for an asynchronous command via the wait builtin, the
reception of a signal for which a trap has been set will cause the wait builtin
to return immediately with an exit status greater than 128, immediately after
which the trap is executed.

When you hit <CTRL>+<C> you're killing the shell script, which behaves normally, but the sleep lives on. Run ps to confirm this.

When you try to stop the script externally, via kill, Bash behaves as described above: the trap is not executed until the command it is waiting on completes. After some time-out period (I'm guessing 20 seconds) launchd then issues a kill -9, which the script cannot trap.
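To see the difference for yourself, here is a rough sketch that times how long a small child shell takes to react to SIGTERM in each case. The helper name time_to_react and the 4-second sleep are made up for this illustration. The trailing ":" is only there so the shell cannot replace itself with the sleep process.

```shell
#!/bin/sh
# Hypothetical helper: start a child shell that traps TERM, send it TERM
# after one second, and print how many seconds it took to exit.
time_to_react() {
    # Output is discarded so a leftover sleep cannot hold our pipe open.
    sh -c "trap 'exit 0' TERM; $1; :" >/dev/null 2>&1 &
    child=$!
    sleep 1                      # give the child time to install its trap
    start=$(date +%s)
    kill -TERM "$child"
    wait "$child"
    end=$(date +%s)
    echo $((end - start))
}

# Foreground sleep: the trap is deferred until the sleep has finished.
foreground=$(time_to_react 'sleep 4')

# Background sleep plus wait: wait returns at once and the trap runs.
background=$(time_to_react 'sleep 4 & wait')

echo "foreground sleep: reacted after about ${foreground}s"
echo "background sleep + wait: reacted after about ${background}s"
```

The first figure should be close to the time left on the sleep, the second close to zero. Scale the 4 seconds up to 86400 and it is clear why launchd gives up and sends SIGKILL.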

The solution is to run the sleep in the background and issue a wait after it, which tells Bash that it can be interrupted:

sleep 86400 & wait

This will allow the script to be interrupted, but the sleep will still survive. I'm sure there's a way to kill the children, but I didn't bother looking it up...
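For what it's worth, one way to deal with the surviving sleep is to remember its PID via $! and kill it from the trap handler, much like the kill $(jobs -p) in the question's own script. Below is a minimal sketch of that idea, not a tested launchd setup. The scratch file and the names cleanup, sleep_pid and script_pid are assumptions made for this example. The outer part only plays the role of launchd by sending SIGTERM.

```shell
#!/bin/sh
# Hypothetical scratch file used to report the sleep's PID to the caller.
pidfile=$(mktemp "${TMPDIR:-/tmp}/sleep_pid.XXXXXX")

# Stand-in for the real script: background sleep, wait, and a trap
# that kills the sleep before exiting.
sh -c '
    cleanup() {
        kill "$sleep_pid" 2>/dev/null   # terminate the background sleep
        exit 0
    }
    trap cleanup TERM INT HUP
    sleep 30 &
    sleep_pid=$!
    echo "$sleep_pid" > "$1"
    wait "$sleep_pid"                   # interruptible, so the trap runs promptly
' sh "$pidfile" >/dev/null 2>&1 &
script_pid=$!

sleep 1                                 # let the child start its sleep
sleep_pid=$(cat "$pidfile")
kill -TERM "$script_pid"                # the signal the question expects on unload
wait "$script_pid"
script_status=$?

sleep 1                                 # give the sleep a moment to disappear
if kill -0 "$sleep_pid" 2>/dev/null; then
    sleep_state=alive
else
    sleep_state=gone
fi
rm -f "$pidfile"
echo "script exit status: $script_status, sleep: $sleep_state"
```

Keeping the PID in a variable avoids relying on job-control details, but kill $(jobs -p) should work just as well if there are several background jobs.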