Shell – Handling of stale file locks in Linux and robust usage of flock

flock, lock, shell-script

I have a script I execute via cron regularly (every few minutes). However, the script must not run multiple times in parallel, and it sometimes runs a bit longer, so I wanted to implement some locking, i.e. make sure the script terminates early if a previous instance is already running.

Based on various recommendations I have a locking that looks like this:

# Create/open the lock file and keep it open on a freshly allocated fd.
lock="/run/$(basename "$0").lock"
exec {fd}<>"$lock"
# Try to take an exclusive lock without blocking; exit if it is already held.
flock -n "$fd" || exit 1

This should trigger the exit 1 in case another instance of the script is still running.

Now here's the problem: sometimes a stale lock survives even though the script has already terminated. This effectively means the cron job is never executed again (until the next reboot, or until the lock file is deleted), which of course is not what I want.

I figured out there's the lslocks command that lists existing file locks. It shows this:

(unknown)        2732 FLOCK        WRITE 0     0   0 /run...                                                                 

The process (2732 in this case) no longer exists (e.g. in ps aux). It is also unclear to me why it doesn't show the full filename (i.e. only /run…). lslocks has a --notruncate parameter, which sounded as if it might avoid truncating filenames, but it does not change the output; it's still /run…

So I have multiple questions:

  • Why are these locks there and what situation causes a lock from flock to exist beyond the lifetime of the process?
  • Why does lslocks not show the full path/filename?
  • What is a good way to avoid this and make the locking in the script more robust?
  • Is there some way to clean up stale locks without a reboot?

Best Answer

An flock lock is associated with an open file description; it goes away only once all file descriptors referring to that file description have been closed (see the flock(2) manpage).
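A minimal sketch of this behaviour (the lock path and the sleep are just illustrative): any background child inherits the open fd, so the lock stays held even after the parent script has exited.

```shell
#!/bin/bash
# Demo: an flock lock lives as long as ANY fd referring to the open
# file description exists - including fds inherited by child processes.
lock=/tmp/flock-demo.lock   # example path

exec {fd}<>"$lock"
flock -n "$fd" || exit 1

# The background child inherits $fd; the lock therefore remains held
# until the child exits, even though this parent returns immediately.
sleep 60 &

exit 0   # lock is NOT released here - the sleep still holds the fd
```

After this script returns, another `flock -n /tmp/flock-demo.lock true` keeps failing until the background sleep terminates, which is exactly the "stale lock with no matching process name" symptom described in the question.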

If the file is still locked, then the file descriptor is almost certainly still referenced from either the original process or a child process (assuming that you haven't used things like file descriptor passing to propagate a reference to it outside the original process hierarchy).

I would recommend checking sudo fuser $lock_path.

To work around this issue, there are two methods I know of: either prevent the shell from letting child processes inherit the file descriptor, or kill all the processes still referencing it, e.g. using fuser -k ....
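A sketch of the first method, assuming bash: the `{fd}>&-` redirection closes the descriptor for that one command, so a background child started this way cannot keep the lock alive (`long_task` is a placeholder name, not something from the question).

```shell
#!/bin/bash
lock="/run/$(basename "$0").lock"

exec {fd}<>"$lock"
flock -n "$fd" || exit 1

# Close the lock fd for this child only, so it does not inherit it and
# cannot hold the lock after the script exits ("long_task" stands for
# whatever you run in the background):
long_task {fd}>&- &

# Alternatively, hand the whole job to flock(1), which releases the
# lock as soon as the wrapped command exits:
#   exec flock -n "$lock" /path/to/real-work
```

With the fd closed in the child, the lock disappears the moment the parent's descriptor is closed, regardless of how long the background job runs.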

The path you are seeing is incomplete because lslocks uses /proc/locks to gather information; this file contains an identifier for the mountpoint and information on the process that acquired the lock, but not the path to the locked file. If lslocks can't find the file descriptor holding the lock while inspecting that process, it falls back to only printing the mount point.
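You can look at /proc/locks yourself; each line identifies the locked file only by device and inode numbers, never by path, which is why a tool reading it has to reconstruct the path separately. A rough way to map an inode back to a file (the inode number below is purely hypothetical):

```shell
# Each /proc/locks line shows lock type, owning PID, and
# major:minor:inode of the locked file - but no path.
cat /proc/locks

# Map an inode back to a path by searching the containing filesystem
# (1234 is a made-up inode number; this scan can be slow):
sudo find /run -xdev -inum 1234
```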
