Is `ln` atomic and reliable on NFS? Could NFS replace GFS in this use case?

concurrency gfs ln lock nfs

I have a cluster with a bunch of servers with a shared disk containing a GFS global file system that all nodes access simultaneously.

Each node in the cluster runs the same program (a shell script is the main core).
The system processes files that appear in a couple of input directories, and it works like this:

1. The program loops through the input directories.
2. For each file found, it checks for a "lock file". If the lock file exists, it skips to the next file.
3. If no lock file is found, it creates one. If lock file creation fails (race lost), it skips to the next file.
4. If "we" own the lock, it processes the file and moves it out of the way when it is finished.
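
The steps above can be sketched roughly as follows. This is only an illustration under assumptions: the function names (try_lock, handle_file, process_dir) and the directory arguments are hypothetical, handle_file stands in for the real processing, and the lock is released once the file has been moved (the question does not say when the lock is removed).

```shell
#!/bin/sh
# Hypothetical helper: try to take lock file $1 for session id $2.
# Returns 0 only for the one caller whose ln succeeds.
try_lock() {
    date "+%s:$2" > "$1.$2" || return 1
    ln "$1.$2" "$1" 2>/dev/null
    rc=$?
    rm -f "$1.$2"
    return "$rc"
}

# Placeholder for the real work done on each file (assumption).
handle_file() {
    wc -c < "$1" > /dev/null
}

# Scan one input directory once: process_dir INDIR LOCKDIR DONEDIR UNID
process_dir() {
    indir=$1 lockdir=$2 donedir=$3 unid=$4
    for f in "$indir"/*; do
        [ -f "$f" ] || continue
        lock="$lockdir/lock.$(basename "$f")"
        [ -e "$lock" ] && continue              # another node owns this file
        try_lock "$lock" "$unid" || continue    # race lost
        # Re-check: the file may have been moved away before we got the lock.
        if [ -f "$f" ]; then
            handle_file "$f" && mv "$f" "$donedir"/
        fi
        rm -f "$lock"
    done
}
```

A node would call something like process_dir /gfs/in /gfs/tmp /gfs/done "$unid" inside its main loop; those paths other than /gfs/tmp are made up for the example.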

This all works very well, but I wonder if there are cheaper (less complex) solutions that would also work. I'm thinking NFS or SMB perhaps.

There are two reasons for my use of GFS:

each file is stored in one place only (on redundant underlying hardware of course)
file locking works reliably

I create the lockfile like this:

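# write a timestamp into a file unique to this session, then try to hard-link it into place
date "+%s:${unid}" > "${currlock}.${unid}"
ln "${currlock}.${unid}" "${currlock}"
lockrc=$?    # 0 only if this session created the lock
rm -f "${currlock}.${unid}"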

where $unid is a unique session identifier and $currlock is /gfs/tmp/lock.${file_to_process}

The beauty of ln is that it is atomic: when several nodes attempt the same link at the same time, it fails for all but one of them.

So, I guess what I'm asking is: will NFS fill my needs? Does ln work reliably in the same way on NFS as on GFS?

Best Answer

The link() system call on the NFS client should map directly to the NFS LINK operation, which the server should implement using its link() system call. So as long as link() is atomic on the server, it will also be atomic on the clients.
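
One caveat worth hedging against: on NFS a client can see an error for a LINK that actually succeeded on the server, for example when a reply is lost and the request is retransmitted. The Linux open(2) manual page suggests that if link() does not return success, the caller should stat its own unique file and treat a link count of 2 as success. Below is a minimal sketch of that check in shell. The function name acquire_lock is hypothetical, and reading the link count from the second field of ls -ld is just one way to get it.

```shell
#!/bin/sh
# acquire_lock LOCKFILE UNID (hypothetical helper)
# Returns 0 if this session now owns LOCKFILE, non-zero otherwise.
acquire_lock() {
    tmp="$1.$2"
    date "+%s:$2" > "$tmp" || return 1
    if ln "$tmp" "$1" 2>/dev/null; then
        rc=0
    else
        # ln reported failure; check the link count of our own file.
        # 2 means the hard link exists after all, 1 means we really lost.
        nlink=$(ls -ld "$tmp" | awk '{print $2}')
        if [ "$nlink" = 2 ]; then rc=0; else rc=1; fi
    fi
    rm -f "$tmp"
    return "$rc"
}
```

It would be used like the snippet in the question, for example: acquire_lock "/gfs/tmp/lock.${file_to_process}" "$unid" && process the file.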

Related Solutions

Shell Script Locking – Correct Locking in Shell Scripts

Here's another way to do locking in shell script that can prevent the race condition you describe above, where two jobs may both pass line 3. The noclobber option will work in ksh and bash. Don't use set noclobber because you shouldn't be scripting in csh/tcsh. ;)

lockfile=/var/tmp/mylock

if ( set -o noclobber; echo "$$" > "$lockfile") 2> /dev/null; then

        trap 'rm -f "$lockfile"; exit $?' INT TERM EXIT

        # do stuff here

        # clean up after yourself, and release your trap
        rm -f "$lockfile"
        trap - INT TERM EXIT
else
        echo "Lock Exists: $lockfile owned by $(cat "$lockfile")"
fi

YMMV with locking on NFS (you know, when NFS servers are not reachable), but in general it's much more robust than it was 10 years ago.

If you have cron jobs that do the same thing at the same time, from multiple servers, but you only need 1 instance to actually run, then something like this might work for you.

I have no experience with lockrun, but having a pre-set lock environment prior to the script actually running might help. Or it might not. You're just setting the test for the lockfile outside your script in a wrapper, and theoretically, couldn't you just hit the same race condition if two jobs were called by lockrun at exactly the same time, just as with the 'inside-the-script' solution?

File locking is pretty much honor system behavior anyways, and any scripts that don't check for the lockfile's existence prior to running will do whatever they're going to do. Just by putting in the lockfile test, and proper behavior, you'll be solving 99% of potential problems, if not 100%.

If you run into lockfile race conditions a lot, it may be an indicator of a larger problem, like not having your jobs timed right, or perhaps if interval is not as important as the job completing, maybe your job is better suited to be daemonized.

EDIT BELOW - 2016-05-06 (if you're using KSH88)

Based on @Clint Pachl's comment below, if you use ksh88, use mkdir instead of noclobber. This mostly mitigates a potential race condition, but doesn't entirely eliminate it (though the risk is minuscule). For more information read the link that Clint posted below.

lockdir=/var/tmp/mylock
pidfile=$lockdir/pid

if ( mkdir "$lockdir" ) 2> /dev/null; then
        echo $$ > "$pidfile"
        trap 'rm -rf "$lockdir"; exit $?' INT TERM EXIT
        # do stuff here

        # clean up after yourself, and release your trap
        rm -rf "$lockdir"
        trap - INT TERM EXIT
else
        echo "Lock Exists: $lockdir owned by $(cat "$pidfile")"
fi

And, as an added advantage, if you need to create tmpfiles in your script, you can use the lockdir directory for them, knowing they will be cleaned up when the script exits.
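
As a rough sketch of that idea, the lock directory can double as scratch space. The wrapper name with_lockdir and the LOCKDIR environment variable are made up for this example. It runs a command while holding the mkdir lock and removes the directory, including any temp files in it, when the command exits.

```shell
#!/bin/sh
# with_lockdir DIR COMMAND [ARGS...]   (hypothetical wrapper)
# Returns 1 without running COMMAND if DIR already exists (lock held).
with_lockdir() {
    wl_dir=$1; shift
    mkdir "$wl_dir" 2>/dev/null || return 1
    (
        # The EXIT trap removes the lock and everything stored inside it.
        trap 'rm -rf "$wl_dir"' EXIT
        trap 'exit 1' INT TERM
        echo $$ > "$wl_dir/pid"
        LOCKDIR=$wl_dir
        export LOCKDIR
        "$@"
    )
}
```

For example, with_lockdir /var/tmp/mylock sh -c 'sort input > "$LOCKDIR/sorted.tmp"' leaves nothing behind in /var/tmp/mylock once it finishes.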

For more modern bash, the noclobber method at the top should be suitable.

NFS file locking not working, am I misunderstanding

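flock historically didn't work over NFS. On Linux, kernels up to 2.6.11 do not lock files over NFS with flock(); since 2.6.12 the NFS client emulates flock() with fcntl() byte-range locks on the whole file, so the behaviour depends on the kernel and server in use.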

See flock vs lockf on Linux for one comparison of lockf and flock.

Here is a possible solution: Correct locking in shell scripts?
