Ssh – re-touching files over sshfs

filesystems ssh sshfs synchronization touch

Machine A's file system is mounted on machine B via sshfs. A process running on B, which was started from A over ssh, touches a file on the mounted file system of A to communicate a signal; the signal (file) is then removed by A. This works reliably the first time the file is created and removed (touch/rm).

However, if a second process (again, running on B, spawned from A) tries to touch exactly the same file, the following error is sporadically thrown:

`touch: cannot touch '/path/to/file': No such file or directory`.

The path is valid as judged by the fact that attempts to touch it manually after the error is thrown are uniformly successful. As mentioned, the error is sporadic (complicating attempts to debug), but only occurs when the file is touched after already having undergone a cycle of creation/deletion.

The actions that intermittently produce the error (touch, rm, touch) are separated in time, so concurrent access is unlikely to be the culprit: the second touch does not happen until the file produced by the first touch has been removed. Thinking the cause might be file system buffering, I call sync on A after removing the file, to no avail. Calling sync on B immediately before touching the file does not help either, though I do not know whether B's sync call affects the file system of A; the version of sync on B lacks the -f option for specifying a file system explicitly.

I also tried to run sync on A from the process running on B, via `ssh user@A sync`, before the touch. The process appears to exit without error just after the sync-over-ssh call, because the remaining lines, including the touch statement, are not executed. Perhaps it is not possible to ssh from the server back to the client within a process that was itself started by ssh from the client to the server.

How may the cause of this file system-related error be determined?

Best Answer

You can investigate what is happening by running sshfs with the option -o debug. It prints a lot of information about the basic filesystem operations performed when you run a test touch command. An example operation is:

unique: 209, opcode: LOOKUP (1), nodeid: 1, insize: 45, pid: 10641
LOOKUP /test
getattr /test
   NODEID: 44
   unique: 209, success, outsize: 144

The relevant part is that a getattr call was made, and it ended in success. When you do a successful touch on a nonexistent file, the operations we see are (with the details removed):

getattr /test
   unique: 190, error: -2 (No such file or directory), outsize: 16
create flags: 0x8841 /test 0100644 umask=0022
fgetattr[140469187119648] /test
flush[140469187119648]
utime /test 1507647885 1507647885
getattr /test
flush[140469187119648]
release[140469187119648] flags: 0x8801

We see the getattr test for the file fails, which is normal as it does not exist, so we go on to create the file.

If the file is now removed on the server and we run touch again on the client, we see a different sequence:

getattr /test
   unique: 215, success, outsize: 144
open flags: 0x8801 /test
   unique: 216, error: -2 (No such file or directory), outsize: 16

Now the getattr says the file still exists, so touch goes on to open() the file, but this results in your error message: the file does not really exist at all.
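If the debug output is saved to a file, the pattern described above can be searched for mechanically. The following is only a sketch: the function name `find_stale_lookups` is hypothetical, it assumes the log lines look like the excerpts shown here, it only tracks getattr and open requests, and it does not handle paths containing spaces.

```shell
#!/bin/sh
# Hypothetical helper: scan a saved `sshfs -o debug` log for a getattr that
# succeeds on a path, followed by an open on the same path that fails with
# error -2 (No such file or directory).
find_stale_lookups() {
    awk '
        # Remember the most recent getattr or open request and its path.
        $1 == "getattr"                { op = "getattr"; path = $2; next }
        $1 == "open" && $2 == "flags:" { op = "open";    path = $4; next }

        # A line with an opcode field is a new request header, not a reply.
        $1 == "unique:" && /opcode:/   { op = ""; next }

        # Successful reply: if it answers a getattr, the client now
        # believes the path exists.
        $1 == "unique:" && /success/ {
            if (op == "getattr") seen[path] = 1
            op = ""
            next
        }

        # ENOENT reply: suspicious if it answers an open on a path that an
        # earlier getattr reported as existing.
        $1 == "unique:" && /error: -2/ {
            if (op == "open" && (path in seen))
                print "possible stale cache entry: " path
            if (op == "getattr") delete seen[path]
            op = ""
            next
        }
    ' "$@"
}
```

Called as `find_stale_lookups debug.log` (where `debug.log` is an assumed file name for the captured output), it prints one line per suspicious open and nothing for the normal create sequence.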

So the problem appears to be that the client's cache of which files exist is too slow to catch up with changes on the remote. The simplest fix is to mount your remote with a shorter cache timeout for the getattr call, i.e. the stat() system call. This should work for you:

sshfs -o cache_stat_timeout=0 ...
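
If remounting is not an option, a client-side workaround may be to retry the touch after a pause. This is not part of the diagnosis above, only a sketch: it assumes the stale entry expires on its own, which is consistent with the manual touches in the question succeeding later, and the total wait would need to outlast the cache timeout of the mount. The function name `touch_retry` and its default attempt count and delay are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: touch_retry FILE [ATTEMPTS] [DELAY_SECONDS]
# Retries touch until it succeeds or the attempts are used up.
touch_retry() {
    file=$1
    attempts=${2:-5}   # assumed default, tune to the cache timeout
    delay=${3:-1}      # assumed default pause between attempts, in seconds
    i=1
    while [ "$i" -le "$attempts" ]; do
        # Success: the file was created or its timestamps were updated.
        if touch -- "$file" 2>/dev/null; then
            return 0
        fi
        # Failure may come from a stale cached stat entry; wait and retry.
        sleep "$delay"
        i=$((i + 1))
    done
    echo "touch_retry: giving up on $file after $attempts attempts" >&2
    return 1
}
```

For example, `touch_retry /path/to/file 10 3` would try for roughly 30 seconds before reporting failure. Unlike the mount option, this hides the symptom and leaves the cache behaviour unchanged.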