While answering an older question, it struck me that find, in the following example, could potentially process files multiple times:
find dir -type f -name '*.txt' \
-exec sh -c 'mv "$1" "${1%.txt}_hello.txt"' sh {} ';'
or the more efficient
find dir -type f -name '*.txt' \
-exec sh -c 'for n; do mv "$n" "${n%.txt}_hello.txt"; done' sh {} +
The command finds .txt files and changes their filename suffix from .txt to _hello.txt.
While doing so, the directories will start accumulating new files whose names match the *.txt pattern, namely these _hello.txt files.
Question: Why are these new files not actually processed by find? In my experience they aren't, and we don't want them to be either, as that would introduce a sort of infinite loop. The same holds with mv replaced by cp, by the way.
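For reference, the behaviour described above is easy to probe with a small scratch tree (the file names here are made up for illustration):

```shell
# Sketch: run the rename from the question once in a throwaway directory.
tmp=$(mktemp -d)
mkdir "$tmp/dir"
touch "$tmp/dir/a.txt" "$tmp/dir/b.txt"

find "$tmp/dir" -type f -name '*.txt' \
    -exec sh -c 'mv "$1" "${1%.txt}_hello.txt"' sh {} ';'

# With the GNU find builds I have tried, each file is renamed exactly
# once, leaving a_hello.txt and b_hello.txt -- but as discussed below,
# POSIX does not guarantee this.
ls "$tmp/dir"
```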
The POSIX standard says (my emphasis):

If a file is removed from or added to the directory hierarchy being searched it is unspecified whether or not find includes that file in its search.
Since it's unspecified whether new files will be included, maybe a safer approach would be
find dir -type d -exec sh -c '
    for n in "$1"/*.txt; do
        test -f "$n" && mv "$n" "${n%.txt}_hello.txt"
    done' sh {} ';'
Here, we look not for files but for directories, and the for loop of the internal sh script expands its word list once, before the first iteration, so we don't have the same potential issue.
The GNU find manual does not explicitly say anything about this, and neither does the OpenBSD find manual.
Best Answer
Can find find files that were created while it was walking the directory?

In brief: yes, but it depends on the implementation. It's probably best to write the conditions so that already processed files are ignored.
As mentioned, POSIX makes no guarantees either way, just as it makes no guarantees about the underlying readdir() library call:

I tested find on my Debian system (GNU find, Debian package version 4.6.0+git+20161106-2). strace showed that it read the full directory before doing anything.

Browsing the source code a bit more, it seems that GNU find uses parts of gnulib to read the directories, and there's this in gnulib/lib/fts.c (gl/lib/fts.c in the find tarball):

I changed that limit to 100, and did

resulting in such hilarious results as this file, which got renamed five times:
Obviously, a very large directory (more than 100 000 entries) would be needed to trigger that effect on a default build of GNU find, but a trivial readdir+process loop without caching would be even more vulnerable.
In theory, if the OS always added renamed files last in the order in which readdir() returned them, a simple implementation like that could even fall into an endless loop.

On Linux, readdir() in the C library is implemented through the getdents() system call, which already returns multiple directory entries at one go. This means that later calls to readdir() might return files that were already removed, but for very small directories you'd effectively get a snapshot of the starting state. I don't know about other systems.

In the above test, I did the renames to a longer file name on purpose: to prevent the file name from being overwritten in place. No matter, the same test with a same-length rename also produced double and triple renames. If and how this matters would of course depend on the filesystem internals.
Considering all this, it's probably prudent to avoid the whole issue by making the find expression not match files that were already processed. That is, to add -name "*.foo" in my example, or ! -name "*_hello.txt" to the command in the question.
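Concretely, the command from the question with such a guard added might look like this (a sketch; the extra ! -name test is the only change). With the guard, the command is safe regardless of how the implementation reads the directory, because newly created files simply no longer match:

```shell
tmp=$(mktemp -d)
touch "$tmp/a.txt" "$tmp/b_hello.txt"

# Excluding already-renamed files makes the rename idempotent: even if
# find picks up files created during the walk, they fail the test.
find "$tmp" -type f -name '*.txt' ! -name '*_hello.txt' \
    -exec sh -c 'mv "$1" "${1%.txt}_hello.txt"' sh {} ';'

# a.txt became a_hello.txt; the pre-existing b_hello.txt was left alone.
ls "$tmp"
```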