Will we ever “find” files whose names are changed by “find”? Why not?


While answering an older question, it struck me that find, in the following example, could potentially process files multiple times:

find dir -type f -name '*.txt' \
    -exec sh -c 'mv "$1" "${1%.txt}_hello.txt"' sh {} ';'

or the more efficient

find dir -type f -name '*.txt' \
    -exec sh -c 'for n; do mv "$n" "${n%.txt}_hello.txt"; done' sh {} +

The command finds .txt files and changes their filename suffix from .txt to _hello.txt.

While doing so, the directories will start accumulating new files whose names match the *.txt pattern, namely these _hello.txt files.

Question: Why are they not actually processed by find? In my experience they aren't, and we wouldn't want them to be either, as that would introduce a sort of infinite loop. The same holds with mv replaced by cp, by the way.

The POSIX standard says (my emphasis):

If a file is removed from or added to the directory hierarchy being searched it is *unspecified* whether or not find includes that file in its search.

Since it's unspecified whether new files will be included, maybe a safer approach would be

find dir -type d -exec sh -c '
    for n in "$1"/*.txt; do
        test -f "$n" && mv "$n" "${n%.txt}_hello.txt"
    done' sh {} ';'

Here we look for directories rather than files, and the for loop in the embedded sh script expands its glob once, before the first iteration, so we don't have the same potential issue.
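A quick way to convince yourself of that single expansion (a throwaway demonstration; /tmp/globtest is just a hypothetical scratch directory):

mkdir /tmp/globtest && cd /tmp/globtest
touch a.txt b.txt
for n in *.txt; do cp "$n" "${n%.txt}_copy.txt"; done
ls    # a.txt  a_copy.txt  b.txt  b_copy.txt

The loop body runs exactly twice: *.txt was expanded to a.txt and b.txt before the first iteration, so the _copy.txt files created inside the loop are never picked up, even though they match the pattern.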

The GNU find manual does not explicitly say anything about this, and neither does the OpenBSD find manual.

Best Answer

Can find find files that were created while it was walking the directory?

In brief: Yes, but it depends on the implementation. It's probably best to write the conditions so that already processed files are ignored.

As mentioned, POSIX makes no guarantees either way, just as it makes no guarantees about the underlying readdir() library function:

If a file is removed from or added to the directory after the most recent call to opendir() or rewinddir(), whether a subsequent call to readdir() returns an entry for that file is unspecified.


I tested find on my Debian system (GNU find, Debian package version 4.6.0+git+20161106-2): strace showed that it read the full directory before doing anything.
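For reference, the directory reads can be observed with an invocation along these lines (the exact syscall name, getdents or getdents64, depends on the architecture and C library, so treat this as a sketch rather than a verbatim transcript):

strace -e trace=getdents,getdents64 find . -type f > /dev/null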

Browsing the source code a bit further suggests that GNU find uses parts of gnulib to read the directories, and gnulib/lib/fts.c (gl/lib/fts.c in the find tarball) contains this:

/* If possible (see max_entries, below), read no more than this many directory
   entries at a time.  Without this limit (i.e., when using non-NULL
   fts_compar), processing a directory with 4,000,000 entries requires ~1GiB
   of memory, and handling 64M entries would require 16GiB of memory.  */
#ifndef FTS_MAX_READDIR_ENTRIES
# define FTS_MAX_READDIR_ENTRIES 100000
#endif
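Since the macro is guarded by #ifndef, the limit can presumably also be lowered at build time rather than by editing the source; something like the following untested sketch, using find's usual autoconf build:

./configure CFLAGS='-g -O2 -DFTS_MAX_READDIR_ENTRIES=100'
make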

I changed that limit to 100, and did

mkdir test; cd test; touch {0000..2999}.foo
find . -type f -exec sh -c 'mv "$1" "${1%.foo}.barbarbarbarbarbarbarbar"' sh {} \; -print

resulting in such hilarious outcomes as this file, which got renamed five times:

1046.barbarbarbarbarbarbarbar.barbarbarbarbarbarbarbar.barbarbarbarbarbarbarbar.barbarbarbarbarbarbarbar.barbarbarbarbarbarbarbar

Obviously, a very large directory (more than 100,000 entries) would be needed to trigger that effect on a default build of GNU find, but a trivial readdir-and-process loop without any caching would be even more vulnerable.

In theory, if the OS always placed renamed files last in the order in which readdir() returned them, a simple implementation like that could even fall into an endless loop.

On Linux, readdir() in the C library is implemented through the getdents() system call, which returns multiple directory entries in one go. This means that later calls to readdir() might return files that were already removed, and for very small directories you'd effectively get a snapshot of the starting state. I don't know about other systems.

In the test above, I renamed to a longer file name on purpose, to prevent the directory entry from being rewritten in place. Even so, the same test with a same-length rename also produced double and triple renames. Whether and how this matters would of course depend on the filesystem internals.

Considering all this, it's probably prudent to avoid the whole issue by making the find expression not match the files that were already processed. That is, to add -name "*.foo" in my example or ! -name "*_hello.txt" to the command in the question.
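Applied to the command from the question, the latter would look like this:

find dir -type f -name '*.txt' ! -name '*_hello.txt' \
    -exec sh -c 'for n; do mv "$n" "${n%.txt}_hello.txt"; done' sh {} +

Even if find does pick up a renamed file, the ! -name '*_hello.txt' test filters it out, so nothing gets renamed twice. (Any pre-existing files already ending in _hello.txt would be skipped too, of course.)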
