Bash – How to loop over ever-increasing list of files in bash


I have a file generator running, and each file's name sorts alphabetically after the previous one. At first I looped with for file in /path/to/files*; do ..., but I soon realized that the glob is expanded only once, before the loop starts, so any files created while the loop is running are never processed.

My current way of doing this is quite ugly:

while :; do
    doneFileCount=$(wc -l < /tmp/results.csv)
    i=0
    for file in *; do
        if [[ $((doneFileCount>i)) = 1 ]]; then
            i=$((i+1))
            continue
        else
            process-file "$file" # prints single line to stdout
            i=$((i+1))
        fi
    done | tee -a /tmp/results.csv
done

Is there a simple way to loop over an ever-increasing list of files, without the hack described above?

Best Answer

I think the usual way is to have new files appear in one directory and to rename or move them to another after processing, so that they don't match the same glob again. Something like this:

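shopt -s nullglob   # an empty directory expands to nothing, not a literal '*'
cd new/ || exit 1
while true; do
    for f in *; do
        # move the file away only if processing succeeded
        process-file "$f" && mv -- "$f" "../processed/$f"
    done
    sleep 1   # just so that it doesn't busy-loop
done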

Or similarly with a changing file extension:

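shopt -s nullglob   # no match expands to nothing, not a literal '*.new'
while true; do
    for f in *.new; do
        # rename the file only if processing succeeded
        process-file "$f" && mv -- "$f" "${f%.new}.done"
    done
    sleep 1   # just so that it doesn't busy-loop
done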

On Linux, you could also use inotifywait to get notifications on new files.

inotifywait -q -m -e moved_to,close_write --format "%f" . | while IFS= read -r f; do
    process-file "$f"
done

In either case, you'll want to watch out for files that are still being written. A large file created in place does not appear atomically, so your script might start processing it when it is only half written.

The inotify close_write event used above fires when the writing process closes a file (so it also fires for existing files that get modified), while the create event fires as soon as the file is created, when it may still be being written. moved_to simply catches files that are moved into the directory being watched.
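
If you control the generator, one common way to avoid the half-written-file problem is to write each file under a temporary name and rename it into place once it is complete. The following is only a sketch of that idea: the function name publish_atomically and the .tmp. prefix are made up for illustration, and it assumes the temporary file lives in the same directory (and therefore the same filesystem) as the final file, so that mv is a plain rename rather than a copy.

```shell
#!/usr/bin/env bash
# Sketch: publish a file so that a consumer globbing for * or *.new
# never sees it under its final name until it is fully written.
# Assumption: temp file and destination share a filesystem, so mv is a rename.

publish_atomically() {   # hypothetical helper, not part of any tool
    local dest=$1        # final name, e.g. new/0001.new
    local tmp
    # Hidden temp name in the destination directory; a plain * or *.new
    # glob skips dotfiles by default. mktemp creates it with mode 0600,
    # so chmod it afterwards if other users need to read the result.
    tmp=$(mktemp "$(dirname -- "$dest")/.tmp.XXXXXX") || return 1
    # Write the whole payload (read from stdin) to the temporary file first.
    if cat > "$tmp"; then
        mv -- "$tmp" "$dest"   # the complete file appears under its final name
    else
        rm -f -- "$tmp"
        return 1
    fi
}
```

The generator would then end with something like some-generator | publish_atomically new/0001.new, where some-generator stands in for whatever produces the data. With the polling loops above this is enough. With the inotifywait loop, the temporary file itself would trigger close_write, so in that setup you would probably watch only moved_to, or stage the temporary file in a sibling directory on the same filesystem.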
