I want to delete old files in a directory that has a huge number of files in multiple subdirectories.
I am trying to use the following – after some googling it seems to be the recommended and efficient way:
find . -mindepth 2 -mtime +5 -print -delete
My expectation is that this should print each file that satisfies the conditions (modified more than 5 days ago and at least two levels deep), delete it, and then move on to the next file.
However, as this command runs, I can see that find's memory usage is increasing, but nothing has been printed (so I think nothing has been deleted yet). This seems to imply that find is first collecting all files that satisfy the conditions, and only after traversing the whole filesystem tree will it print and then delete the files.
Is there a way to get it to delete each file right away after running the tests on it? This would help do the cleanup incrementally – I could choose to kill the command and rerun it later (which would effectively resume the deletion). This does not seem to happen currently, because find does not begin deleting anything until it is done traversing the gigantic filesystem tree. Is there any way around this?
EDIT – Including requested data about my use case:
The directories I have to clean up have a maximum depth of about 4; regular files are present only at the leaves of the filesystem. There are around 600 million regular files, with the leaf directories containing at most 5 files. The directory fan-out at the lower levels is about 3; the fan-out is huge at the upper levels. Total space occupied is 6.5 TB on a single 7.2 TB LVM disk (with 4 physical ~2 TB HDDs).
Best Answer
The reason why the find command is slow
That is a really interesting issue... or, honestly, malicious:
The command

find . -mindepth 2 -mtime +5 -print -delete

is very different from the usual tryout variant, leaving out the dangerous part, `-delete`:

find . -mindepth 2 -mtime +5 -print
The tricky part is that the action `-delete` implies the option `-depth`. The command including delete is really

find . -depth -mindepth 2 -mtime +5 -print -delete

and should be tested with

find . -depth -mindepth 2 -mtime +5 -print
That is closely related to the symptoms you see: the option `-depth` changes the traversal of the filesystem tree from a preorder depth-first search to a postorder depth-first search. Before, each file or directory was used as soon as it was reached, and then forgotten about; find was using the tree itself to find its way.
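The difference in traversal order can be seen on a small throwaway tree (the paths below are made up for the demo):

```shell
# Build a tiny throwaway tree to compare the two traversal orders.
tmp=$(mktemp -d)
mkdir -p "$tmp/a/b"
touch "$tmp/a/f1" "$tmp/a/b/f2"

echo "default (preorder) - directories appear before their contents:"
find "$tmp"

echo "with -depth (postorder) - contents appear before their directories:"
find "$tmp" -depth

rm -r "$tmp"
```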
`find` now needs to collect all directories that could still contain files or directories to be found, before deleting the files in the deepest directories first. For this, it has to do the work of planning and remembering traversal steps itself, and - that's the point - in a different order than the filesystem tree naturally supports. So, indeed, it needs to collect data over many files before it can produce its first output.
find has to keep track of some directories to visit later, which is no problem for a few directories - but it can become one with many directories, for various degrees of "many".
Also, performance problems outside of find become noticeable in this kind of situation, so it is possible that it's not even `find` that is slow, but something else. The performance and memory impact of all this depends on your directory structure etc.
The relevant sections from `man find` are the "Warnings" section, and, further up, the description of `-delete` (which notes that it turns on `-depth`).
The faster solution to delete the files
You do not really need to delete the directories in the same run that deletes the files, right? If we are not deleting directories, we do not need the whole `-depth` thing; we can just find a file, delete it, and go on to the next, as you proposed. This time we can use the simple print variant for testing the find expression, with the implicit `-print`.
We want to find only plain files - no symlinks, directories, special files etc.:
find . -mindepth 2 -mtime +5 -type f
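A quick sanity check in a scratch directory (leaving out the `-mtime` test, since freshly created files are not five days old) shows that `-type f` skips directories and symlinks:

```shell
# Scratch tree: one regular file, one subdirectory, one symlink.
tmp=$(mktemp -d)
mkdir -p "$tmp/d/sub"
touch "$tmp/d/file"
ln -s file "$tmp/d/link"

# Only the regular file at depth >= 2 is printed; sub/ and link are skipped.
find "$tmp" -mindepth 2 -type f

rm -r "$tmp"
```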
We use `xargs` to delete more than one file per `rm` process started, taking care of odd filenames by using a null byte as separator.
To test this command, note the `echo` in front of the `rm`, so it prints what would be run later:

find . -mindepth 2 -mtime +5 -type f -print0 | xargs -0 echo rm
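To see why the null separator matters, try the pipeline on a scratch tree containing a filename with a space (again without `-mtime`, so the fresh test files match):

```shell
tmp=$(mktemp -d)
mkdir -p "$tmp/sub"
touch "$tmp/sub/plain" "$tmp/sub/with space"

# With -print0 / -0, "with space" stays a single argument, and both
# files land in one rm invocation (still guarded by echo here).
find "$tmp" -mindepth 2 -type f -print0 | xargs -0 echo rm

rm -r "$tmp"
```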
The lines will be very long and hard to read; for an initial test it can help to get readable output with only three files per line, by adding `-n 3` as the first arguments of `xargs`.
If all looks good, remove the `echo` in front of the `rm` and run again. That should be a lot faster.
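Spelled out, the final command is `find . -mindepth 2 -mtime +5 -type f -print0 | xargs -0 rm`. Here is a sketch of it run against a throwaway tree, using GNU `touch -d` to backdate one file so something is old enough to match:

```shell
tmp=$(mktemp -d)
mkdir -p "$tmp/a"
touch -d '10 days ago' "$tmp/a/old"   # backdated, so -mtime +5 matches
touch "$tmp/a/new"                    # fresh, so it survives

( cd "$tmp" && find . -mindepth 2 -mtime +5 -type f -print0 | xargs -0 rm )

ls "$tmp/a"    # only "new" remains
rm -r "$tmp"
```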
In case we are talking about millions of files - you wrote it's 600 million files in total - there is something more to take into account:
Most programs, including `find`, read directories using the library call `readdir(3)`. That usually uses a buffer of 32 KB to read directories, which becomes a problem when the directories, containing huge lists of possibly long filenames, are big.
The way to work around it is to use the system call for reading directory entries, `getdents(2)`, directly, and handle the buffering in a more suitable way.
For details, see You can list a directory containing 8 million files! But not with ls..
(It would be interesting if you can add details to your question on the typical numbers of files per directory, directories per directory, and the maximum depth of paths; also, which filesystem is used.)
(If it is still slow, you should check for filesystem performance problems.)