I want to delete old files in a directory that has a huge number of files in multiple subdirectories.
I am trying to use the following – after some googling it seems to be the recommended and efficient way:
find . -mindepth 2 -mtime +5 -print -delete
My expectation is that this should print each file that satisfies the conditions (modified more than 5 days ago and at least two levels deep), delete it, and then move on to the next file.
However, as this command runs, I can see that find's memory usage is increasing, but nothing has been printed (so I think nothing has been deleted yet). This seems to imply that find is first collecting all files that satisfy the conditions, and only after traversing the whole filesystem tree will it print and then delete the files.
Is there a way to get it to delete each file right away after running the tests on it? This would help do the cleanup incrementally – I could choose to kill the command and rerun it later (which would effectively resume the deletion). This does not seem to happen currently, because find does not begin deleting anything until it is done traversing the gigantic filesystem tree. Is there any way around this?
EDIT – Including requested data about my use case:
The directories I have to clean up have a maximum depth of about 4; regular files are present only at the leaves of the filesystem. There are around 600 million regular files, with the leaf directories containing at most 5 files. The directory fan-out at the lower levels is about 3; the fan-out is huge at the upper levels. Total space occupied is 6.5 TB on a single 7.2 TB LVM disk (with 4 physical ~2 TB HDDs).
Best Answer
The reason why the find command is slow
That is a really interesting issue... or, honestly, malicious:
The command

find . -mindepth 2 -mtime +5 -print -delete

is very different from the usual tryout variant, leaving out the dangerous part, `-delete`:

find . -mindepth 2 -mtime +5 -print
The tricky part is that the action `-delete` implies the option `-depth`. The command including delete is really

find . -depth -mindepth 2 -mtime +5 -print -delete

and should be tested with

find . -depth -mindepth 2 -mtime +5 -print
That is closely related to the symptoms you see: the option `-depth` changes the traversal of the filesystem tree from a preorder depth-first search to a postorder depth-first search. Before, each file or directory was used as soon as it was reached, and then forgotten about; find was using the tree itself to find its way.
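The difference in traversal order can be seen on a small throwaway tree (the paths below are made up for the demo):

```shell
# Build a tiny throwaway tree to compare the two traversal orders.
tmp=$(mktemp -d)
mkdir -p "$tmp/a/b"
touch "$tmp/a/f1" "$tmp/a/b/f2"

echo "default (preorder) - directories appear before their contents:"
find "$tmp"

echo "with -depth (postorder) - contents appear before their directories:"
find "$tmp" -depth

rm -r "$tmp"
```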
`find` now needs to collect all directories that could still contain files or directories to be found, before deleting the files in the deepest directories first. For this, it has to do the work of planning and remembering traversal steps itself, and - that's the point - in a different order than the filesystem tree naturally supports. So, indeed, it needs to collect data over many files before it can produce its first output.
find has to keep track of some directories to visit later, which is no problem for a few directories - but it can become one with many directories, for various degrees of "many".
Also, performance problems outside of find become noticeable in this kind of situation, so it is possible that it's not even `find` that is slow, but something else. The performance and memory impact of all this depends on your directory structure etc.
The relevant sections from `man find` are the "Warnings" section, and, further up, the description of `-delete` (which notes that it turns on `-depth`).
The faster solution to delete the files
You do not really need to delete the directories in the same run that deletes the files, right? If we are not deleting directories, we do not need the whole `-depth` thing; we can just find a file, delete it, and go on to the next, as you proposed. This time we can use the simple print variant for testing the find expression, with the implicit `-print`.
We want to find only plain files - no symlinks, directories, special files etc.:
find . -mindepth 2 -mtime +5 -type f
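A quick sanity check in a scratch directory (leaving out the `-mtime` test, since freshly created files are not five days old) shows that `-type f` skips directories and symlinks:

```shell
# Scratch tree: one regular file, one subdirectory, one symlink.
tmp=$(mktemp -d)
mkdir -p "$tmp/d/sub"
touch "$tmp/d/file"
ln -s file "$tmp/d/link"

# Only the regular file at depth >= 2 is printed; sub/ and link are skipped.
find "$tmp" -mindepth 2 -type f

rm -r "$tmp"
```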
We use `xargs` to delete more than one file per `rm` process started, taking care of odd filenames by using a null byte as separator.
To test this command, note the `echo` in front of the `rm`, so it prints what would be run later:

find . -mindepth 2 -mtime +5 -type f -print0 | xargs -0 echo rm
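To see why the null separator matters, try the pipeline on a scratch tree containing a filename with a space (again without `-mtime`, so the fresh test files match):

```shell
tmp=$(mktemp -d)
mkdir -p "$tmp/sub"
touch "$tmp/sub/plain" "$tmp/sub/with space"

# With -print0 / -0, "with space" stays a single argument, and both
# files land in one rm invocation (still guarded by echo here).
find "$tmp" -mindepth 2 -type f -print0 | xargs -0 echo rm

rm -r "$tmp"
```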
The lines will be very long and hard to read; for an initial test it can help to get readable output with only three files per line, by adding `-n 3` as the first arguments of `xargs`.
If all looks good, remove the `echo` in front of the `rm` and run again. That should be a lot faster.
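Spelled out, the final command is `find . -mindepth 2 -mtime +5 -type f -print0 | xargs -0 rm`. Here is a sketch of it run against a throwaway tree, using GNU `touch -d` to backdate one file so something is old enough to match:

```shell
tmp=$(mktemp -d)
mkdir -p "$tmp/a"
touch -d '10 days ago' "$tmp/a/old"   # backdated, so -mtime +5 matches
touch "$tmp/a/new"                    # fresh, so it survives

( cd "$tmp" && find . -mindepth 2 -mtime +5 -type f -print0 | xargs -0 rm )

ls "$tmp/a"    # only "new" remains
rm -r "$tmp"
```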
In case we are talking about millions of files - you wrote it's 600 million files in total - there is something more to take into account:
Most programs, including `find`, read directories using the library call `readdir(3)`. That usually uses a buffer of 32 KB to read directories, which becomes a problem when the directories, containing huge lists of possibly long filenames, are big.
The way to work around it is to use the system call for reading directory entries, `getdents(2)`, directly, and handle the buffering in a more suitable way.
For details, see You can list a directory containing 8 million files! But not with ls..
(It would be interesting if you can add details to your question on the typical numbers of files per directory, directories per directory, and the maximum depth of paths; also, which filesystem is used.)
(If it is still slow, you should check for filesystem performance problems.)