Faux pas: the "fast" method I mention below is not 60 times faster than the slow one; it is 30 times faster. I'll blame the mistake on the hour (3AM is not my best time of day for clear thinking :).
Update: I've added a summary of test times (below).
There seem to be two issues involved with the speed factor:
- The choice of command used (Time comparisons shown below)
- The nature of large numbers of files in a directory… It seems that "big is bad": things get disproportionately slower as the numbers increase.
All the tests have been done with 1 million files.
(real, user, and sys times are in the test scripts)
The test scripts can be found at paste.ubuntu.com
#
# 1 million files
# ===============
#
# |time |new dir |Files added in ASCENDING order
# +---- +------- +-------------------------------------------------
# real 01m 33s Add files only (ASCENDING order) ...just for ref.
# real 02m 04s Add files, and make 'rm' source (ASCENDING order)
# Add files, and make 'rm' source (DESCENDING order)
# real 00m 01s Count of filenames
# real 00m 01s List of filenames, one per line
# ---- ------- ------
# real 01m 34s 'rm -rf dir'
# real 01m 33s 'rm filename' via rm1000filesPerCall (1000 files per 'rm' call)
# real 01m 40s 'rm filename' via ASCENDING algorithm (1000 files per 'rm' call)
# real 01m 46s 'rm filename' via DESCENDING algorithm (1000 files per 'rm' call)
# real 21m 14s 'rm -r dir'
# real 21m 27s 'find dir -name "hello*" -print0 | xargs -0 -n 1000 rm'
# real 21m 56s 'find dir -name "hello*" -delete'
# real 23m 09s 'find dir -name "hello*" -print0 | xargs -0 -P 0 rm'
# real 39m 44s 'rm filename' (one file per rm call) ASCENDING
# real 47m 26s 'rm filename' (one file per rm call) UNSORTED
#
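For reference, here is a minimal sketch of the batched-deletion approach from the table above (1000 filenames per 'rm' call, via 'find | xargs'). The directory name and the tiny file count are placeholders, not the original test setup:

```shell
#!/bin/bash
# Sketch: delete files in batches, many names per 'rm' invocation.
# 'dir' and the 'hello*' pattern echo the timing table; 3 files stand in
# for the 1 million used in the real tests.
mkdir -p dir
for i in 1 2 3; do touch "dir/hello$i"; done

# -print0 / -0 keep unusual filenames safe; -n 1000 batches the arguments
# so each 'rm' process handles up to 1000 names.
find dir -name 'hello*' -print0 | xargs -0 -n 1000 rm
```

The batching matters because it amortizes the fork/exec cost of 'rm' over many files, which is why the one-file-per-call rows at the bottom of the table are so much slower.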
I recently created and deleted 10 million empty test files.
Deleting files on a name-by-name basis (i.e. 'rm filename'), I found out the hard way that there is a huge time difference between two different methods…
Both methods use the exact same 'rm filename' command.
Update: as it turns out, the commands were not exactly the same… One of them was sending 1000 filenames at a time to 'rm'. It was a shell brace-expansion issue: I thought each filename was being written to the feeder file on a line of its own, but actually it was 1000 per line.
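That brace-expansion pitfall can be reproduced on a small scale (the names and the count of 5 here are illustrative, not the original script):

```shell
#!/bin/bash
# Brace expansion happens BEFORE the command runs, so 'echo' receives all
# five names as arguments and writes them on ONE line:
echo hello{1..5}.txt > feeder_oneline

# printf repeats its format once per argument, giving one name per line:
printf '%s\n' hello{1..5}.txt > feeder_perline
```

A feeder file built the first way silently turns a one-file-per-call loop into a 1000-files-per-call loop, which is exactly the discrepancy described above.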
The filenames are provided via a 'feeder file' into a 'while read' loop.
The feeder file is the output of 'ls -1 -f'.
The methods are identical in all respects, except for one thing:
- the slow method uses the unsorted feeder file direct from 'ls -1 -f'
- the fast method uses a sorted version of that same unsorted file
I'm not sure whether the sorting is the issue here, or whether the sorted feeder file just happens to match the sequence in which the files were created (I used a simple ascending-integer algorithm).
For 1 million files, the fast 'rm filename' method is 60 times faster than the slow method… again, I don't know if this is a "sorting" issue, or a behind-the-scenes hash-table issue… I suspect it is not a simple sorting issue, because why would 'ls -1 -f' intentionally give me an unsorted listing of a freshly added "sorted" sequence of filenames…
I'm just wondering what is going on here, so it doesn't take me days (yes, days) to delete the next 10 million files 🙂 … I say "days" because I tried so many alternatives, and the times involved increase disproportionately to the number of files involved, so I've only tested 1 million in detail.
BTW: Deleting the files via the "sorted list" of names is actually faster than 'rm -rf' by a factor of 2.
And: 'rm -r' was 30 times slower than the "sorted list" method
… but is "sorted" the issue here? or is it more related to a hashing(or whatever) method of storage used by ext4?
The thing which quite puzzles me is that each call to 'rm filename' is unrelated to the previous one… (well, at least it is that way from the 'bash' perspective)
I'm using Ubuntu / bash / 'ext4' / SATA II drive.
Best Answer
'rm -r' is expected to be slow, as it's recursive: a depth-first traversal has to be made of the directory structure.
Now, how did you create 10 million files? Did you use some script which loops in some order (1.txt, 2.txt, 3.txt, ...)? If yes, then those files may also be allocated in that same order in contiguous blocks on the HDD, so deleting in the same order will be faster.
'ls -f' enables '-aU', which lists entries in raw directory order.