Bash – Why is deleting files by name painfully slow and also exceptionally fast

bash, ext4, filesystems

Faux pas: The "fast" method I mention below is not 60 times faster than the slow one; it is 30 times faster. I'll blame the mistake on the hour (3 AM is not my best time of day for clear thinking :).

Update: I've added a summary of test times (below).
There seem to be two issues involved with the speed factor:

  • The choice of command used (Time comparisons shown below)
  • The nature of large numbers of files in a directory… It seems that "big is bad": things get disproportionately slower as the numbers increase.

All the tests have been done with 1 million files.
(real, user, and sys times are in the test scripts)
The test scripts can be found at paste.ubuntu.com

#
# 1 million files           
# ===============
#
#  |time   |new dir   |Files added in  ASCENDING order  
#  +----   +-------   +------------------------------------------------- 
#   real    01m 33s    Add files only (ASCENDING order) ...just for ref.
#   real    02m 04s    Add files, and make 'rm' source (ASCENDING order) 
#                      Add files, and make 'rm' source (DESCENDING order) 
#   real    00m 01s    Count of filenames
#   real    00m 01s    List of filenames, one per line
#   ----    -------    ------
#   real    01m 34s    'rm -rf dir'
#   real    01m 33s    'rm filename' via rm1000filesPerCall   (1000 files per 'rm' call)
#   real    01m 40s    'rm filename' via  ASCENDING algorithm (1000 files per 'rm' call)
#   real    01m 46s    'rm filename' via DESCENDING algorithm (1000 files per 'rm' call)
#   real    21m 14s    'rm -r dir'
#   real    21m 27s    'find  dir -name "hello*" -print0 | xargs -0 -n 1000 rm'
#   real    21m 56s    'find  dir -name "hello*" -delete'
#   real    23m 09s    'find  dir -name "hello*" -print0 | xargs -0 -P 0 rm'
#   real    39m 44s    'rm filename' (one file per rm call) ASCENDING
#   real    47m 26s    'rm filename' (one file per rm call) UNSORTED
#                                                       
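
For reference, the fastest name-by-name rows above (1000 files per 'rm' call, names taken in ascending order) could be approximated with something along the lines of the sketch below. This is only a sketch, not the posted test script; the dir name and the hello* pattern are taken from the table.

    # Sketch only: delete by name, 1000 names per 'rm' invocation,
    # with the names sorted back into ascending (creation) order.
    cd dir || exit 1
    ls -1 -f | grep '^hello' | sort | xargs -d '\n' -n 1000 rm --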

I recently created and deleted 10 million empty test files.
Deleting the files on a name-by-name basis (i.e. rm filename), I found out the hard way that there is a huge time difference between two different methods…

Both methods use the exact same rm filename command.

Update: as it turns out, the commands were not exactly the same… one of them was sending 1000 filenames at a time to 'rm'. It was a shell brace-expansion issue: I thought each filename was being written to the feeder file on a line of its own, but it was actually 1000 per line.
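
For illustration, the pitfall looks something like this (hypothetical feeder-file names and a hello{...} brace expansion, not the original script): echo joins the whole expansion onto one line, while printf '%s\n' writes one name per line, so a later while read loop hands 'rm' either 1000 names or a single name per call.

    # One line containing 1000 space-separated names:
    echo hello{0001..1000} >> feeder.wrong

    # 1000 lines, one name per line:
    printf '%s\n' hello{0001..1000} >> feeder.right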

The filenames are provided via a 'feeder file' into a while read loop.
The feeder file is the output of ls -1 -f
The methods are identical in all respects, except for one thing (see the sketch after this list):

  • the slow method uses the unsorted feeder file direct from ls -1 -f
  • the fast method uses a sorted version of that same unsorted file
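
A minimal sketch of what the two loops look like, assuming a directory called dir and the hello* names from the table (the actual scripts are the ones on paste.ubuntu.com):

    # Slow: feeder file used exactly as 'ls -1 -f' emits it (raw directory order)
    ls -1 -f dir | grep '^hello' > feeder
    while read -r name; do rm -- "dir/$name"; done < feeder

    # Fast: the same feeder file, sorted first (which happens to match the creation order)
    sort feeder > feeder.sorted
    while read -r name; do rm -- "dir/$name"; done < feeder.sorted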

I'm not sure whether the sorting itself is the issue here, or whether the sorted feeder file just happens to match the sequence in which the files were created (I used a simple ascending integer algorithm).

For 1 million files, the fast rm filename method is 60 times faster than the slow method… again, I don't know if this is a "sorting" issue or a behind-the-scenes hash-table issue… I suspect it is not a simple sorting issue, because why would ls -1 -f intentionally give me an unsorted listing of a freshly added "sorted" sequence of filenames…

I'm just wondering what is going on here, so it doesn't take me days (yes, days) to delete the next 10 million files 🙂 … I say "days" because I tried so many alternatives, and the times involved increase disproportionately with the number of files involved, so I've only tested 1 million in detail.

BTW: Deleting the files via the "sorted list" of names is actually faster than rm -rf by a factor of 2,
and rm -r was 30 times slower than the "sorted list" method.

… but is "sorted" the issue here? Or is it more related to a hashing (or whatever) method of storage used by ext4?

The thing that quite puzzles me is that each call to rm filename is unrelated to the previous one… (well, at least it is from the 'bash' perspective).

I'm using Ubuntu / bash / 'ext4' / SATA II drive.

Best Answer

rm -r is expected to be slow, as it's recursive: a depth-first traversal has to be made of the directory structure.

Now, how did you create 10 million files? Did you use some script which loops in some order (1.txt, 2.txt, 3.txt, …)? If yes, then those files may also have been allocated in the same order, in contiguous blocks on the HDD, so deleting in the same order will be faster.
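
(For concreteness, a creation loop of the kind being asked about might look like the following; this is purely a hypothetical example, not the asker's script.)

    # Create 1,000,000 empty files with zero-padded ascending names in ./dir
    mkdir -p dir
    for i in $(seq -w 1 1000000); do
        : > "dir/hello$i"
    done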

"ls -f" will enable -aU which lists in directory order which is again recursive.
