Bash – Why is deleting files by name painfully slow and also exceptionally fast

bash, ext4, filesystems

Faux pas: The "fast" method I mention below is not 60 times faster than the slow one; it is 30 times faster. I'll blame the mistake on the hour (3 AM is not my best time of day for clear thinking :).

Update: I've added a summary of test times (below).
There seem to be two issues involved with the speed factor:

  • The choice of command used (Time comparisons shown below)
  • The nature of large numbers of files in a directory… It seems that "big is bad": things get disproportionately slower as the numbers increase.

All the tests have been done with 1 million files.
(real, user, and sys times are in the test scripts)
The test scripts can be found at paste.ubuntu.com

#
# 1 million files           
# ===============
#
#  |time   |new dir   |Files added in  ASCENDING order  
#  +----   +-------   +------------------------------------------------- 
#   real    01m 33s    Add files only (ASCENDING order) ...just for ref.
#   real    02m 04s    Add files, and make 'rm' source (ASCENDING order) 
#                      Add files, and make 'rm' source (DESCENDING order) 
#   real    00m 01s    Count of filenames
#   real    00m 01s    List of filenames, one per line
#   ----    -------    ------
#   real    01m 34s    'rm -rf dir'
#   real    01m 33s    'rm filename' via rm1000filesPerCall   (1000 files per 'rm' call)
#   real    01m 40s    'rm filename' via  ASCENDING algorithm (1000 files per 'rm' call)
#   real    01m 46s    'rm filename' via DESCENDING algorithm (1000 files per 'rm' call)
#   real    21m 14s    'rm -r dir'
#   real    21m 27s    'find  dir -name "hello*" -print0 | xargs -0 -n 1000 rm'
#   real    21m 56s    'find  dir -name "hello*" -delete'
#   real    23m 09s    'find  dir -name "hello*" -print0 | xargs -0 -P 0 rm'
#   real    39m 44s    'rm filename' (one file per rm call) ASCENDING
#   real    47m 26s    'rm filename' (one file per rm call) UNSORTED
#                                                       
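
For reference, the fastest name-by-name rows above (1000 files per 'rm' call, names taken in ascending order) could be approximated with something along the lines of the sketch below. This is only a sketch, not the posted test script; the dir name and the hello* pattern are taken from the table.

    # Sketch only: delete by name, 1000 names per 'rm' invocation,
    # with the names sorted back into ascending (creation) order.
    cd dir || exit 1
    ls -1 -f | grep '^hello' | sort | xargs -d '\n' -n 1000 rm --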

I recently created and deleted 10 million empty test files.
Deleting the files on a name-by-name basis (i.e. rm filename), I found out the hard way that there is a huge time difference between two different methods…

Both methods use the exact same rm filename command.

Update: as it turns out, the commands were not exactly the same… one of them was sending 1000 filenames at a time to 'rm'. It was a shell brace-expansion issue: I thought each filename was being written to the feeder file on a line of its own, but it was actually 1000 per line.
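
For illustration, the pitfall looks something like this (hypothetical feeder-file names and a hello{...} brace expansion, not the original script): echo joins the whole expansion onto one line, while printf '%s\n' writes one name per line, so a later while read loop hands 'rm' either 1000 names or a single name per call.

    # One line containing 1000 space-separated names:
    echo hello{0001..1000} >> feeder.wrong

    # 1000 lines, one name per line:
    printf '%s\n' hello{0001..1000} >> feeder.right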

The filenames are provided via a 'feeder file' into a while read loop.
The feeder file is the output of ls -1 -f
The methods are identical in all respects, except for one thing (see the sketch after this list):

  • the slow method uses the unsorted feeder file direct from ls -1 -f
  • the fast method uses a sorted version of that same unsorted file
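
A minimal sketch of what the two loops look like, assuming a directory called dir and the hello* names from the table (the actual scripts are the ones on paste.ubuntu.com):

    # Slow: feeder file used exactly as 'ls -1 -f' emits it (raw directory order)
    ls -1 -f dir | grep '^hello' > feeder
    while read -r name; do rm -- "dir/$name"; done < feeder

    # Fast: the same feeder file, sorted first (which happens to match the creation order)
    sort feeder > feeder.sorted
    while read -r name; do rm -- "dir/$name"; done < feeder.sorted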

I'm not sure whether the sorting itself is the issue here, or whether the sorted feeder file just happens to match the sequence in which the files were created (I used a simple ascending integer algorithm).

For 1 million files, the fast rm filename method is 60 times faster than the slow method… again, I don't know if this is a "sorting" issue or a behind-the-scenes hash-table issue… I suspect it is not a simple sorting issue, because why would ls -1 -f intentionally give me an unsorted listing of a freshly added "sorted" sequence of filenames…

I'm just wondering what is going on here, so it doesn't take me days (yes, days) to delete the next 10 million files 🙂 … I say "days" because I tried so many alternatives, and the times involved increase disproportionately with the number of files involved, so I've only tested 1 million in detail.

BTW: Deleting the files via the "sorted list" of names is actually faster than rm -rf by a factor of 2,
and rm -r was 30 times slower than the "sorted list" method.

… but is "sorted" the issue here? Or is it more related to a hashing (or whatever) method of storage used by ext4?

The thing that quite puzzles me is that each call to rm filename is unrelated to the previous one… (well, at least it is from the 'bash' perspective).

I'm using Ubuntu / bash / 'ext4' / SATA II drive.

Best Answer

rm -r is expected to be slow, as it's recursive: a depth-first traversal has to be made of the directory structure.

Now, how did you create 10 million files? Did you use some script which loops in some order (1.txt, 2.txt, 3.txt, …)? If yes, then those files may also have been allocated in the same order, in contiguous blocks on the HDD, so deleting in the same order will be faster.
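
(For concreteness, a creation loop of the kind being asked about might look like the following; this is purely a hypothetical example, not the asker's script.)

    # Create 1,000,000 empty files with zero-padded ascending names in ./dir
    mkdir -p dir
    for i in $(seq -w 1 1000000); do
        : > "dir/hello$i"
    done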

"ls -f" will enable -aU which lists in directory order which is again recursive.
