macOS – How to search 360,000 files, about 280 GB

command-line, macos, search

I have a folder (technically in this case, a read-only mounted disk image) containing a ton of data I got by running Data Rescue (a data recovery app) on one of my large server drives. I did several different scan types and dumped all the files into one place. Data Rescue 'reconstructs' deleted files and often doesn't get it quite right. It can miscategorize the type of file it is, and it can mash separate files together.

I am looking for two specific PHP files (and maybe about five others, if I get lucky). Most of these recovered files don't have real names (just 0002, 0003, etc.), so I have to search by content.

I've come up with 6 different strings that should be able to identify these specific files. So I need a way to search the contents of files, not in an Apple "magic search" kinda way, but in an old school "manually read through every file looking for a string-match" kinda way.

grep sounds like the obvious choice, but it's been nothing but problems. grep can search recursively, and the macOS version can decompress gzip and bzip2 archives, which is all good. But after a few minutes of running, it starts streaming "too many open files" errors. I'm not sure why; it's as if grep doesn't close a file after opening it to search in it. I've also had issues with grep just stopping: not quitting, not crashing, not going unresponsive, just sitting idle when it should be searching, using no CPU and reading nothing from the disk.

I ALSO had trouble running multiple grep searches at once. grep reads input a line at a time, and a binary file like a disk image can go for gigabytes without a newline, so the whole thing effectively gets loaded into memory as one enormous "line" before it's searched. But only one file in this whole bundle is larger than the amount of RAM I have, so as long as I do one grep at a time, I should be fine.
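
In case it's relevant: my guess (unconfirmed) is that the "too many open files" errors mean grep is hitting the per-process file-descriptor limit, which defaults to a fairly low value on macOS. The limit can be checked and raised for the current shell before launching a search:

# Show the current per-process open-file limit for this shell
ulimit -n

# Raise it for this shell session only (just my guess at a workaround)
ulimit -n 4096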

This is the command I'm using (wrapped in a script that runs several searches to different output files, with some status output):
zfgrep -l -r -a -J -i -s -U -n "#32cd32" /Volumes/\'Storage\'\ Original\ Recovery > 32cd32.txt
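
For reference, the wrapper looks roughly like this; only "#32cd32" is one of my real search strings, the other patterns and output names below are placeholders:

#!/bin/sh
# Rough sketch of the wrapper: one zfgrep at a time, one output file per search string.
search() {
    echo "Searching for $1 ..."
    zfgrep -l -r -a -J -i -s -U -n "$1" "/Volumes/'Storage' Original Recovery" > "$2"
    echo "Finished searching for $1"
}

search "#32cd32"          32cd32.txt
search "placeholder-two"  placeholder-two.txt
search "placeholder-six"  placeholder-six.txt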

That zfgrep command will run for a while, then it hangs. I get some results, but not a full search. If I remove the -s, I get the flood of "too many open files" errors. Then, at someone else's suggestion, I tried using find to feed files to grep one at a time, like so:

find /Volumes/\'Storage\'\ Original\ Recovery -exec zfgrep -l -r -a -J -i -s -U -n "#32cd32" {} \; -print > 32cd32.txt

But that command has the exact same problems.

So this leaves me stuck. How can I search every single file on this disk image, including the archives and the binary data files that may have been incorrectly merged with plain-text files, for a few plain-text strings? This doesn't seem like that tough a task for a modern multicore computer with a current OS, lots of RAM, and an SSD.

I actually would prefer a GUI option, but at this point I'll take any solution that works.

Also, I originally started trying to do this with BBEdit, but it was skipping a LOT of file types even when told to search all files, including files that are XML-based. I was very surprised by this.

Best Answer

Using find ... -exec grep -r effectively traverses the whole directory tree several times: once as part of the find, and once more for each grep -r invocation. That is likely what leads to the errors you see. So you should get rid of either the find or the -r. Since you're using the grep part to identify the files to be collected, the one to drop in your case is probably the -r.
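
A hedged sketch of what that could look like (untested against your volume; I've also dropped the -n, which does nothing when -l is in effect, and the trailing -print, since -l already makes grep print the matching file names):

# No -r, so nothing gets traversed twice; -type f limits the search to regular files,
# and "{} +" hands files to zfgrep in batches instead of spawning one process per file.
find "/Volumes/'Storage' Original Recovery" -type f \
    -exec zfgrep -l -a -J -i -s -U "#32cd32" {} + > 32cd32.txt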