Shell – Filter Files Generated by Find Command Using Parsed Output

Tags: awk, find, shell

I'm writing a quick tool to inspect the contents of a Node.js node_modules folder or a Python virtualenv for native dependencies. As a quick first approximation, I wrote the following command.

find . | xargs file | awk '/C source/ {print $1} /ELF/ {print $1}'

I'm okay with false positives but not false negatives (e.g. files that literally contain the string "ELF" or "C source" may be marked suspicious). However, this script also potentially breaks on long file names (because xargs will split them), on file names containing spaces (because awk splits on whitespace), and on file names containing newlines (because find uses newlines to separate paths).

Is there a way to filter the paths generated by find by seeing if the output of file {} (possibly with some additional options to remove the path entirely from the output of file) matches a particular regular expression?
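For reference, the splitting problems alone can be sidestepped with a null-delimited pipeline. This is a sketch assuming bash and GNU or BSD find, since neither -print0 nor read -d '' is POSIX:

```shell
# Null-delimited sketch: find emits NUL-terminated paths and the bash
# loop reads them back, so spaces and newlines in names survive intact.
find . -type f -print0 |
    while IFS= read -r -d '' f; do
        file -b "$f" | grep -qE '^(ELF|C source)' && printf '%s\n' "$f"
    done
```

This still spawns file once per path; the answer below addresses both correctness and how to batch the work.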

Best Answer

The key factor in reaching find enlightenment ;) is:

find's business is evaluating expressions -- not locating files. Yes, find certainly locates files; but that's really just a side effect.

--Unix Power Tools

There is an alternative approach to this question that is worth knowing about (also described in Unix Power Tools, in the section "Using -exec to Create Custom Tests"):

find . -type f -exec sh -c 'file -b "$1" | grep -iqE "^ELF|^C source"' sh {} \; -print

This filtering method is useful for much more than simply printing file names: just change the -print operator to any other operator you like (including another -exec operator) and do what you like with the matched files.
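For instance, swapping -print for a second -exec copies each match instead of printing it. The /tmp/native-deps destination here is hypothetical; pick your own:

```shell
# Collect every native-looking file into one directory (hypothetical
# destination). The first -exec is the test; cp runs only on matches.
mkdir -p /tmp/native-deps
find . -type f \
    -exec sh -c 'file -b "$1" | grep -iqE "^ELF|^C source"' sh {} \; \
    -exec cp {} /tmp/native-deps/ \;
```

Note that matching files with the same base name will overwrite one another in the destination.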


This command has a performance drawback (which is also present in the other answer): since we are using \; and not +, we spawn a new shell for every single file. Using + to pass multiple files at once to the sh command and processing them with a for loop gives a noticeable performance advantage:

find . -exec sh -c 'for f do file -b "$f" | grep -qE "^ELF|^C source" && printf %s\\n "$f"; done' sh {} +
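A note on the syntax: `for f do` is short for `for f in "$@"`, i.e. the loop runs over the positional parameters, which are the batch of file names that find appends after the trailing sh (that sh becomes $0 inside the script). Spelled out:

```shell
# Same command with the loop written out: the file names passed by
# -exec ... + arrive as "$@" inside the sh -c script.
find . -exec sh -c '
    for f in "$@"; do
        file -b "$f" | grep -qE "^ELF|^C source" && printf "%s\n" "$f"
    done
' sh {} +
```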

You can see the comparison for yourself by running both of the following commands and comparing the output of time:

time find . -exec sh -c 'for f do file -b "$f" | grep -qE "^ELF|^C source" && printf %s\\n "$f"; done' sh {} +
time find . -exec sh -c 'file -b "$1" | grep -qE "^ELF|^C source" && printf %s\\n "$1"' sh {} \;

The real point, though, is:

Never run a shell for loop on a list of files that is output from find. Instead, either run the action you need to do on each file directly within find by using the -exec operator, or embed a shell for loop within a find command and do it that way.
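As a sketch of the difference (printf stands in here for whatever per-file action you actually need):

```shell
# Fragile: the unquoted $(find ...) undergoes word splitting and
# globbing, so names with whitespace or wildcard characters are mangled.
for f in $(find . -type f); do
    printf '%s\n' "$f"
done

# Robust: find hands the names to sh as arguments; no splitting occurs.
find . -type f -exec sh -c 'for f do printf "%s\n" "$f"; done' sh {} +
```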
