Shell – Filter Files Generated by Find Command Using Parsed Output

Tags: awk, find, shell

I'm writing a quick tool to inspect the contents of a Node.js node_modules folder or a Python virtualenv for native dependencies. As a quick first approximation, I wrote the following command.

find . | xargs file | awk '/C source/ {print $1} /ELF/ {print $1}'

I'm okay with false positives but not false negatives (e.g. files that literally contain the string "ELF" or "C source" may be marked suspicious). However, this script also potentially breaks on long file names (because xargs will split them), on file names containing spaces (because awk splits on whitespace), and on file names containing newlines (because find uses newlines to separate paths).

Is there a way to filter the paths generated by find by seeing if the output of file {} (possibly with some additional options to remove the path entirely from the output of file) matches a particular regular expression?
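For reference, the splitting problems alone can be sidestepped with a null-delimited pipeline. This is a sketch assuming bash and GNU or BSD find, since neither -print0 nor read -d '' is POSIX:

```shell
# Null-delimited sketch: find emits NUL-terminated paths and the bash
# loop reads them back, so spaces and newlines in names survive intact.
find . -type f -print0 |
    while IFS= read -r -d '' f; do
        file -b "$f" | grep -qE '^(ELF|C source)' && printf '%s\n' "$f"
    done
```

This still spawns file once per path; the answer below addresses both correctness and how to batch the work.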

Best Answer

The key factor in reaching find enlightenment ;) is:

find's business is evaluating expressions -- not locating files. Yes, find certainly locates files; but that's really just a side effect.

--Unix Power Tools

There is an alternative approach to this question that is worth knowing about (also described in Unix Power Tools, in the section "Using -exec to Create Custom Tests"):

find . -type f -exec sh -c 'file -b "$1" | grep -iqE "^ELF|^C source"' sh {} \; -print

This filtering method is useful for much more than simply printing file names: just change the -print operator to any other operator you like (including another -exec operator) and do what you like with the matched files.
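For instance, swapping -print for a second -exec copies each match instead of printing it. The /tmp/native-deps destination here is hypothetical; pick your own:

```shell
# Collect every native-looking file into one directory (hypothetical
# destination). The first -exec is the test; cp runs only on matches.
mkdir -p /tmp/native-deps
find . -type f \
    -exec sh -c 'file -b "$1" | grep -iqE "^ELF|^C source"' sh {} \; \
    -exec cp {} /tmp/native-deps/ \;
```

Note that matching files with the same base name will overwrite one another in the destination.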


This command has a performance drawback (which is also present in the other answer): since we are using \; and not +, we spawn a new shell for every single file. Using + to pass multiple files at once to the sh command and processing them with a for loop gives a noticeable performance advantage:

find . -exec sh -c 'for f do file -b "$f" | grep -qE "^ELF|^C source" && printf %s\\n "$f"; done' sh {} +
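A note on the syntax: `for f do` is short for `for f in "$@"`, i.e. the loop runs over the positional parameters, which are the batch of file names that find appends after the trailing sh (that sh becomes $0 inside the script). Spelled out:

```shell
# Same command with the loop written out: the file names passed by
# -exec ... + arrive as "$@" inside the sh -c script.
find . -exec sh -c '
    for f in "$@"; do
        file -b "$f" | grep -qE "^ELF|^C source" && printf "%s\n" "$f"
    done
' sh {} +
```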

You can see the comparison for yourself by running both of the following commands and comparing the output of time:

time find . -exec sh -c 'for f do file -b "$f" | grep -qE "^ELF|^C source" && printf %s\\n "$f"; done' sh {} +
time find . -exec sh -c 'file -b "$1" | grep -qE "^ELF|^C source" && printf %s\\n "$1"' sh {} \;

The real point, though, is:

Never run a shell for loop on a list of files that is output from find. Instead, either run the action you need to do on each file directly within find by using the -exec operator, or embed a shell for loop within a find command and do it that way.
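As a sketch of the difference (printf stands in here for whatever per-file action you actually need):

```shell
# Fragile: the unquoted $(find ...) undergoes word splitting and
# globbing, so names with whitespace or wildcard characters are mangled.
for f in $(find . -type f); do
    printf '%s\n' "$f"
done

# Robust: find hands the names to sh as arguments; no splitting occurs.
find . -type f -exec sh -c 'for f do printf "%s\n" "$f"; done' sh {} +
```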
