Finding Non-Binary Files – How to Find All Non-Binary Files

Tags: files, find, newlines, text

Is it possible to use the find command to find all the "non-binary" files in a directory? Here's the problem I'm trying to solve.

I've received an archive of files from a Windows user. This archive contains source code and image files. Our build system doesn't play nice with files that have Windows line endings. I have a command line program (flip -u) that will flip line endings between *nix and Windows. So, I'd like to do something like this:

find . -type f | xargs flip -u

However, if this command is run against an image file, or other binary media file, it will corrupt the file. I realize I could build a list of file extensions and filter with that, but I'd rather have something that's not reliant on me keeping that list up to date.

So, is there a way to find all the non-binary files in a directory tree? Or is there an alternate solution I should consider?

Best Answer

I'd use file and pipe the output into grep or awk to find text files, then extract just the filename portion of file's output and pipe that into xargs.

Something like:

file * | awk -F: '/ASCII text/ {print $1}' | xargs -d'\n' -r flip -u

Note that the awk pattern matches 'ASCII text' rather than just 'text' - you probably don't want to mess with Rich Text documents or Unicode text files etc.
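To see what that pattern does and doesn't select, here's a small sketch that runs the same awk filter over a few hand-written lines in the shape of file's output. The sample filenames and descriptions are made up for illustration (real descriptions vary by file version), and sample_file_output and filter_ascii are just hypothetical helper names:

```shell
# Hand-written stand-ins for `file` output; real descriptions vary by version.
sample_file_output() {
  printf '%s\n' \
    'main.c: C source, ASCII text, with CRLF line terminators' \
    'readme.txt: UTF-8 Unicode text' \
    'logo.png: PNG image data, 64 x 64, 8-bit/color RGBA, non-interlaced'
}

# Same filter as in the answer: for lines whose description mentions
# "ASCII text", print the field before the first colon (the filename).
filter_ascii() {
  awk -F: '/ASCII text/ {print $1}'
}

sample_file_output | filter_ascii                  # prints only main.c
sample_file_output | awk -F: '/text/ {print $1}'   # also prints readme.txt
```

Because awk splits on every colon, a filename that itself contains a colon would be cut short by this filter.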

You can also use find (or whatever) to generate a list of files to examine with file:

find /path/to/files -type f -exec file {} + | \
  awk -F: '/ASCII text/ {print $1}' | xargs -d'\n' -r flip -u

The -d'\n' argument makes xargs treat each input line as a separate argument, which caters for filenames containing spaces and other problematic characters. In other words, it's an alternative to xargs -0 for when the input source doesn't or can't generate NUL-separated output (the kind that find's -print0 option produces). According to the changelog, GNU xargs got the -d/--delimiter option in September 2005, so it should be in any non-ancient Linux distro (I wasn't sure, which is why I checked - I just vaguely remembered it was a "recent" addition).
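A quick way to convince yourself of the difference, assuming GNU xargs (for -d and -r). The two "filenames" here are invented and names is a hypothetical helper; printf stands in for flip so nothing gets modified:

```shell
# Two made-up filenames, one containing a space, emitted one per line.
names() { printf '%s\n' 'my file.txt' 'other.txt'; }

# Default splitting: any whitespace separates arguments,
# so 'my file.txt' is torn into two arguments (three items printed).
names | xargs printf '<%s>\n'

# With -d'\n' (GNU xargs): one argument per input line,
# so the space survives (two items printed).
names | xargs -d'\n' -r printf '<%s>\n'
```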

Note that a linefeed is a valid character in filenames, so this will break if any filenames have linefeeds in them. For typical Unix users, this is pathologically insane, but it isn't unheard of if the files originated on Mac or Windows machines.
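If linefeeds in filenames are a real concern, one way around the line-based parsing is to ask file about each file separately and emit NUL-terminated names for xargs -0. This is only a rough sketch, assuming a file that supports -b and --mime-type and an xargs with -0 and -r. list_text_files is a hypothetical helper name. Matching text/* is also a different test from 'ASCII text': it will include UTF-8 text, for instance, and some textual formats may be reported under other MIME types, so adjust to taste:

```shell
# Print the regular files under directory $1 that `file` reports as text/*,
# each name terminated by a NUL byte so that any filename survives the pipe.
list_text_files() {
  find "$1" -type f -exec sh -c '
    for f do
      case "$(file -b --mime-type -- "$f")" in
        text/*) printf "%s\000" "$f" ;;
      esac
    done' sh {} +
}

# Usage (not run here): convert everything it finds.
# list_text_files . | xargs -0 -r flip -u
```

This runs file once per file, so it will be slower than the batch version on large trees.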

Also note that file is not perfect. It's very good at detecting the type of data in a file but can occasionally get confused.
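Since file can misjudge things, it's cheap to look at what it thinks is in the tree before letting anything rewrite files. A sketch using only standard tools (summarize_types is a hypothetical helper name):

```shell
# Count how many files under directory $1 fall under each description
# that `file` gives. -b drops the filename so identical descriptions group.
summarize_types() {
  find "$1" -type f -exec file -b {} + | sort | uniq -c | sort -rn
}

# Usage (not run here):
# summarize_types /path/to/files
```

Anything surprising in that list (say, source files reported as "data") is worth checking by hand before running the conversion.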

I have used numerous variations of this method many times in the past with success.