Files – How to Classify Files as Binary or Text?


Standard Unix utilities like grep and diff use some heuristic to classify files as "text" or "binary". (E.g. grep's output may include lines like Binary file frobozz matches.)

Is there a convenient test one can apply in a zsh script to perform a similar "text/binary" classification? (Other than something like grep '' somefile | grep -q Binary.)

(I realize that any such test would necessarily be heuristic, and therefore imperfect.)

Best Answer

If you ask file for just the MIME type, you'll get many different ones, such as text/x-shellscript and application/x-executable, but I imagine that if you just check for the "text" part you should get good results. E.g. (-b suppresses the filename in the output):

file -b --mime-type filename | sed 's|/.*||'
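To use the command above as a yes/no test in a script, one option is to wrap it in a small function. The following is only a sketch: the name is_text is made up for illustration, and it uses plain POSIX shell syntax so that it should behave the same in zsh, bash, and sh. It matches the text/ prefix with a case pattern instead of sed.

```shell
# Hypothetical helper (name chosen for illustration): succeeds when
# file(1) reports a MIME type under text/ for the path given as $1.
is_text() {
    case "$(file -b --mime-type -- "$1")" in
        text/*) return 0 ;;   # text/plain, text/x-shellscript, ...
        *)      return 1 ;;   # application/*, image/*, inode/*, ...
    esac
}
```

It could then be used as: if is_text somefile; then echo text; else echo binary; fi. Bear in mind that some text-based formats may be reported under application/ rather than text/ (recent versions of file do this for JSON, for example), so the heuristic is conservative.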

Related Solutions

Which extension to use for text files? (Unix/Linux)

UNIX/Linux does not have the same early DOS / CP/M heritage that Windows does. So extensions are generally less significant to most UNIX utilities and tools.

I usually use a command-line only environment. Extensions in such an environment under Linux aren't really significant except as a convenience to the operator or user. (I don't have enough experience with KDE or GNOME to know how their file managers deal with extensions.)

But such convenience is usually important. If config.ini is really in Microsoft-standard ".ini" format, I'd let the extension stand. Plain old text files usually carry no extension in Linux, but this isn't universal for all programs' configuration files. The programmer usually gets to decide that.

I think ".txt" is useful under Linux if you want to emphasize that it's NOT a configuration file or other machine-readable document. However, in source distributions, the convention is to name such files in all caps without an extension (e.g. README, INSTALL, COPYING).

There are some standards and conventions but nothing stopping you from naming anything whatever you want, unless you are sharing things with others.

In Windows, naming a file .exe indicates to the shell (usually explorer.exe) that it's an executable file. UNIX builds this knowledge into the file system's permissions. If the proper x bits (see man chmod) are set, it is recognized as executable by shells and kernel functions (I believe). Beyond this, Linux doesn't care, most shells won't care, and most programs look in the file to find its "type."
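As a small illustration of the point about x bits, here is a sketch of how a script might ask whether a file would be treated as executable, regardless of its name. The function name is_executable is made up for this example; the tests themselves ([ -f ] and [ -x ]) are standard shell.

```shell
# Hypothetical helper: succeeds when $1 is a regular file that the
# current user is permitted to execute. The file name and extension
# play no part in the decision; only the permission bits do.
is_executable() {
    [ -f "$1" ] && [ -x "$1" ]
}
```

For instance, a script saved as myscript.txt (a made-up name) would pass this check after chmod u+x myscript.txt, despite its extension.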

Of course, there's the nice command file, which can analyze the file and tell you what it is with a degree of certainty. I believe if it can't match the data in the file with any known type, and if it contains only printable ASCII/Unicode characters, then it assumes it's a text file.

@Bruce Ediger below is absolutely correct. There is nothing at the kernel or filesystem level, i.e. in Linux itself, enforcing or caring that the contents of a file need to match up with its name, or with the program that is supposed to understand it. This doesn't mean it's not possible to create a shell or launcher utility to do things based on filename.

Finding Non-Binary Files – How to Find All Non-Binary Files

I'd use file and pipe the output into grep or awk to find text files, then extract just the filename portion of file's output and pipe that into xargs.

something like:

file * | awk -F: '/ASCII text/ {print $1}' | xargs -d'\n' -r flip -u

Note that the awk pattern matches 'ASCII text' rather than just any 'text' - you probably don't want to mess with Rich Text documents or Unicode text files etc.

You can also use find (or whatever) to generate a list of files to examine with file:

find /path/to/files -type f -exec file {} + | \
  awk -F: '/ASCII text/ {print $1}' | xargs -d'\n' -r flip -u

The -d'\n' argument makes xargs treat each input line as a separate argument, thus catering for filenames with spaces and other problematic characters. I.e. it's an alternative to xargs -0 for when the input source doesn't or can't generate NUL-separated output (as find's -print0 option does). According to the changelog, xargs got the -d/--delimiter option in Sep 2005, so it should be in any non-ancient Linux distro (I wasn't sure, which is why I checked - I just vaguely remembered it was a "recent" addition).

Note that a linefeed is a valid character in filenames, so this will break if any filenames have linefeeds in them. For typical unix users, this is pathologically insane, but isn't unheard of if the files originated on Mac or Windows machines.
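If filenames containing linefeeds (or colons, which would confuse awk -F:) are a real concern, one possible workaround is to ask file about each path individually and emit NUL-terminated names, so that nothing has to be parsed back out of file's output. The following is a sketch under that assumption, not a drop-in replacement: the function name list_ascii_text is made up, and running file once per path is slower than the batched pipelines above.

```shell
# Hypothetical helper: print, NUL-terminated, every regular file under
# directory $1 whose file(1) description contains "ASCII text".
list_ascii_text() {
    find "$1" -type f -exec sh -c '
        for f do
            # -b drops the filename, so only the description is matched.
            case "$(file -b -- "$f")" in
                *"ASCII text"*) printf "%s\0" "$f" ;;
            esac
        done' sh {} +
}
```

The output can be fed to xargs -0, for example: list_ascii_text /path/to/files | xargs -0 -r flip -u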

Also note that file is not perfect. It's very good at detecting the type of data in a file but can occasionally get confused.

I have used numerous variations of this method many times in the past with success.
