I have a large UTF-8 text file which I frequently search with grep. Recently grep began reporting that it was a binary file. I can continue to search it with grep -a, but I was wondering what change made it decide that the file was now binary.

I have a copy from last month which is not detected as binary, but it's not practical to diff them since they differ on more than 20,000 lines.
file identifies my file as:
UTF-8 Unicode English text, with very long lines
How can I find the characters/lines/etc. in my file which are triggering this change?
The similar, non-duplicate question 19907 covers the possibility of NUL, but grep -Pc '[\x00-\x1F]' says that I don't have NUL or any other ASCII control characters.
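For anyone hitting the same problem, a couple of diagnostic commands can narrow this down. This is a sketch assuming GNU grep; bigfile.txt is a stand-in for the actual file:

```shell
# Hypothetical filename; requires GNU grep.
# 1) Lines containing a NUL byte (-a forces text mode, -n shows line numbers):
grep -naP '\x00' bigfile.txt

# 2) Lines that are not valid UTF-8: in a UTF-8 locale, '.' only matches
#    valid characters, so inverting a whole-line match (-x -v) prints any
#    line containing an invalid byte sequence.
LC_ALL=en_US.UTF-8 grep -naxv '.*' bigfile.txt
```

Newer versions of GNU grep also classify files containing byte sequences that are invalid in the current locale's encoding as binary, so the second check is worth running even when no NUL is found.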
Best Answer
It appears to be the presence of the null character in the file (usually displayed as ^@). I entered various control characters into a text file (delete, ^?, for example), and only the null character caused grep to consider it a binary file. This was only tested with grep; other commands such as less and diff may use different heuristics. Control characters generally don't appear in text files, with the exception of the whitespace characters: tab (^I), newline (^J), vertical tab (^K), formfeed (^L), and carriage return (^M).
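This is easy to reproduce with throwaway files (a sketch assuming GNU grep; the file names are made up):

```shell
# A tab is whitespace, so the file is still treated as text:
printf 'hello\tworld\n' > ctrl.txt
grep hello ctrl.txt            # prints the matching line

# A NUL byte makes grep treat the file as binary:
printf 'hello\000world\n' > nul.txt
grep hello nul.txt             # prints a "binary file matches" notice
                               # (exact wording varies by grep version)
grep -a hello nul.txt          # -a forces text mode and prints the line
```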
However, non-ASCII characters, such as Arabic or Chinese letters, are legitimately encoded as bytes outside the standard ASCII range, and a heuristic that flagged arbitrary unusual bytes might misclassify such text. Perhaps that's why only the null character is treated as a marker of binary data.
You can test this yourself by inserting control characters into a text file using the text editor vim: enter insert mode, press Ctrl-V, and then type the control character.
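The same experiment can be scripted without an editor, probing each control byte in turn (a sketch assuming GNU grep; probe.txt is a scratch file). Since grep -I treats binary files as non-matching, its exit status reveals how each file was classified:

```shell
# Write one control byte per iteration and see whether grep -I still matches.
for i in $(seq 0 31); do
  oct=$(printf '%03o' "$i")
  printf "x\\${oct}y\n" > probe.txt     # scratch file: "x", the byte, "y"
  if grep -qI x probe.txt; then         # -I: assume binary files don't match
    status=text
  else
    status=binary
  fi
  printf 'byte 0x%02X: %s\n' "$i" "$status"
done
```

On the grep builds I have tried, only byte 0x00 comes back as binary, consistent with the observation above.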