I have a large UTF-8 text file which I frequently search with grep. Recently grep began reporting that it was a binary file. I can continue to search it with grep -a, but I was wondering what change made it decide that the file was now binary.

I have a copy from last month which is not detected as binary, but it's not practical to diff them since they differ on more than 20,000 lines.
file identifies my file as:
UTF-8 Unicode English text, with very long lines
How can I find the characters/lines/etc. in my file which are triggering this change?
The similar, non-duplicate question 19907 covers the possibility of NUL, but grep -Pc '[\x00-\x1F]' says that I don't have NUL or any other ASCII control characters.
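For anyone hitting the same problem, a couple of diagnostic commands can narrow this down. This is a sketch assuming GNU grep; bigfile.txt is a stand-in for the actual file:

```shell
# Hypothetical filename; requires GNU grep.
# 1) Lines containing a NUL byte (-a forces text mode, -n shows line numbers):
grep -naP '\x00' bigfile.txt

# 2) Lines that are not valid UTF-8: in a UTF-8 locale, '.' only matches
#    valid characters, so inverting a whole-line match (-x -v) prints any
#    line containing an invalid byte sequence.
LC_ALL=en_US.UTF-8 grep -naxv '.*' bigfile.txt
```

Newer versions of GNU grep also classify files containing byte sequences that are invalid in the current locale's encoding as binary, so the second check is worth running even when no NUL is found.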
Best Answer
It appears to be the presence of the null character in the file (usually displayed as ^@). I entered various control characters into a text file (delete, ^?, for example), and only the null character caused grep to consider it a binary file. This was only tested with grep; other commands such as less and diff may use different heuristics. Control characters generally don't appear in text files, with the exception of the whitespace characters: tab (^I), newline (^J), vertical tab (^K), formfeed (^L), and carriage return (^M).
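This is easy to reproduce with throwaway files (a sketch assuming GNU grep; the file names are made up):

```shell
# A tab is whitespace, so the file is still treated as text:
printf 'hello\tworld\n' > ctrl.txt
grep hello ctrl.txt            # prints the matching line

# A NUL byte makes grep treat the file as binary:
printf 'hello\000world\n' > nul.txt
grep hello nul.txt             # prints a "binary file matches" notice
                               # (exact wording varies by grep version)
grep -a hello nul.txt          # -a forces text mode and prints the line
```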
However, non-ASCII characters, such as Arabic or Chinese letters, are legitimately encoded as bytes outside the standard ASCII range, and a heuristic that flagged arbitrary unusual bytes might misclassify such text. Perhaps that's why only the null character is treated as a marker of binary data.
You can test this yourself by inserting control characters into a text file using the text editor vim: enter insert mode, press Ctrl-V, and then type the control character.
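The same experiment can be scripted without an editor, probing each control byte in turn (a sketch assuming GNU grep; probe.txt is a scratch file). Since grep -I treats binary files as non-matching, its exit status reveals how each file was classified:

```shell
# Write one control byte per iteration and see whether grep -I still matches.
for i in $(seq 0 31); do
  oct=$(printf '%03o' "$i")
  printf "x\\${oct}y\n" > probe.txt     # scratch file: "x", the byte, "y"
  if grep -qI x probe.txt; then         # -I: assume binary files don't match
    status=text
  else
    status=binary
  fi
  printf 'byte 0x%02X: %s\n' "$i" "$status"
done
```

On the grep builds I have tried, only byte 0x00 comes back as binary, consistent with the observation above.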