Removing all non-ascii characters from a workflow (file)

ascii, text processing

How would I remove all non-ascii characters from one file? Would there be a specific command to perform this?

grep --colour='auto' -P -n '[^\x00-\x7F]' /usr/local/...

I believe this finds the characters within the workflow, but how would I remove all the instances of the characters in question?

Best Answer

ASCII characters are the characters in the range from 0 to 177 octal (0 to 127 decimal), inclusive.

To delete characters outside of this range in a file, use

LC_ALL=C tr -dc '\0-\177' <file >newfile

The tr command is a utility that works on single characters, either substituting them with other single characters (transliteration), deleting them, or compressing runs of the same character into a single character.

The command above would read from file and write the modified content to newfile. The -d option to tr makes the utility delete characters (instead of transliterating them), and -c makes it consider characters outside the given interval (instead of inside).
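
As a rough illustration of these modes, the following sketch runs tr on a few made-up ASCII strings. The sample inputs are assumptions chosen for the demo, not taken from the question.

```shell
# Transliterate: map each lowercase letter to its uppercase counterpart.
printf 'hello\n' | tr 'a-z' 'A-Z'      # prints HELLO

# Delete (-d): remove every occurrence of the listed characters.
printf 'hello\n' | tr -d 'l'           # prints heo

# Squeeze (-s): collapse runs of the same listed character into one.
printf 'aaabbb\n' | tr -s 'ab'         # prints ab

# Delete the complement (-dc): keep only digits and newlines.
printf 'abc123\n' | tr -dc '0-9\n'     # prints 123
```

The last form is the one used above, with the set of characters to keep being the whole ASCII range.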

LC_ALL=C makes sure that every byte value makes up a valid character. Without it, some tr implementations would abort if they found sequences of bytes that don't form valid characters in the locale's character encoding.
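
A minimal sketch of the effect on UTF-8 input, assuming a sample string in which "é" and "ï" are each encoded as two bytes (written here as octal escapes). The function name strip_non_ascii is hypothetical and only wraps the command above so it can be reused in a pipeline.

```shell
# Hypothetical wrapper around the tr command from the answer:
# reads standard input, writes only bytes 0-177 (octal) to standard output.
strip_non_ascii() {
    LC_ALL=C tr -dc '\0-\177'
}

# "café naïve" in UTF-8: \303\251 is é, \303\257 is ï.
# In the C locale each of those bytes counts as its own character,
# and all four fall outside the ASCII range, so they are deleted.
printf 'caf\303\251 na\303\257ve\n' | strip_non_ascii   # prints: caf nave
```

Note that the characters are dropped, not replaced with an ASCII approximation, so "café" becomes "caf".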


To replace the original file with the modified one, use

LC_ALL=C tr -dc '\0-\177' <file >newfile &&
mv newfile file

This renames the new file to the name of the old file after tr has completed successfully. If tr does not complete successfully, either because it could not read the original file or because it could not write to the new file, the original file is left unchanged.
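
The same write-then-rename pattern could be wrapped in a small function. This is only a sketch: the name strip_file is hypothetical, and it assumes a mktemp utility that accepts a template argument (as the GNU and BSD versions do), which is not something the answer itself relies on.

```shell
# Hypothetical helper: strip non-ASCII bytes from the file named in $1.
# The original is only replaced if tr finished without error.
strip_file() {
    # Create the temporary file next to the original so that mv is a
    # rename on the same filesystem. (tmp is a global variable in POSIX sh.)
    tmp=$(mktemp "$1.XXXXXX") || return 1

    if LC_ALL=C tr -dc '\0-\177' <"$1" >"$tmp"; then
        mv -- "$tmp" "$1"
    else
        # Reading or writing failed: discard the partial output and
        # leave the original file as it was.
        rm -f -- "$tmp"
        return 1
    fi
}
```

It would be called as strip_file somefile. Since the result is a newly created file, it carries the new file's permissions and ownership, not those of the original.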

Alternatively, to preserve as much as possible of the metadata (permissions etc.) of the original file, use

cp file tmpfile &&
LC_ALL=C tr -dc '\0-\177' <tmpfile >file &&
rm tmpfile
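
To confirm that a file is clean, before or after running the commands above, one possible check is to invert the deletion: delete every ASCII byte and count what is left. This is a sketch, and count_non_ascii is a hypothetical helper name.

```shell
# Hypothetical helper: print the number of non-ASCII bytes in the file $1.
# tr -d (without -c) deletes the ASCII range, so only bytes above 177
# (octal) reach wc. Some wc implementations pad the number with spaces.
count_non_ascii() {
    LC_ALL=C tr -d '\0-\177' <"$1" | wc -c
}
```

Running count_non_ascii file prints 0 for a file that contains only ASCII bytes.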