Removing all non-ascii characters from a workflow (file)

ascii text-processing

How would I remove all non-ascii characters from one file? Would there be a specific command to perform this?

grep --colour='auto' -P -n '[^\x00-\x7F]' /usr/local/...

I believe this finds the characters within the workflow, but how would I remove all the instances of the characters in question?

Best Answer

ASCII characters are the characters in the range from 0 to 177 octal (0 to 127 decimal), inclusive.

To delete characters outside of this range in a file, use

LC_ALL=C tr -dc '\0-\177' <file >newfile

The tr command is a utility that works on single characters, either substituting them with other single characters (transliteration), deleting them, or compressing runs of the same character into a single character.

The command above would read from file and write the modified content to newfile. The -d option to tr makes the utility delete characters (instead of transliterating them), and -c makes it consider characters outside the given interval (instead of inside).

LC_ALL=C makes sure that every byte value makes up a valid character. Without it, some tr implementations would abort if they found sequences of bytes that don't form valid characters in the locale's character encoding.
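
As a quick illustration of the command above, the following sketch feeds a short UTF-8 string through the same tr invocation. The function name strip_non_ascii and the sample text are made up for this example; only the tr command itself comes from the answer.

```shell
# strip_non_ascii is a hypothetical helper name, not part of the answer above.
# It reads standard input and writes only bytes 0-127 (octal 0-177) to
# standard output.
strip_non_ascii() {
    LC_ALL=C tr -dc '\0-\177'
}

# "café naïve" written with octal escapes: é and ï are two bytes each in UTF-8.
# Both bytes of each accented character are deleted; the newline is kept.
printf 'caf\303\251 na\303\257ve\n' | strip_non_ascii
# prints: caf nave
```

Note that the accented letters are dropped entirely rather than replaced by their unaccented forms.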

To replace the original file with the modified one, use

LC_ALL=C tr -dc '\0-\177' <file >newfile &&
mv newfile file

This renames the new file to the name of the old file after tr has completed successfully. If tr fails, either because it could not read the original file or because it could not write to the new file, the original file is left unchanged.

Alternatively, to preserve as much as possible of the metadata (permissions etc.) of the original file, write the result back into the original file instead of renaming a new file over it:

cp file tmpfile &&
LC_ALL=C tr -dc '\0-\177' <tmpfile >file &&
rm tmpfile
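
One way to check the result, sketched here under the assumption that counting leftover bytes is enough, is to invert the tr call: delete every ASCII byte and count what remains. The helper name count_non_ascii and the temporary demo file are invented for this example.

```shell
# count_non_ascii is a hypothetical helper: it deletes every ASCII byte from
# the named file and counts the bytes that are left, i.e. the non-ASCII bytes.
count_non_ascii() {
    LC_ALL=C tr -d '\0-\177' <"$1" | wc -c
}

# Demonstration on a throwaway file created just for this sketch.
demo=$(mktemp)
printf 'na\303\257ve\n' >"$demo"

count_non_ascii "$demo"    # 2: the two bytes of the UTF-8 encoded ï

# Clean the file as in the answer above, then count again.
LC_ALL=C tr -dc '\0-\177' <"$demo" >"$demo.new" && mv "$demo.new" "$demo"
count_non_ascii "$demo"    # 0

rm -f "$demo"
```

A count of 0 means the file contains only bytes in the ASCII range. Some wc implementations pad the number with leading spaces.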

Related Solutions

Grep for lines with all words greater than 10 characters in length

Your condition might be more easily expressed in the contrapositive: instead of including lines where all words have length > 10, exclude those lines which have a word with length <= 10. Since grep supports both negation and word-matching, this could be written as, say:

grep -vwE '\w{1,10}'

-v negates the match
-w means that the regex should match a whole word

As Sundeep noted, we should use {1,10} to avoid matching the empty string (and thus every line).
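
A small sketch of the command on sample input, assuming a grep that supports \w in extended regular expressions (GNU grep does; this is not guaranteed by POSIX). The function name long_words_only and the three input lines are invented for illustration.

```shell
# long_words_only is a hypothetical wrapper around the grep command above.
# It reads lines on standard input and drops every line that contains a
# whole word of 1 to 10 characters.
long_words_only() {
    grep -vwE '\w{1,10}'
}

# Only the first line consists entirely of words longer than 10 characters.
printf '%s\n' \
    'internationalization extraordinarily' \
    'internationalization cat' \
    'short words only' |
    long_words_only
```

With GNU grep, only the first sample line should be printed, since the other two each contain at least one word of 10 characters or fewer.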

Text Processing – Remove Accents from Characters

You can try iconv with the //TRANSLIT (transliteration) option.

For example, given

$ cat file
ë
ê
Ý,text
Ò
É

then

$ iconv -t ASCII//TRANSLIT file
e
e
Y,text
O
E
