Recursively remove all invalid characters from text files in place

Tags: command-line, terminal, utf-8

I have several thousand text files, some of which contain invalid UTF-8 characters. I want to recursively remove all invalid characters from these files in place.

I am aware that many similar questions have been asked before, such as "How to remove non UTF-8 characters from text file", but I have not found one that is both recursive and operates in place.

Best Answer

The great thing about UNIX commands is that you can combine them. iconv doesn't know how to recurse into directories, but find does, and it can call iconv on every file it finds.
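The core of the pipeline is iconv itself: with -c it silently discards any input it cannot convert, so converting from UTF-8 to UTF-8 strips exactly the invalid byte sequences. On a single file it looks like this (the filenames are just for illustration):

# -c drops bytes that are not valid UTF-8 instead of aborting;
# the cleaned copy goes to clean.txt
iconv -f utf-8 -t utf-8 -c dirty.txt > clean.txt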

(These commands convert every matching file in the current directory and in all directories beneath it, so make sure you are in the directory whose files you want to convert before running them.)

To change all files with the extension .txt:

find . -type f -name '*.txt' -print0 |
    while IFS= read -r -d '' filename; do
        iconv -f utf-8 -t utf-8 -c "$filename" > "$filename.iconv_cleaned_utf8" &&
            mv "$filename.iconv_cleaned_utf8" "$filename"
    done

I suppose this code requires some explanation. What it does is:

  • find prints the names of all matching files, separated by null bytes (the null byte is the only character that cannot appear in a file path)
  • bash reads the null-delimited filenames and loops over them
  • iconv converts each file into a tempfile with an extra extension, dropping anything it cannot convert
  • mv replaces the original file with the tempfile, but only if iconv succeeded (that is what the && is for)
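As an aside, you can get the same effect without a shell loop by letting find run a small inline script itself. A sketch of that variant, using the standard find -exec sh -c idiom:

find . -type f -name '*.txt' -exec sh -c '
    # the trailing "sh" becomes $0; the matched filenames arrive in "$@"
    for f in "$@"; do
        iconv -f utf-8 -t utf-8 -c "$f" > "$f.iconv_cleaned_utf8" &&
            mv "$f.iconv_cleaned_utf8" "$f"
    done
' sh {} +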

If the files have different extensions (i.e. you want to clean any and all files under the current directory), remove the -name '*.txt' test:
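find . -type f -print0 |
    while IFS= read -r -d '' filename; do
        iconv -f utf-8 -t utf-8 -c "$filename" > "$filename.iconv_cleaned_utf8" &&
            mv "$filename.iconv_cleaned_utf8" "$filename"
    done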

It's a bit cleaner if you have the sponge utility from moreutils, but that is not installed by default.

find . -type f -name '*.txt' -print0 |
    while IFS= read -r -d '' filename; do
        iconv -f utf-8 -t utf-8 -c "$filename" | sponge "$filename"
    done
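Either way, you can spot-check the result afterwards: run iconv without -c, which stops with an error at the first invalid sequence instead of dropping it. For example (the filename is illustrative):

# exits non-zero and reports the offending byte position if the file
# still contains invalid UTF-8; stays silent on success
iconv -f utf-8 -t utf-8 some_file.txt > /dev/null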