Recursively remove all invalid characters from text files in place

Tags: command-line, terminal, utf-8

I have several thousand text files, some of which contain invalid UTF-8 characters. I want to recursively remove all invalid characters from these files in place.

I am aware that many similar questions have been asked before, such as "How to remove non UTF-8 characters from text file", but I have not found one that is both recursive and operates in place.

Best Answer

The great thing about UNIX commands is that you can combine them. iconv doesn't know how to recurse into directories, but find does, and it can call iconv on every file it finds.
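The core of the pipeline is iconv itself: with -c it silently discards any input it cannot convert, so converting from UTF-8 to UTF-8 strips exactly the invalid byte sequences. On a single file it looks like this (the filenames are just for illustration):

# -c drops bytes that are not valid UTF-8 instead of aborting;
# the cleaned copy goes to clean.txt
iconv -f utf-8 -t utf-8 -c dirty.txt > clean.txt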

(These commands convert every matching file in the current directory and in all directories beneath it, so make sure you are in the directory whose files you want to convert before running them.)

To change all files with the extension .txt:

find . -type f -name '*.txt' -print0 |
    while IFS= read -r -d '' filename; do
        iconv -f utf-8 -t utf-8 -c "$filename" > "$filename.iconv_cleaned_utf8" &&
            mv "$filename.iconv_cleaned_utf8" "$filename"
    done

I suppose this code requires some explanation. What it does is:

  • find prints the names of all matching files, separated by null bytes (the null byte is the only character that cannot appear in a file path)
  • bash reads the null-delimited filenames and loops over them
  • iconv converts each file into a tempfile with an extra extension, dropping anything it cannot convert
  • mv replaces the original file with the tempfile, but only if iconv succeeded (that is what the && is for)
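As an aside, you can get the same effect without a shell loop by letting find run a small inline script itself. A sketch of that variant, using the standard find -exec sh -c idiom:

find . -type f -name '*.txt' -exec sh -c '
    # the trailing "sh" becomes $0; the matched filenames arrive in "$@"
    for f in "$@"; do
        iconv -f utf-8 -t utf-8 -c "$f" > "$f.iconv_cleaned_utf8" &&
            mv "$f.iconv_cleaned_utf8" "$f"
    done
' sh {} +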

If the files have different extensions (i.e. you want to clean any and all files under the current directory), remove the -name '*.txt' test:
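find . -type f -print0 |
    while IFS= read -r -d '' filename; do
        iconv -f utf-8 -t utf-8 -c "$filename" > "$filename.iconv_cleaned_utf8" &&
            mv "$filename.iconv_cleaned_utf8" "$filename"
    done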

It's a bit cleaner if you have the sponge utility from moreutils, but that is not installed by default.

find . -type f -name '*.txt' -print0 |
    while IFS= read -r -d '' filename; do
        iconv -f utf-8 -t utf-8 -c "$filename" | sponge "$filename"
    done
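Either way, you can spot-check the result afterwards: run iconv without -c, which stops with an error at the first invalid sequence instead of dropping it. For example (the filename is illustrative):

# exits non-zero and reports the offending byte position if the file
# still contains invalid UTF-8; stays silent on success
iconv -f utf-8 -t utf-8 some_file.txt > /dev/null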