Linux – How to recode to UTF-8 conditionally

character-encoding, conversion, linux, unix, utf-8

I'm unifying the encoding of a large bunch of text files, gathered over time on different computers. I'm mainly going from ISO-8859-1 to UTF-8. This nicely converts one file:

recode ISO-8859-1..UTF-8 file.txt

I of course want to do automated batch processing for all the files, and simply running the above for each file has the problem that files that are already encoded in UTF-8 will have their encoding broken. (For instance, the character 'ä' originally in ISO-8859-1 will appear like this, viewed as UTF-8, if the above recode is done zero, one, and two times: � -> ä -> Ã¤)

My question is, what kind of script would run recode only if needed, i.e.
only for files that weren't already in the target encoding (UTF-8 in my case)?

From looking at the recode man page, I couldn't figure out how to do something like this. So I guess this boils down to how to easily check the encoding of a file, or at least whether it's UTF-8 or not. This answer implies you could recognise valid UTF-8 files with recode, but how? Any other tool would be fine too, as long as I can use the result in a conditional in a bash script…
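As an aside, one common way to test "is this file valid UTF-8?" in a bash conditional, independent of recode, is to let iconv attempt a UTF-8-to-UTF-8 conversion and check its exit status. This is a sketch assuming GNU iconv, which exits non-zero on invalid input:

# iconv fails if file.txt contains byte sequences that are not valid UTF-8
if iconv -f UTF-8 -t UTF-8 file.txt > /dev/null 2>&1; then
    echo "file.txt is already valid UTF-8"
else
    recode ISO-8859-1..UTF-8 file.txt
fi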

Best Answer

This question is quite old, but I think I can contribute to this problem:
First, create a script named recodeifneeded:

#!/bin/bash
# Usage: recodeifneeded <target-encoding> <file>

# Detect the file's current encoding. file -i prints something like
# "file.txt: text/plain; charset=iso-8859-1", so strip everything up to "charset=".
encoding=$(file -i "$2" | sed "s/.*charset=\(.*\)$/\1/")

if [ "$1" != "$encoding" ]
then
    # Encodings differ, so we have to recode (in place)
    echo "recoding from $encoding to $1: $2"
    recode "$encoding..$1" "$2"
fi

You can use it this way:

recodeifneeded utf-8 file.txt
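On a file that file -i detects as iso-8859-1, this prints something like "recoding from iso-8859-1 to utf-8: file.txt" and converts it; an already-UTF-8 file is left untouched. One caveat: file -i reports pure-ASCII files as us-ascii, so those get "recoded" as well, which is harmless since ASCII is a subset of UTF-8.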

So, if you'd like to run it recursively and change the encoding of all *.txt files to (let's say) utf-8:

find . -name "*.txt" -exec recodeifneeded utf-8 {} \;
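If you'd rather preview which files would actually be converted before touching anything, a quick dry run using the same file -i detection the script relies on could look like this:

# Dry run: list each .txt file with its detected charset, converting nothing
find . -name "*.txt" -exec file -i {} \;

Anything not already reported as charset=utf-8 (or us-ascii) would be recoded by the command above.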

I hope this helps.
