Linux – How to recode to UTF-8 conditionally

character-encoding, conversion, linux, unix, utf-8

I'm unifying the encoding of a large bunch of text files, gathered over time on different computers. I'm mainly going from ISO-8859-1 to UTF-8. This nicely converts one file:

recode ISO-8859-1..UTF-8 file.txt

I of course want to do automated batch processing for all the files, and simply running the above for each file has the problem that files that are already encoded in UTF-8 will have their encoding broken. (For instance, the character 'ä' originally in ISO-8859-1 will appear like this, viewed as UTF-8, if the above recode is done zero, one, and two times: � -> ä -> Ã¤)

My question is, what kind of script would run recode only if needed, i.e.
only for files that weren't already in the target encoding (UTF-8 in my case)?

From looking at the recode man page, I couldn't figure out how to do something like this. So I guess this boils down to how to easily check the encoding of a file, or at least whether it's UTF-8 or not. This answer implies you could recognise valid UTF-8 files with recode, but how? Any other tool would be fine too, as long as I can use the result in a conditional in a bash script…
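As an aside, one common way to test "is this file valid UTF-8?" in a bash conditional, independent of recode, is to let iconv attempt a UTF-8-to-UTF-8 conversion and check its exit status. This is a sketch assuming GNU iconv, which exits non-zero on invalid input:

# iconv fails if file.txt contains byte sequences that are not valid UTF-8
if iconv -f UTF-8 -t UTF-8 file.txt > /dev/null 2>&1; then
    echo "file.txt is already valid UTF-8"
else
    recode ISO-8859-1..UTF-8 file.txt
fi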

Best Answer

This question is quite old, but I think I can contribute to this problem:
First, create a script named recodeifneeded:

#!/bin/bash
# Usage: recodeifneeded <target-encoding> <file>

# Detect the file's current encoding. file -i prints something like
# "file.txt: text/plain; charset=iso-8859-1", so strip everything up to "charset=".
encoding=$(file -i "$2" | sed "s/.*charset=\(.*\)$/\1/")

if [ "$1" != "$encoding" ]
then
    # Encodings differ, so we have to recode (in place)
    echo "recoding from $encoding to $1: $2"
    recode "$encoding..$1" "$2"
fi

You can use it this way:

recodeifneeded utf-8 file.txt
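On a file that file -i detects as iso-8859-1, this prints something like "recoding from iso-8859-1 to utf-8: file.txt" and converts it; an already-UTF-8 file is left untouched. One caveat: file -i reports pure-ASCII files as us-ascii, so those get "recoded" as well, which is harmless since ASCII is a subset of UTF-8.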

So, if you'd like to run it recursively and change the encoding of all *.txt files to (let's say) utf-8:

find . -name "*.txt" -exec recodeifneeded utf-8 {} \;
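If you'd rather preview which files would actually be converted before touching anything, a quick dry run using the same file -i detection the script relies on could look like this:

# Dry run: list each .txt file with its detected charset, converting nothing
find . -name "*.txt" -exec file -i {} \;

Anything not already reported as charset=utf-8 (or us-ascii) would be recoded by the command above.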

I hope this helps.
