I'm unifying the encoding of a large bunch of text files, gathered over time on different computers. I'm mainly going from ISO-8859-1 to UTF-8. This nicely converts one file:
recode ISO-8859-1..UTF-8 file.txt
I of course want to do automated batch processing for all the files, and simply running the above on each file has the problem that files already encoded in UTF-8 will have their encoding broken. (For instance, the character 'ä', originally in ISO-8859-1, will appear like this when viewed as UTF-8, if the above recode is done twice: � -> ä -> Ã¤.)
My question is, what kind of script would run recode only if needed, i.e.
only for files that weren't already in the target encoding (UTF-8 in my case)?
From looking at the recode man page, I couldn't figure out how to do something like this. So I guess this boils down to how to easily check the encoding of a file, or at least whether it's UTF-8 or not. This answer implies you could recognise valid UTF-8 files with recode, but how? Any other tool would be fine too, as long as I could use the result in a conditional in a bash script…
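To make it concrete, this is the shape of conditional I'm after. The -b --mime-encoding output of file(1) looks like it could drive it (it prints just the charset name), though I don't know how reliable its detection is:

```shell
# "hällo" written as UTF-8 bytes (0xC3 0xA4 is the UTF-8 form of ä)
printf 'h\xc3\xa4llo\n' > sample.txt

# file -b --mime-encoding prints only the charset, e.g. "utf-8"
if [ "$(file -b --mime-encoding sample.txt)" = "utf-8" ]; then
    echo "already UTF-8, skip"
else
    echo "needs recoding"
fi
```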
Best Answer
This question is quite old, but I think I can contribute to this problem:
First, create a script named recodeifneeded:
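A sketch of what the script could contain, using file's -b --mime-encoding output as the source charset (recode accepts names like iso-8859-1 and utf-8, though this mapping isn't guaranteed for every charset file can report). Written as a heredoc so you can paste it into a shell as-is:

```shell
# create the helper script: recodeifneeded FILE TARGET-ENCODING
cat > recodeifneeded <<'EOF'
#!/bin/sh
# Recode $1 to encoding $2 only when it is not already in that encoding.
current=$(file -b --mime-encoding "$1")

# leave binary files and files already in the target encoding alone
if [ "$current" != "$2" ] && [ "$current" != "binary" ]; then
    recode "$current..$2" "$1"
fi
EOF
chmod +x recodeifneeded
```

recode rewrites the file in place, so a second run on an already-converted file is a no-op: file then reports the target encoding and the branch is skipped.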
You can use it this way: recodeifneeded file.txt utf-8 (the file first, then the target encoding).
So, if you'd like to run it recursively and change the encodings of all *.txt files to (let's say) utf-8:
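With the script executable and on your PATH, the recursive run is just find with -exec, i.e. find . -name '*.txt' -exec recodeifneeded {} utf-8 \; — sketched below with the same encoding check inlined through sh -c so the snippet runs on its own:

```shell
# demo tree with one .txt file that is already UTF-8 (so it is left alone)
mkdir -p demo/sub
printf 'h\xc3\xa4llo\n' > demo/sub/note.txt

# same effect as: find demo -name '*.txt' -exec recodeifneeded {} utf-8 \;
find demo -name '*.txt' -exec sh -c '
    enc=$(file -b --mime-encoding "$1")
    if [ "$enc" != "utf-8" ] && [ "$enc" != "binary" ]; then
        recode "$enc..utf-8" "$1"
    fi' _ {} \;
```

Each file is checked individually, so a tree with a mix of ISO-8859-1 and UTF-8 files comes out uniformly UTF-8, and re-running the command is harmless.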
I hope this helps.