Mac OS X command-line application that can convert text encodings from one type to another (specifically, Mac OS Roman to UTF-8)

automation, command line, internationalization, macos, text

I would like to call a command line utility in Mac OS X 10.8 that gives me the ability to convert a text file saved in standard Western Mac OS Roman encoding to the more generic UTF-8.

I will be calling the utility from an AppleScript that I have created. AppleScript is extremely slow when working with very large text blocks, so I want to do my text parsing and conversion on the OS X command line. I have found a tool called "sed" that lets me do the text parsing. However, many elements of the file still need to be cleaned up: characters that appear as garbage if the file is opened as UTF-8 (e.g. smart quotes and ellipses).

I am thinking that forcing a text encoding conversion may help to eliminate all non-UTF-8 characters in the file. However, I cannot see how "sed" can easily convert the text encoding.

I will have already saved the temp txt file, as MacRoman, to disk using the built-in AppleScript routines.

Requirements:

  • Command-line for performance
  • Prefer native tools since other users of my script won't necessarily have the proper toolset if it's not built-in. (Although I could add a check to my script and abort if a needed tool isn’t present)

Do any of you have any ideas as to a built-in command-line tool that can convert text encoding or an existing package that is superior for this task?

Best Answer

Another way to convert non-ASCII characters to ASCII variants is to use iconv -t ASCII//TRANSLIT:

$ echo ‘’“”–—…äé | iconv -t ASCII//TRANSLIT
''""--..."a'e

ASCII//IGNORE would remove non-ASCII characters instead of transliterating them, but you can also do that with, for example, tr -dc '\0-\177'.
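
The commands above target plain ASCII. For the conversion the question actually asks about (Mac OS Roman to UTF-8), the same tool can probably be used by naming both encodings. The following is a minimal sketch, not a tested recipe for 10.8: it assumes your iconv accepts MACINTOSH as a name for Mac OS Roman (iconv -l lists the names your system knows), and the function name and file names are hypothetical.

```shell
#!/bin/sh

# Hypothetical helper: convert a Mac OS Roman file ($1) to UTF-8 ($2).
# Assumes this system's iconv knows the encoding name MACINTOSH.
macroman_to_utf8() {
    # -f names the source encoding, -t the target encoding.
    iconv -f MACINTOSH -t UTF-8 "$1" > "$2"
}

# Build a small Mac OS Roman sample. Bytes 0xD2, 0xD3 and 0xC9 (octal
# 322, 323, 311) are the curly double quotes and the ellipsis there.
printf 'A \322quoted\323 word\311\n' > sample_macroman.txt

macroman_to_utf8 sample_macroman.txt sample_utf8.txt

# In a UTF-8 terminal this should show the curly quotes and the ellipsis.
cat sample_utf8.txt
```

Because iconv reads and writes whole files, a single call like this could be issued from the AppleScript once the temp file has been saved, leaving sed to handle only the remaining parsing.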