Print out unicode values of stdin

command line; debugging; text; unicode

I use od to print the octal or hex values of a file, stdin, or a string. This lets me see the ASCII values, or the UTF-8 encoded bytes, of my input.

But we don't live in ASCIIland anymore. Is there a command that will print the Unicode code points for (presumably) UTF-8 encoded input? I want to know which Unicode characters I'm seeing.

Best Answer

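You can use this if you are on a little-endian system (od -tx4 reads each 4-byte word in the host's byte order, so the UCS-4 byte order that iconv emits has to match it):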

iconv -f utf-8 -t ucs-4le | od -tx4

or this if you are on a big-endian system:

iconv -f utf-8 -t ucs-4be | od -tx4
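
For readers who would rather not think about byte order, the same idea can be sketched in Python: decode the input as UTF-8 and print each code point in U+XXXX form. This is only an illustrative sketch, not part of the original answer. The function name describe_codepoints is made up here, and a fixed sample string stands in for stdin so the snippet runs as-is.

```python
import unicodedata
from typing import List


def describe_codepoints(data: bytes) -> List[str]:
    """Decode UTF-8 bytes and return one 'U+XXXX NAME' string per code point."""
    text = data.decode("utf-8")  # raises UnicodeDecodeError on invalid UTF-8
    lines = []
    for ch in text:
        # Control characters have no Unicode name, so fall back to a placeholder.
        name = unicodedata.name(ch, "<unnamed>")
        lines.append("U+{:04X} {}".format(ord(ch), name))
    return lines


# Hypothetical sample input: the letter 'a' followed by U+0965.
sample = "a\u0965".encode("utf-8")
for line in describe_codepoints(sample):
    print(line)
```

To process real input, pass sys.stdin.buffer.read() to the function instead of the sample bytes.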

Related Solutions

Bash – How to Convert Unicode Codepoint to Printable Character

You can use bash's echo or /bin/echo from GNU coreutils in combination with iconv:

echo -ne '\x09\x65' | iconv -f utf-16be

By default iconv converts to your locale's encoding. Perhaps more portable than relying on a specific shell or echo command is Perl. Almost every UNIX system I am aware of will have Perl available, and it even has several Windows ports.

perl -C -e 'print chr 0x0965'

Most of the time when I need to do this, I'm in an editor like Vim/GVim, which has built-in support. While in insert mode, hit Ctrl-V followed by u, then type four hex characters. If you want a character beyond U+FFFF, use a capital U and type 8 hex characters.

Vim also supports custom keymaps, which are easy to make. A keymap converts a series of characters to another symbol. For example, I have a keymap I developed called www; it converts TM to ™, (C) to ©, (R) to ®, and so on. I also have a keymap for Klingon for when that becomes necessary. I'm sure Emacs has something similar.

If you are in a GTK+ app, which includes GVim and GNOME Terminal, you can try Control-Shift-u followed by 4 hex characters to create a Unicode character. I'm sure KDE/Qt has something similar.

UPDATE: As of Bash 4.2, this seems to be a built-in feature:

echo $'\u0965'

UPDATE: Also, nowadays a Python example would probably be preferred to Perl. This works in both Python 2 and Python 3 (3.3 or later, where the u prefix was reinstated):

python -c 'print(u"\u0965")'
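
If the code point arrives as a hex string rather than as a literal escape, a small helper can do the conversion. The sketch below is only an illustration under that assumption, and char_from_hex is a hypothetical name. It also mirrors the echo and iconv approach by decoding the two bytes 09 65 as UTF-16BE, and it targets Python 3.

```python
def char_from_hex(hex_codepoint: str) -> str:
    """Convert a hex code point such as '0965' or 'U+1F600' to a character."""
    cleaned = hex_codepoint.strip()
    if cleaned.upper().startswith("U+"):
        cleaned = cleaned[2:]  # drop the optional 'U+' prefix
    return chr(int(cleaned, 16))  # chr() accepts values up to 0x10FFFF


# The same character obtained two ways: from its code point,
# and from its big-endian UTF-16 bytes (as in the iconv example).
from_codepoint = char_from_hex("0965")
from_bytes = bytes.fromhex("0965").decode("utf-16-be")
print(from_codepoint == from_bytes)
print(char_from_hex("U+0041"))
```

Because chr() takes the full Unicode range, the same helper covers characters beyond U+FFFF without a separate code path.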

How to do a regex search in a UTF-16LE file while in a UTF-8 locale

My answer is essentially the same as in your other question on this topic:

$ iconv -f UTF-16LE -t UTF-8 myfile.txt | grep pattern

As in the other question, you might need line ending conversion as well, but the point is that you should convert the file to the local encoding so you can use native tools directly.
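
The same convert-then-search idea can be sketched in Python, where opening the file in text mode handles both the decoding and the line ending conversion. This is a rough sketch rather than a replacement for the pipeline above: grep_utf16le is a hypothetical name, the demo writes its own temporary file, and it assumes the file has no byte order mark (with a BOM, the "utf-16" codec would be the one to try).

```python
import os
import re
import tempfile
from typing import List


def grep_utf16le(path: str, pattern: str) -> List[str]:
    """Return the lines of a UTF-16LE file that match a regular expression."""
    regex = re.compile(pattern)
    # Text mode decodes UTF-16LE and translates \r\n to \n while reading.
    with open(path, encoding="utf-16-le") as handle:
        return [line.rstrip("\n") for line in handle if regex.search(line)]


# Demo on a throwaway file with Windows line endings.
with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as tmp:
    tmp.write("first line\r\nsecond pattern line\r\n".encode("utf-16-le"))
try:
    print(grep_utf16le(tmp.name, r"pat+ern"))
finally:
    os.remove(tmp.name)  # clean up the temporary file
```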
