UTF-8 is a variable-length encoding of Unicode, designed to be a superset of ASCII. See Wikipedia for the details of the encoding. \x00 \x01 \xF6 \x15 would be the UCS-4BE or UTF-32BE encoding (those four bytes encode U+1F615).
To get from the Unicode code point to the UTF-8 encoding, assuming the locale's charmap is UTF-8 (see the output of locale charmap), it's just:
$ printf '\U1F615\n'
😕
$ echo -e '\U1F615'
😕
$ confused_face=$'\U1F615'
The latter will be in the next version of the POSIX standard.
AFAIK, that syntax was introduced in 2000 by the stand-alone GNU printf utility (as opposed to the printf utility of the GNU shell), brought to the echo/printf/$'...' builtins first by zsh in 2003, then ksh93 in 2004 and bash in 2010 (though it did not work properly there until 2014), but was obviously inspired by other languages. ksh93 also supports it as printf '\x1f615\n' and printf '\u{1f615}\n'.
$'\uXXXX' and $'\UXXXXXXXX' are supported by zsh, bash, ksh93, mksh and FreeBSD sh, as well as GNU printf and GNU echo.
Some require all the digits (\U0001F615 as opposed to \U1F615), though that's likely to change in future versions, as POSIX will allow fewer digits. In any case, you need all the digits if the \UXXXXXXXX is to be followed by hexadecimal digits, as in \U0001F615FOX: \U1F615FOX would have been $'\U001F615F'OX.
Some expand to the characters in the current locale's encoding at the time the string is parsed or at the time it is expanded, some only in UTF-8 regardless of the locale. If the character is not available in the current locale's encoding, the behaviour varies between shells.
So, for best portability, it is best to use this only in UTF-8 locales, with all the digits, and inside $'...':
printf '%s\n' $'\U0001F615'
Note that:
LC_ALL=C.UTF-8; printf '%s\n' $'\U0001F615'
or:
{
LC_ALL=C.UTF-8
printf '%s\n' $'\U0001F615'
}
will not work with all shells (including bash), because the $'\U0001F615' is parsed before LC_ALL is assigned. (Also note that there's no guarantee that a system will have a locale called C.UTF-8.)
You'd need:
LC_ALL=C.UTF-8; eval "confused_face=$'\U0001F615'"
Or:
LC_ALL=C.UTF-8
printf '%s\n' $'\U0001F615'
(not within a compound command or function).
For the reverse, to get from the UTF-8 encoding to the Unicode code point, see this other question or that one.
$ unicode 😕
U+1F615 CONFUSED FACE
UTF-8: f0 9f 98 95  UTF-16BE: d83dde15  Decimal: &#128533;
😕
Category: So (Symbol, Other)
Bidi: ON (Other Neutrals)
$ perl -CA -le 'printf "%x\n", ord shift' 😕
1f615
Best Answer
You can use bash's echo or /bin/echo from GNU coreutils in combination with iconv:
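For example (a sketch assuming bash's printf, or GNU printf, for the \x escapes; U+1F615 is the surrogate pair d83d de15 in UTF-16BE):

```shell
# Emit the UTF-16BE bytes of U+1F615 and let iconv convert them
# to the locale's encoding (its default output charset)
printf '\xd8\x3d\xde\x15' | iconv -f UTF-16BE
```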
By default, iconv converts to your locale's encoding. Perhaps more portable than relying on a specific shell or echo command is Perl: almost any UNIX system I am aware of will have Perl available, and it even has several Windows ports.
Most of the time when I need to do this, I'm in an editor like Vim/GVim, which has built-in support. While in insert mode, hit Ctrl-V followed by u, then type four hex digits. If you want a character beyond U+FFFF, use a capital U and type eight hex digits. Vim also supports custom, easy-to-make keymaps, which convert one series of characters into another symbol. For example, I have a keymap I developed called www: it converts TM to ™, (C) to ©, (R) to ®, and so on. I also have a keymap for Klingon for when that becomes necessary. I'm sure Emacs has something similar.

If you are in a GTK+ app, which includes GVim and GNOME Terminal, you can try Ctrl-Shift-U followed by 4 hex digits to create a Unicode character. I'm sure KDE/Qt has something similar.
UPDATE: As of Bash 4.2, this seems to be a built-in feature now:
UPDATE: Also, nowadays a Python example would probably be preferred over Perl. This works in both Python 2 and 3:
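For instance, a sketch along these lines (the u'' literal with an 8-digit \U escape is valid in Python 2 and in Python 3.3+):

```python
# Print U+1F615 from its code point; works in Python 2 and 3
import sys

ch = u'\U0001F615'          # 32-bit escape, valid in both versions
if sys.version_info[0] >= 3:
    sys.stdout.write(ch + '\n')
else:
    # Python 2: encode explicitly so it also works when stdout is piped
    sys.stdout.write(ch.encode('utf-8') + '\n')
```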