Shell – How to convert an emoticon specified by a U+xxxxx code to utf-8

character encoding, shell, unicode

Emoticons seem to be specified using a format of U+xxxxx
wherein each x is a hexadecimal digit.

For example, U+1F615 is the official Unicode Consortium code for the "confused face" 😕

As I am often confused, I have a strong affinity for this symbol.

The U+1F615 representation is confusing to me because I thought the only encodings possible for Unicode characters required 8, 16, 24 or 32 bits, whereas 5 hex digits require 5×4=20 bits.

I've discovered that this symbol seems to be represented by a completely different hex string in bash:

$ echo -n 😕 | hexdump
0000000 f0 9f 98 95
0000004

$ echo -e "\xf0\x9f\x98\x95"
😕

$ PS1=$'\xf0\x9f\x98\x95  >'
😕  >

I would have expected U+1F615 to convert to something like \x00 \x01 \xF6 \x15.

I don't see the relationship between these 2 encodings.

When I look up a symbol in the official Unicode Consortium list, I would like to be able to use that code directly without having to convert it manually in this tedious fashion, i.e.:

finding the symbol on some web page
copying it to the clipboard of the web browser
pasting it in bash to echo through a hexdump to discover the REAL code.

Can I use this 20-bit code to determine what the 32-bit code is?

Does a relationship exist between these 2 numbers?

Best Answer

UTF-8 is a variable-length encoding of Unicode. It is designed to be a superset of ASCII. See Wikipedia for details of the encoding. \x00 \x01 \xF6 \x15 would be the UCS-4BE or UTF-32BE encoding.
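To make the relationship between the code point and the four bytes concrete, here is a minimal sketch of the UTF-8 bit layout using only shell integer arithmetic. The function name cp_to_utf8_hex is made up for this illustration, and the sketch assumes a valid code point: it does not reject surrogates or values above U+10FFFF.

```shell
# Sketch: encode one Unicode code point as UTF-8, printing the bytes in hex.
# cp_to_utf8_hex is a hypothetical helper name, not a standard utility.
cp_to_utf8_hex() {
  cp=$(( 0x${1#U+} ))    # accepts "U+1F615" or plain "1F615"
  if [ "$cp" -lt 128 ]; then
    # 1 byte:  0xxxxxxx
    printf '%02x\n' "$cp"
  elif [ "$cp" -lt 2048 ]; then
    # 2 bytes: 110xxxxx 10xxxxxx
    printf '%02x %02x\n' \
      "$(( 0xC0 | ($cp >> 6) ))" \
      "$(( 0x80 | ($cp & 0x3F) ))"
  elif [ "$cp" -lt 65536 ]; then
    # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    printf '%02x %02x %02x\n' \
      "$(( 0xE0 | ($cp >> 12) ))" \
      "$(( 0x80 | (($cp >> 6) & 0x3F) ))" \
      "$(( 0x80 | ($cp & 0x3F) ))"
  else
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    printf '%02x %02x %02x %02x\n' \
      "$(( 0xF0 | ($cp >> 18) ))" \
      "$(( 0x80 | (($cp >> 12) & 0x3F) ))" \
      "$(( 0x80 | (($cp >> 6) & 0x3F) ))" \
      "$(( 0x80 | ($cp & 0x3F) ))"
  fi
}

cp_to_utf8_hex U+1F615    # prints: f0 9f 98 95
```

For U+1F615 the payload bits are split 3+6+6+6 across the four bytes, each group sitting behind a fixed marker prefix, which is why the hex digits 1F615 do not appear literally in f0 9f 98 95.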

To get from the Unicode code point to the UTF-8 encoding, assuming the locale's charmap is UTF-8 (see the output of locale charmap), it's just:

$ printf '\U1F615\n'
😕
$ echo -e '\U1F615'
😕
$ confused_face=$'\U1F615'

The latter form ($'...') will be in the next version of the POSIX standard.

AFAIK, that syntax was introduced in 2000 by the stand-alone GNU printf utility (as opposed to the printf builtin of the GNU shell). It was then brought to the echo/printf/$'...' builtins first by zsh in 2003, then by ksh93 in 2004 and by bash in 2010 (though it did not work properly there until 2014). It was obviously inspired by other languages.

ksh93 also supports it as printf '\x1f615\n' and printf '\u{1f615}\n'.

$'\uXXXX' and $'\UXXXXXXXX' are supported by zsh, bash, ksh93, mksh and FreeBSD sh, GNU printf, GNU echo.

Some require all the digits (as in \U0001F615 as opposed to \U1F615) though that's likely to change in future versions as POSIX will allow fewer digits. In any case, you need all the digits if the \UXXXXXXXX is to be followed by hexadecimal digits as in \U0001F615FOX, as \U1F615FOX would have been $'\U001F615F'OX.
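As a small illustration of the "use all the digits" advice, the sketch below turns a U+xxxxx code as found in the Unicode charts into the fully padded \UXXXXXXXX form. The helper name u_escape is hypothetical; it only builds the escape text and does not itself depend on the shell understanding \U.

```shell
# Sketch: turn a "U+xxxxx" code into the zero-padded \UXXXXXXXX escape text.
# u_escape is a hypothetical helper name.
u_escape() {
  # "\\" in the printf format yields one literal backslash; %08X pads to 8 digits.
  printf '\\U%08X\n' "$(( 0x${1#U+} ))"
}

u_escape U+1F615    # prints: \U0001F615
```

In a shell whose printf understands \U (see the list above), the result can then be used as a format string, for example printf "$(u_escape U+1F615)\n", assuming a UTF-8 locale.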

Some expand to the characters in the current locale's encoding at the time the string is parsed or at the time it is expanded, some only in UTF-8 regardless of the locale. If the character is not available in the current locale's encoding, the behaviour varies between shells.

So, for best portability, use it only in UTF-8 locales, write all eight digits, and use it inside $'...':

printf '%s\n' $'\U0001F615'
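The "only in UTF-8 locales" condition can be checked up front with locale charmap, mentioned earlier. The following is a rough sketch: is_utf8_charmap is a hypothetical helper, and the optional argument that overrides the detected charmap is an assumption added here so the check can be exercised without changing the locale.

```shell
# Sketch: succeed only if the locale's charmap is UTF-8.
# is_utf8_charmap is a hypothetical helper. With no argument it asks
# `locale charmap`; an argument overrides that value (useful for testing).
is_utf8_charmap() {
  charmap=${1:-$(locale charmap 2>/dev/null)}
  case $charmap in
    UTF-8|utf-8|UTF8|utf8) return 0 ;;
    *) return 1 ;;
  esac
}

if is_utf8_charmap; then
  printf '%s\n' 'charmap is UTF-8'
else
  printf '%s\n' 'charmap is not UTF-8, output of the escapes may vary between shells'
fi
```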

Note that:

LC_ALL=C.UTF-8; printf '%s\n' $'\U0001F615'

or:

{
  LC_ALL=C.UTF-8
  printf '%s\n' $'\U0001F615'
}

will not work in every shell (bash is one where it fails), because $'\U0001F615' is parsed before LC_ALL is assigned. Also note that there is no guarantee that a system has a locale called C.UTF-8.

You'd need:

LC_ALL=C.UTF-8; eval "confused_face=$'\U0001F615'"

Or:

LC_ALL=C.UTF-8
printf '%s\n' $'\U0001F615'

(not within a compound command or function).

For the reverse, to get from the UTF-8 encoding to the Unicode code-point, see this other question or that one.

$ unicode 😕
U+1F615 CONFUSED FACE
UTF-8: f0 9f 98 95  UTF-16BE: d83dde15  Decimal: &#128533;
😕
Category: So (Symbol, Other)
Bidi: ON (Other Neutrals)

$ perl -CA -le 'printf "%x\n", ord shift' 😕
1f615
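The reverse direction can also be done by hand once the bytes are known, for instance from a hexdump. This is a minimal sketch with a made-up helper name, utf8_hex_to_cp; it assumes its arguments are the hex bytes of exactly one well-formed UTF-8 sequence and performs no validation.

```shell
# Sketch: decode the hex bytes of ONE UTF-8 sequence into a U+ code point.
# utf8_hex_to_cp is a hypothetical helper name; input is not validated.
utf8_hex_to_cp() {
  first=$(( 0x$1 ))
  # Strip the length marker from the lead byte, keeping only payload bits.
  if   [ "$first" -lt 128 ]; then cp=$first                   # 0xxxxxxx
  elif [ "$first" -ge 240 ]; then cp=$(( $first & 0x07 ))     # 11110xxx
  elif [ "$first" -ge 224 ]; then cp=$(( $first & 0x0F ))     # 1110xxxx
  else                            cp=$(( $first & 0x1F ))     # 110xxxxx
  fi
  shift
  # Each continuation byte (10xxxxxx) contributes its low 6 bits.
  for byte in "$@"; do
    cp=$(( ($cp << 6) | (0x$byte & 0x3F) ))
  done
  printf 'U+%04X\n' "$cp"
}

utf8_hex_to_cp f0 9f 98 95    # prints: U+1F615
```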

Related Solutions

How to convert to HTML code

The Perl CGI module has an escapeHTML function that makes it pretty easy:

perl -e 'use CGI qw(escapeHTML); print escapeHTML("<hi>\n");'

Or to do an entire file:

perl -p -e 'BEGIN { use CGI qw(escapeHTML); } $_ = escapeHTML($_);'  FILENAME
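Where the CGI module is not installed, a rough shell-only approximation is possible with sed. This is only a sketch: html_escape is a hypothetical helper that replaces the five characters most commonly escaped in HTML, and it is not claimed to match escapeHTML in every detail.

```shell
# Sketch: escape &, <, >, " and ' for HTML, reading stdin and writing stdout.
# html_escape is a hypothetical helper name.
# The & rule must run first so that the entities produced by the later
# rules are not escaped a second time.
html_escape() {
  sed -e 's/&/\&amp;/g' \
      -e 's/</\&lt;/g' \
      -e 's/>/\&gt;/g' \
      -e 's/"/\&quot;/g' \
      -e "s/'/\\&#39;/g"
}

printf '%s\n' '<hi>' | html_escape    # prints: &lt;hi&gt;
```

To process an entire file, redirect it into the function: html_escape < FILENAME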

How to set VIM’s default encoding to UTF-8

When Vim reads an existing file, it tries to detect the file encoding. When writing out the file, Vim uses the file encoding that it detected (except when you tell it differently). So a file detected as UTF-8 is written as UTF-8, a file detected as Latin-1 is written as Latin-1, and so on.

By default, the detection process is crude. Every file that you open with Vim will be assumed to be Latin-1, unless it detects a Unicode byte-order mark at the top. A UTF-8 file without a byte-order mark will be hard to edit because any multibyte characters will be shown in the buffer as character sequences instead of single characters.

Worse, Vim, by default, uses Latin-1 to represent the text in the buffer. So a UTF-8 file with a byte-order mark will be corrupted by down-conversion to Latin-1.

The solution is to configure Vim to use UTF-8 internally. This is, in fact, recommended in the Vim documentation, and the only reason it is not configured that way out of the box is to avoid creating enormous confusion among users who expect Vim to operate basically as a Latin-1 editor.

In your .vimrc, add set encoding=utf-8 and restart Vim.

Or instead, set the LANG environment variable to indicate that UTF-8 is your preferred character encoding. This will affect not just Vim but any software which relies on LANG to determine how it should represent text. For example, to indicate that text should appear in English (en), as spoken in the United States (US), encoded as UTF-8 (utf-8), set LANG=en_US.utf-8.

Now Vim will use UTF-8 to represent the text in the buffer. Plus, it will also make a more determined effort to detect the UTF-8 encoding in a file. Besides looking for a byte-order mark, it will also check for UTF-8 without a byte-order mark before falling back to Latin-1. So it will no longer corrupt a file coded in UTF-8, and it should properly display the UTF-8 characters during the editing session.

For more information on how Vim detects the file encoding, see the fileencodings option in the Vim documentation.

For more information on setting the encoding that Vim uses internally, see the encoding option.

If you need to override the encoding used when writing a file back to disk, see the fileencoding option.
