Bash – Using printf to Decode Unicode Characters in Arguments

bash, printf, unicode

I am trying to printf some Unicode code points that I pipe in, like this:

# one code per line
echo 0024 0025 | xargs -n1 echo |
  xargs printf '\u%s\n'

hoping to get this

$
%

but this is what I get

printf: missing hexadecimal number in escape

After some trial and error, I found that I actually have two smaller problems: one kind of makes sense, and the other seems like a complete mystery.

Problem 1:

printf '\u%s\n' 0024 0025

gives me this

-bash: printf: missing unicode digit for \u
\u0024
-bash: printf: missing unicode digit for \u
\u0025

Problem 2:

> # use built-in for $
> printf '\u0024\n'
$
> # use exe for $
> which printf
/usr/bin/printf
> /usr/bin/printf '\u0024\n'
$
> # now use built-in for %
> printf '\u0025\n'
%
> # but look what happens when we use exe for % !!!!
> /usr/bin/printf '\u0025\n'
/usr/bin/printf: invalid universal character name \u0025

(I am using > as the prompt instead of $ so that the $ in the output stands out.)

For some reason, some characters work with the executable version and some don't, even though all of them work with the built-in printf.

Here is a workaround that would work if it weren't for problem #2
(though it might be quite a bit slower than my original idea):

# one item per line
echo 0024 0025 | xargs -n1 echo |
  xargs -I {} printf '\u{}\n'

But due to problem #2, it only half works:

$ echo 0024 0025 | xargs -n1 echo | xargs -I {} printf '\u{}\n'
$
printf: invalid universal character name \u0025

($ comes out, but % produces an error.)

So I guess my questions are:

- Is there any way of making printf work with the number code, so that I can run printf once instead of once per argument with -I?

- What am I doing wrong that the printf built-in doesn't mind but the printf executable rejects, and only for % and not for $?

Best Answer

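printf scans the format string for escape sequences before it substitutes any %s, so \u is parsed with no hex digits after it (the double-expansion problem). To avoid this, you can use %b, which expands backslash escapes in the argument itself, at least in Bash printf: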

printf '%b\n' \\u0024 \\u0025

You can pre-process your input in various ways:

set 0024 0025
printf '%b\n' "${@/#/\\u}"
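The pre-processing step can also be wrapped in a small function. The following is a minimal sketch, not part of the original answer: the name decode_codepoints is hypothetical, and it assumes a Bash whose built-in printf understands \u (the built-in shown in the question does). It builds the escaped arguments in a loop and then calls printf once.

```shell
#!/usr/bin/env bash
# decode_codepoints is a hypothetical helper name, not a standard command.
# Assumes the Bash built-in printf supports \u escapes in %b arguments.
decode_codepoints() {
  local code
  local -a escaped=()
  for code in "$@"; do
    # "\\u" inside double quotes is a literal backslash followed by u
    escaped+=("\\u${code}")
  done
  # A single printf call; the format is reused for every argument
  if (( ${#escaped[@]} > 0 )); then
    printf '%b\n' "${escaped[@]}"
  fi
}
```

To feed it from a pipe, the codes can first be read into an array, for example with read -r -a codes <<< "0024 0025" followed by decode_codepoints "${codes[@]}".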

The standalone printf, as implemented in GNU coreutils, has the following restrictions on Unicode character specifications:

printf interprets two character syntaxes introduced in ISO C 99: ‘\u’ for 16-bit Unicode (ISO/IEC 10646) characters, specified as four hexadecimal digits hhhh, and ‘\U’ for 32-bit Unicode characters, specified as eight hexadecimal digits hhhhhhhh. printf outputs the Unicode characters according to the LC_CTYPE locale. Unicode characters in the ranges U+0000…U+009F, U+D800…U+DFFF cannot be specified by this syntax, except for U+0024 ($), U+0040 (@), and U+0060 (`).

This explains why you can’t produce % in this manner: U+0025 lies in the excluded range U+0000…U+009F and is not one of the three listed exceptions.
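Because xargs can only run external commands, it always picks the standalone printf and so hits this restriction. One possible way around it, sketched here under the assumption that bash is on the PATH and that its built-in printf supports \u, is to have xargs start bash and let the built-in do the decoding. The function name decode_via_xargs is hypothetical.

```shell
#!/usr/bin/env bash
# decode_via_xargs is a hypothetical helper name.
# It reads whitespace-separated code points from stdin. xargs passes them
# to one bash process (several only if the input is very long), where the
# built-in printf, not /usr/bin/printf, expands each \u escape.
decode_via_xargs() {
  xargs bash -c 'for code in "$@"; do printf "%b\n" "\\u$code"; done' _
}
```

With this in place, echo 0024 0025 | decode_via_xargs avoids both the per-argument process start of -I {} and the restricted range of the external printf.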

Related Solutions

Shell – Is it possible to use split to make character chunks out of Chinese unicode bytes

Each character is three bytes wide, as shown in this xxd output:

$ xxd chinese-bytes
0000000: e6b4 9ee5 baad e6b9 96                   .........

split -b3 works for me.

$ split -b3 chinese-bytes
$ echo xa?
xaa xab xac
$ cat xaa; echo
洞
$ cat xab; echo
庭
$ cat xac; echo
湖
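Note that split -b3 works here only because every character in the file happens to be three bytes long. A minimal sketch of a guard around it follows; the function name split_three_byte_chars and its parameters are hypothetical. The size check is a necessary condition only: a file that mixes 1-, 2-, and 4-byte characters could still have a size divisible by 3.

```shell
#!/usr/bin/env bash
# split_three_byte_chars is a hypothetical helper name.
# $1: input file, $2: existing directory that receives the chunks.
split_three_byte_chars() {
  local input=$1 outdir=$2
  local size
  size=$(wc -c < "$input")
  # Refuse files that cannot consist purely of 3-byte characters
  if (( size % 3 != 0 )); then
    echo "size $size is not a multiple of 3" >&2
    return 1
  fi
  # Chunks are named xaa, xab, ... inside the output directory
  split -b3 "$input" "$outdir/x"
}
```

The nine bytes from the xxd output above can be recreated for a quick check with printf '\xe6\xb4\x9e\xe5\xba\xad\xe6\xb9\x96' > chinese-bytes.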

Xterm not displaying unicode

Writing in 2016, talking about xterm patch #278 (released in 2012):

xterm uses a single font, rather than font sets which are supported by several other terminals. The pseudo-graphic characters in this (pasted from xterm):

⎛     ⎽⎽⎽⎽⎽⎽⎽   ⎞
⎜    ╱    3     ⎟
⎜   ╱    x      ⎟
⎜  ╱   ───── , 1⎟
⎝╲╱    x + 1    ⎠

are not provided by the TrueType font specified here:

xterm.vt100.faceName: Terminus
xterm.vt100.faceSize: 14

Other terminals, given that font, would provide those characters from another font.

The way to make xterm work is to

- specify a font which does cover all of the characters needed, and
- tell it to use UTF-8 encoding.

The latter is addressed for most users by the default setting of the locale resource: xterm will (usually) use UTF-8 encoding. But the default behavior is VT100-compatible, hence the use of ISO-8859-1 compatible fonts.

Terminus uses more glyphs than that, but falls far short of covering all pseudo-graphics in Unicode.
The ones that display as n are U+239B, U+239C, U+239D, U+239E, U+23A0.
The version of Terminus in Debian 7 (and Debian testing) has fewer than 256 glyphs and happens to show n as described in the question.

That happens because, although xterm knows that the glyphs are missing, it has printed the string using the font, assuming that (like most other fonts) missing entries will be shown as blanks. In this case, the freetype library seems to be mapping the low-order byte of the Unicode values into the range that Terminus supports. That happens to fall in a range that the font displays as n (for "no such character").

The quick workaround uses the uxterm script, which selects a different font and ensures that UTF-8 encoding is used.

Further reading:

uxterm - X terminal emulator for Unicode (UTF-8) environments
UXTerm.ad (X application resources used for uxterm)
Terminus Font Home Page

Terminus Font is a clean, fixed width bitmap font, designed for long (8 and more hours per day) work with computers. Version 4.40 contains 1241 characters, covers about 120 language sets and supports ISO8859-1/2/5/7/9/13/15/16, Paratype-PT154/PT254, KOI8-R/U/E/F, Esperanto, many IBM, Windows and Macintosh code pages, as well as the IBM VGA, vt100 and xterm pseudographic characters.

The above was talking about xterm patch #278 which was four years old in 2016. Development of xterm is ongoing, and beginning with patch #338 (late 2018) there is support for TrueType fontsets. Here is a screenshot using the OP's resource-settings from xterm patch #342 (#343 will probably be out "soon"):

Using the -report-fonts option, I see that it loaded these font-files (treating bold/italic as the "same" as normal, and using a second font for the special characters):

    file=/usr/share/fonts/X11/misc/ter-u18n_iso-8859-1.pcf.gz
    file=/usr/share/fonts/X11/misc/ter-u18b_iso-8859-1.pcf.gz
    file=/usr/share/fonts/X11/misc/ter-u18n_iso-8859-1.pcf.gz
    file=/usr/share/fonts/truetype/dejavu/DejaVuSansMono.ttf

The actual number of fonts depends on what you want to do. In testing the existing range of Unicode values, it may use a couple of dozen fonts.
