Bash – Using printf to Decode Unicode Characters in Arguments

bash printf unicode

I am trying to printf some Unicode code points that I pipe in, like this:

# split the input into one code per line, then pass all codes to a single printf
echo 0024 0025 | xargs -n1 echo | xargs printf '\u%s\n'

hoping to get this

$
%

but this is what I get

printf: missing hexadecimal number in escape

After some trial and error, I found that I actually have two smaller problems: one kind of makes sense, and the other is a complete mystery to me.


Problem 1:

printf '\u%s\n' 0024 0025

gives me this

-bash: printf: missing unicode digit for \u
\u0024
-bash: printf: missing unicode digit for \u
\u0025

Problem 2:

> # use built-in for $
> printf '\u0024\n'
$
> # use exe for $
> which printf
/usr/bin/printf
> /usr/bin/printf '\u0024\n'
$
> # now use built-in for %
> printf '\u0025\n'
%
> # but look what happens when we use exe for % !!!!
> /usr/bin/printf '\u0025\n'
/usr/bin/printf: invalid universal character name \u0025

(I am using > as the prompt instead of $ so that the $ in the output stands out.)

For some reason, some characters work with the executable (/usr/bin/printf) and some don't, even though all of them work with the built-in printf.


Here is a workaround that would work if it weren't for problem #2
(though it is probably quite a bit slower than my original idea, since it runs printf once per code):

# split the input into one item per line, then run printf once per item
echo 0024 0025 | xargs -n1 echo | xargs -I {} printf '\u{}\n'

But because of problem #2, it only half works:

$ echo 0024 0025 | xargs -n1 echo | xargs -I {} printf '\u{}\n'
$
printf: invalid universal character name \u0025

($ comes out but % gets error)


So I guess my questions are:

- Is there any way to make printf accept the code as an argument, so that I can run printf once instead of once per argument with -I?

- What am I doing wrong that the printf built-in accepts but the printf executable rejects, and only for % and not for $?

Best Answer

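The first problem is one of ordering: printf parses the \u escape in the format string before it substitutes %s, so \u never sees any hex digits. To avoid this, you can pass the complete escape sequence as an argument and let %b expand it, at least with the Bash built-in printf: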

printf '%b\n' \\u0024 \\u0025

You can pre-process your input in various ways, for example by prefixing each positional parameter with \u:

set 0024 0025
printf '%b\n' "${@/#/\\u}"
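The same idea can be applied to codes arriving on a pipe. The following is only a sketch, assuming Bash 4.2 or newer (where the built-in printf understands \u); decode_codes is a hypothetical helper name, not something from the question:

```shell
# decode_codes (hypothetical helper): read whitespace-separated hex code
# points from stdin and print one character per line, using a single
# call to the Bash built-in printf.
decode_codes() {
    local -a codes
    # Read all of stdin into an array; read returns non-zero at EOF when
    # no NUL delimiter is found, so ignore its exit status.
    read -r -d '' -a codes || true
    # Nothing to do for empty input.
    (( ${#codes[@]} )) || return 0
    # Prefix every element with \u and let %b expand the escapes.
    printf '%b\n' "${codes[@]/#/\\u}"
}

echo 0024 0025 | decode_codes
# should print:
# $
# %
```

Because the function runs in the shell itself, it uses the built-in printf; anything started through xargs would use the external executable instead.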

The standalone printf, as implemented in GNU coreutils, has the following restrictions on Unicode character specifications:

printf interprets two character syntaxes introduced in ISO C 99: ‘\u’ for 16-bit Unicode (ISO/IEC 10646) characters, specified as four hexadecimal digits hhhh, and ‘\U’ for 32-bit Unicode characters, specified as eight hexadecimal digits hhhhhhhh. printf outputs the Unicode characters according to the LC_CTYPE locale. Unicode characters in the ranges U+0000…U+009F, U+D800…U+DFFF cannot be specified by this syntax, except for U+0024 ($), U+0040 (@), and U+0060 (`).

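This explains the second problem: U+0025 (%) lies in the excluded range U+0000…U+009F and is not one of the three exceptions, while U+0024 ($) is. Note that xargs always runs the external /usr/bin/printf, never the shell built-in, which is why your pipelines hit this restriction.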
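To make the quoted rule concrete, here is a small Bash sketch that mirrors it. The function name coreutils_allows_u is made up for illustration, and the logic is based only on the documentation excerpt above, not on the coreutils source:

```shell
# coreutils_allows_u (hypothetical helper): succeed if the documented rule
# would let the standalone printf accept \u with this hex code point.
coreutils_allows_u() {
    local cp=$(( 16#$1 ))   # convert the hex string to an integer
    # Documented exceptions: U+0024 ($), U+0040 (@), U+0060 (backtick)
    if (( cp == 0x24 || cp == 0x40 || cp == 0x60 )); then
        return 0
    fi
    # Excluded ranges: U+0000..U+009F and U+D800..U+DFFF
    if (( cp <= 0x9F || (cp >= 0xD800 && cp <= 0xDFFF) )); then
        return 1
    fi
    return 0
}

for code in 0024 0025 00E9; do
    if coreutils_allows_u "$code"; then
        echo "U+$code: allowed"
    else
        echo "U+$code: rejected"
    fi
done
# prints:
# U+0024: allowed
# U+0025: rejected
# U+00E9: allowed
```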