BASH print question (printf \\$(printf '%03o' $1))

Tags: ascii, bash, linux, printf

I used the following to convert between an integer and its ASCII character in bash, but I do not understand how printf \\$(printf '%03o' $1) and printf '%d' "'$1" work. Please explain how they work.

#!/bin/bash
# chr() - converts decimal value to its ASCII character representation
# ord() - converts ASCII character to its decimal value

chr() {
  printf \\$(printf '%03o' $1)
}

ord() {
  printf '%d' "'$1"
}

ord A
echo
chr 65
echo

Best Answer

printf '\101', where 101 is an octal number, outputs the byte with that value.

When sent to an ASCII terminal, that will be rendered as A as A is character 65 (octal 101) in ASCII and all ASCII-compatible character sets (which includes most modern charsets with the exception of the EBCDIC ones still used on some IBM systems).
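A quick way to see this (od is used here only to show the numeric byte value):

```shell
# \101 is octal for decimal 65, so printf emits that single byte,
# which an ASCII terminal renders as A
printf '\101\n'   # prints A
# dump the same byte as an unsigned decimal to confirm it is 65
printf '\101' | od -An -tu1
```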

In

printf \\$(printf '%03o' $1)

which should have been written as:

printf "\\$(printf '%03o' "$1")"

as leaving parameter expansions (like $1) or command substitutions ($(...)) unquoted invokes the split+glob operator in Bourne-like shells, which is not wanted here.

  • printf '%03o' "$1" converts the number in $1 to a 3 digit octal
  • printf "\\$(...)" appends that octal to a \ (\\ inside double quotes becomes \) and passes that to printf so it will output the corresponding byte value.

Note that it only works in locales where the charset is one byte per character (like iso8859-1) or, in locales with a multi-byte charset, only for values 0 to 127.

In bash,

printf '%d\n' "'A"

prints the Unicode code-point of character A (or at least the value returned by mbtowc() which on GNU systems at least is the Unicode code-point).

Some other implementations (including the standalone GNU printf utility) instead return the value of the first byte of the character.

For ASCII characters like A and on ASCII-based systems, that doesn't make any difference, but for others it matters. For instance the Greek α character (U+03B1) is encoded as:

  • byte 225 in iso8859-7 (the standard Greek single-byte charset)
  • bytes 206 177 in UTF-8 (the most commonly used encoding of Unicode on Unix-like systems)
  • bytes 166 193 in GB18030 (the official Chinese encoding of Unicode).
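You can check those byte values against whatever encoding your own system uses; for example, in a UTF-8 environment (od dumps the bytes as unsigned decimals):

```shell
# printf %s outputs the character's bytes verbatim;
# in UTF-8, α is the two-byte sequence 206 177
printf %s 'α' | od -An -vtu1
```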

Bash's printf '%d\n' "'α" will always output 945 (0x03b1 in hexadecimal), which is the Unicode code point of α regardless of the locale (at least on GNU systems), but others may return 225, 206 or 166 depending on the locale.

You can see from that that those chr and ord are the reverse of each other only for ASCII characters (values 0 to 127), or, in locales using the iso8859-1 character set, for all characters (values 0 to 255).
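A quick round-trip check of the original chr and ord within the ASCII range (the functions are redefined here so the example is self-contained):

```shell
# the original definitions from the question
chr() { printf "\\$(printf '%03o' "$1")"; }
ord() { printf '%d' "'$1"; }

ord A; echo          # prints 65
chr 65; echo         # prints A
chr "$(ord Z)"; echo # round-trips back to Z within ASCII
```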

If ord() is meant to return the Unicode code point, then the reverse (print the character corresponding to a Unicode code point) would be:

chr() {
  printf "\U$(printf %08X "$1")"
}

(assuming bash 4.3 or above; \UXXXXXXXX was added in 4.2, but didn't work properly for characters U+0080 to U+00FF until 4.3).

Then, in any locale:

$ ord α
945
$ chr 945
α

Or for ord() to return the values of the bytes of the encoding of a given character (in the current locale):

ord() {
  printf %s "$1" | od -An -vtu1
}

And for chr() to output those bytes:

chr() {
  printf "$(printf '\\%o' "$@")"
}

Then, in a UTF-8 locale for instance:

$ ord α
 206 177
$ chr 206 177
α

(your ord α would give 945, your chr would give garbage for both chr 945 and chr 206 177).

Or in a locale using iso8859-7:

$ ord α
 225
$ chr 225
α

(your ord α would give 945, though could give 225 if printf was replaced with /usr/bin/printf if on a GNU system).
