Bash – printf: multibyte characters

bash, character encoding, printf

When formatting printf output that involves strings containing multibyte characters, it became clear that printf counts bytes rather than characters when it applies a field width. That makes aligning text difficult when single-byte and multibyte characters are mixed. For example:

$ cat script
#!/bin/bash
declare -a a b
a+=("0")
a+=("00")
a+=("000")
a+=("0000")
a+=("00000")
b+=("0")
b+=("├─00")
b+=("├─000")
b+=("├─0000")
b+=("└─00000")
printf "%-15s|\n" "${a[@]}" "${b[@]}"

$ ./script
0              |
00             |
000            |
0000           |
00000          |
0              |
├─00       |
├─000      |
├─0000     |
└─00000    |
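
The spacing follows directly from the byte counts: in UTF-8, each of ├, ─ and └ takes three bytes, so "├─00" is 4 characters but 8 bytes, and %-15s adds 15 - 8 = 7 spaces instead of 11. A quick check of that arithmetic (a sketch only; the character count assumes a UTF-8 locale, and the variable names are purely illustrative):

```bash
#!/bin/bash
s='├─00'
chars=${#s}                                 # characters in the current locale
bytes=$(( $(printf '%s' "$s" | wc -c) ))    # wc -c counts bytes in any locale
echo "chars=$chars bytes=$bytes"
echo "spaces added by %-15s: $((15 - bytes))"
```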

I found various suggested work-arounds (mainly wrappers using another language or utility to print the text). Are there any native bash solutions? None of the documented printf format specifiers appear to help. Would the locale settings be relevant in this situation, e.g., to use a fixed-width character encoding like UTF-32?

Best Answer

You could work around it by telling the terminal to move the cursor to the desired position, instead of having printf count the characters:

$ printf "%s\033[10G-\n" "abc" "├─cd" "└──ef"
abc      -
├─cd     -
└──ef    -

Well, assuming you're printing to a terminal, that is...

The control sequence there is <ESC>[nnG (Cursor Horizontal Absolute), where nn is the column to move to, in decimal, counting from 1.

Of course, if the first column is longer than the allocated space, the result isn't too nice:

$ printf "%s\033[10G-\n" "abcdefghijkl"
abcdefghi-kl

To work around that, you could explicitly clear the rest of the line (<ESC>[K) before printing the following column, which truncates the overlong text:

$ printf "%s\033[10G\033[K-\n" "abcdefghijkl"
abcdefghi-
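
Since the escape sequences only make sense on a terminal, one option is to emit them only when standard output is one. The sketch below is an illustration rather than a recipe tested on every terminal: print_row is a hypothetical helper, and the tab fallback for non-terminal output is an arbitrary choice.

```bash
#!/bin/bash
# print_row TEXT COLUMN REST
# Hypothetical helper: prints TEXT, then REST starting at COLUMN (1-based).
print_row() {
    local text=$1 col=$2 rest=$3
    if [ -t 1 ]; then
        # Terminal: jump to the column (<ESC>[nG), then clear to end of line (<ESC>[K).
        printf '%s\033[%dG\033[K%s\n' "$text" "$col" "$rest"
    else
        # Pipe or file: no cursor control, so fall back to a tab separator.
        printf '%s\t%s\n' "$text" "$rest"
    fi
}

for x in "abc" "├─cd" "└──ef"; do
    print_row "$x" 10 "-"
done
```

Piping the script through cat or redirecting it to a file takes the fallback branch, so no escape sequences end up in the saved output.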

Another way would be to do the padding manually, assuming we have something that can determine the length of the string in characters. In Bash, ${#string} does that in a UTF-8 locale, at least for simple characters, though the result is a bit ugly. Zero-width and double-width characters will probably break it, and I didn't test combining characters either.

#!/bin/bash
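pad() {
    # parameters:
    #  1: name of variable to pad
    #  2: length to pad to, in characters
    local string=${!1}
    local len=${#string}      # characters, not bytes, in a UTF-8 locale
    local fill=$(($2 - len))
    ((fill < 0)) && fill=0    # a negative width would pad overlong strings
    printf -v "$1" '%s%*s' "$string" "$fill" ""
}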
echo "1234567890"
for x in "abc" "├─cd" "└──ef" ; do
    pad x 9
    printf "%s-\n" "$x"
done

And the output is:

1234567890
abc      -
├─cd     -
└──ef    -
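
A variation on the same idea keeps printf's own padding but widens the field by the difference between the byte count and the character count. This is only a sketch under the same assumptions as above (a UTF-8 locale, no zero-width, double-width or combining characters). lpad is a hypothetical name, and the function body runs in a subshell so that switching to the C locale cannot leak into the caller, at the cost of one fork per call.

```bash
#!/bin/bash
# lpad STRING WIDTH
# Hypothetical helper: left-justify STRING in a field of WIDTH characters.
lpad() (
    s=$1 width=$2
    chars=${#s}     # length in characters, using the caller's locale
    LC_ALL=C        # switch this subshell to the C locale
    bytes=${#s}     # length in bytes
    # In the C locale printf pads by bytes, so add the extra bytes to the width.
    printf '%-*s' "$((width + bytes - chars))" "$s"
)

for x in "0" "├─00" "└─00000"; do
    lpad "$x" 15
    printf '|\n'
done
```

If the string is already longer than the requested width, it is printed unchanged, since printf never truncates for a plain field width.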