Why doesn't Perl play nice with Unicode?

perlunicode

On my new Arch installation, perl doesn't seem to play nice with Unicode. For example, given this input file:

ελα ρε
王小红

This command should give me the last two characters of each line:

$ perl -CIO -pe 's/.*(..)$/$1/' file
Îµ
º¢

However, as you can see above, I get gibberish. The correct output is:

ρε
小红

I know that my terminal (gnome-terminator) supports UTF-8 since these both work as expected:

$ cat file
ελα ρε
王小红
$ perl -pe '' file
ελα ρε
王小红

Unfortunately, without -CIO, perl doesn't deal with the files correctly either:

$ perl -pe 's/.*(..)$/$1/' file
ε
��

It also shouldn't be a locale issue:

$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

I'm guessing I need to install some Perl packages, but I don't know which ones. Some relevant information:

$ perl --version | grep subversion
This is perl 5, version 22, subversion 0 (v5.22.0) built for x86_64-linux-thread-multi

$ pacman -Qs unicode
local/fribidi 0.19.7-1
    A Free Implementation of the Unicode Bidirectional Algorithm
local/icu 55.1-1
    International Components for Unicode library
local/libunistring 0.9.6-1
    Library for manipulating Unicode strings and C strings
local/perl 5.22.0-1 (base)
    A highly capable, feature-rich programming language
local/perl-unicode-stringprep 1.105-1
    Preparation of Internationalized Strings (RFC 3454)
local/perl-unicode-utf8simple 1.06-5
    Conversions to/from UTF8 from/to characterse
local/ttf-arphic-uming 0.2.20080216.1-5
    CJK Unicode font Ming style

How can I get my perl installation to play nice with Unicode?

Best Answer

The behaviour you describe is standard on the systems I tested. The I and O flags only affect STDIN and STDOUT, not files named on the command line, so this should work:

→ cat data | perl -CIO -pe 's/.*(..)$/$1/'
ρε
小红

Whereas this might not:

→ perl -CIO -pe 's/.*(..)$/$1/' data
Îµ
º¢

There are two more options to perl -C that produce your desired behaviour.

i     8   UTF-8 is the default PerlIO layer for input streams
o    16   UTF-8 is the default PerlIO layer for output streams

These tell perl to open files as if you had written:

open(F, "<:utf8", "data");

Alternatively, you can use perl -CSD, which is shorthand for perl -CIOEio:

S     7   I + O + E
D    24   i + o

Then you get

→ perl -CSD -pe 's/.*(..)$/$1/' data
ρε
小红

If the PERLIO environment variable is set and includes :utf8, this behaviour would also be enabled.

It looks like the default behaviour for perl isn't modifiable at configure/compile time either (see cuonglm's comment below). Arch certainly doesn't set anything, and I doubt the Debian perl packages modify the default behaviour.

Related Solutions

GNU Screen – Fixing Unicode Character Echo Issues

It's apparently a known bug: no characters beyond the BMP are displayed, as screen apparently only has a two-byte buffer for characters.

(It works in tmux).

XTerm doesn’t display some unicode characters properly

short: xterm uses a single font (except for the special cases of double-width characters), while the other terminals use additional fonts (and they use those fonts for the characters not found in your requested font).

long: the character you are interested in is not part of the font, which appears to be something like fonts-hack-tty in Debian. The missing code is 0x2937, which you can see using xfd -fa hack is not supplied by the font (hint: the first on the page is 0x2987):

The short description of the font gives its intended use:

No frills. No gimmicks. Hack is hand groomed and optically balanced to be a workhorse face for code.

which (since "code" generally is the POSIX character set, plus whatever people think makes good comments) is likely to be small. This example has more non-POSIX characters than usual. Starting with ASCII+Latin1, there are a few hundred glyphs in the font (another dozen screenshots would be needed to show these, though more than half show a small number of glyphs). The second page, for instance, is only partly supported.

Prompted by a comment, I traced gnome-terminal to see that it loads these font files:

/usr/share/fonts/truetype/ttf-bitstream-vera/VeraMono.ttf
/usr/share/fonts/truetype/ttf-bitstream-vera/VeraMoBd.ttf
/usr/share/fonts/truetype/ttf-bitstream-vera/VeraSeBd.ttf
/usr/share/fonts/truetype/ttf-dejavu/DejaVuSansMono.ttf
/usr/share/fonts/truetype/ttf-dejavu/DejaVuSerif.ttf

and that 0x2937 is supplied by the last one. The actual details may differ on your configuration.
