How to correctly display Polish diacritic symbols in groff

groff pdf unicode

I'm playing around with groff and wanted to generate a PDF from the following test.ms:

.TL
Tytuł
.AU
Imię Nazwisko
.NH
Wstęp
.PP
Pierwszy paragraf. Jakieś informacje, żeby były polskie znaki.
.PP
Drugi paragraf. Reszta znaków:

ąęćłńśóżźĄĘĆŁŃŚÓŻŹ
.NH
Bla bla bla
.PP
safsdsdfsasdds

As you can see, it contains Polish diacritics. Compiling it with groff -ms test.ms -T pdf > test.pdf produces this mess:
[Screenshot of the resulting PDF: "Horrible!"]

My first guess was to recompile with UTF-8 input support:

$ groff -Kutf8 -ms test.ms -T pdf > test.pdf
test.ms:4: warning: can't find special character `u0065_0328'
test.ms:8: warning: can't find special character `u0073_0301'
test.ms:8: warning: can't find special character `u00A0'
test.ms:8: warning: can't find special character `u007A_0307'
test.ms:12: warning: can't find special character `u0061_0328'
test.ms:12: warning: can't find special character `u006E_0301'
test.ms:12: warning: can't find special character `u007A_0301'
test.ms:12: warning: can't find special character `u0041_0328'
test.ms:12: warning: can't find special character `u0045_0328'
test.ms:12: warning: can't find special character `u004E_0301'
test.ms:12: warning: can't find special character `u0053_0301'
test.ms:12: warning: can't find special character `u005A_0307'
test.ms:12: warning: can't find special character `u005A_0301'

Groff just ignored most of the symbols, and the PDF looks like this:

[Screenshot of the resulting PDF: "Still awful."]

After a bit of googling I found this:

groff -Kutf8 -Tdvi -mec -ms test.ms > test.dvi
dvipdfm -cz 9 test.dvi

It still fails, although it's better, since only one character is skipped:

$ groff -Kutf8 -Tdvi -mec -ms test.ms > test.dvi
test.ms:8: warning: can't find special character `u00A0'

How can I get this to work?

EDIT: Here's the output of locale:

LANG=pl_PL.UTF-8
LANGUAGE=
LC_CTYPE="pl_PL.UTF-8"
LC_NUMERIC="pl_PL.UTF-8"
LC_TIME="pl_PL.UTF-8"
LC_COLLATE="pl_PL.UTF-8"
LC_MONETARY="pl_PL.UTF-8"
LC_MESSAGES="pl_PL.UTF-8"
LC_PAPER="pl_PL.UTF-8"
LC_NAME="pl_PL.UTF-8"
LC_ADDRESS="pl_PL.UTF-8"
LC_TELEPHONE="pl_PL.UTF-8"
LC_MEASUREMENT="pl_PL.UTF-8"
LC_IDENTIFICATION="pl_PL.UTF-8"
LC_ALL=

Best Answer

Character A0 (U+00A0) is a no-break space. It looks like it sits between "Jakieś" and "informacje". Use your editor to replace it with a normal space and you should be good to go.
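If hunting for the character by eye is tedious, a small script can locate and replace it. This is only a sketch: it assumes the file is saved as UTF-8, and the helper names find_nbsp, replace_nbsp and clean_file are made up for this example.

```python
NBSP = "\u00a0"  # NO-BREAK SPACE, the character groff reports as u00A0


def find_nbsp(text):
    """Return 1-based (line, column) pairs for every no-break space."""
    return [
        (line_no, col_no)
        for line_no, line in enumerate(text.split("\n"), start=1)
        for col_no, ch in enumerate(line, start=1)
        if ch == NBSP
    ]


def replace_nbsp(text):
    """Replace every no-break space with an ordinary space."""
    return text.replace(NBSP, " ")


def clean_file(path):
    """Report no-break spaces in a UTF-8 file, then rewrite it without them."""
    with open(path, encoding="utf-8") as f:
        source = f.read()
    for line_no, col_no in find_nbsp(source):
        print(f"{path}:{line_no}:{col_no}: no-break space")
    with open(path, "w", encoding="utf-8") as f:
        f.write(replace_nbsp(source))
```

Calling clean_file("test.ms") prints each position and overwrites the file in place, so keep a backup if the no-break spaces might be intentional elsewhere.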

Advice: I've set up my editors (emacs, vim) to highlight no-break spaces, because I sometimes type one unintentionally with AltGr+space when I hit space right after a character that requires AltGr.

The warnings from your first attempt seem to show that some characters (ę, ś, ż, ...) are encoded as a base letter plus a combining diacritic rather than as a single precomposed character. For example, ę appears as e (U+0065) followed by a combining ogonek (U+0328) rather than as "e with ogonek" (U+0119). How do you edit your source file? You can use a Compose key to produce precomposed letters with diacritics, e.g. Compose e , for "ę".
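To check whether the source file really contains combining diacritics, and to convert them to precomposed letters if so, something like the following may help. It is a sketch using Python's standard unicodedata module; the function names are invented for illustration, and it assumes the text has already been read as UTF-8.

```python
import unicodedata


def has_combining_marks(text):
    """True if the text contains any combining character (e.g. U+0328 ogonek)."""
    return any(unicodedata.combining(ch) for ch in text)


def to_precomposed(text):
    """Normalize to NFC, merging base letter + combining mark where possible."""
    return unicodedata.normalize("NFC", text)


# "e" followed by a combining ogonek becomes the single character U+0119.
decomposed = "e\u0328"
composed = to_precomposed(decomposed)
print(has_combining_marks(decomposed), has_combining_marks(composed))
```

If has_combining_marks returns False for the whole file, the letters are already precomposed and this step changes nothing.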
