The file command makes "best guesses" about the encoding. Use the -i option to make file print MIME-type information, including the charset.
Demonstration:
$ file -i *
umlaut-iso88591.txt: text/plain; charset=iso-8859-1
umlaut-utf16.txt: text/plain; charset=utf-16le
umlaut-utf8.txt: text/plain; charset=utf-8
Here is how I created the files:
$ echo ä > umlaut-utf8.txt
Nowadays almost everything is UTF-8, but convince yourself:
$ hexdump -C umlaut-utf8.txt
00000000 c3 a4 0a |...|
00000003
Compare with https://en.wikipedia.org/wiki/Ä#Computer_encoding
Convert to the other encodings:
$ iconv -f utf8 -t iso88591 umlaut-utf8.txt > umlaut-iso88591.txt
$ iconv -f utf8 -t utf16 umlaut-utf8.txt > umlaut-utf16.txt
Check the hex dump:
$ hexdump -C umlaut-iso88591.txt
00000000 e4 0a |..|
00000002
$ hexdump -C umlaut-utf16.txt
00000000 ff fe e4 00 0a 00 |......|
00000006
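The leading ff fe bytes in the UTF-16 file are a byte-order mark (BOM). As an aside: with GNU iconv, requesting an explicitly little-endian target such as UTF-16LE typically omits the BOM, which you can verify the same way (other iconv implementations may differ):

```shell
# Convert to explicitly little-endian UTF-16; GNU iconv then writes
# no byte-order mark, so only the character and the newline remain.
# \303\244 is the UTF-8 encoding of "ä", written as octal escapes.
printf '\303\244\n' | iconv -f UTF-8 -t UTF-16LE | hexdump -C
# On GNU iconv: e4 00 0a 00 (no ff fe prefix)
```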
Create something "invalid" by mixing all three:
$ cat umlaut-iso88591.txt umlaut-utf8.txt umlaut-utf16.txt > umlaut-mixed.txt
What file says:
$ file -i *
umlaut-iso88591.txt: text/plain; charset=iso-8859-1
umlaut-mixed.txt: application/octet-stream; charset=binary
umlaut-utf16.txt: text/plain; charset=utf-16le
umlaut-utf8.txt: text/plain; charset=utf-8
Without -i:
$ file *
umlaut-iso88591.txt: ISO-8859 text
umlaut-mixed.txt: data
umlaut-utf16.txt: Little-endian UTF-16 Unicode text, with no line terminators
umlaut-utf8.txt: UTF-8 Unicode text
The file command has no notion of "valid" or "invalid". It just sees some bytes and tries to guess what the encoding might be. As humans we may be able to recognize that a file is a text file with some umlauts in a "wrong" encoding, but a computer would need some sort of artificial intelligence for that.
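While file can only guess, iconv can act as an actual validity check: it exits with an error on input that is not valid in the stated source encoding. A small sketch (it recreates minimal versions of the files from above, using octal escapes so it doesn't depend on this snippet's own encoding):

```shell
# Recreate the sample files from above.
printf '\303\244\n' > umlaut-utf8.txt        # "ä" in UTF-8
printf '\344\n'     > umlaut-iso88591.txt    # "ä" in ISO-8859-1
cat umlaut-iso88591.txt umlaut-utf8.txt > umlaut-mixed.txt

# iconv fails on input that is invalid in the source encoding,
# which is the validity check that file cannot provide.
iconv -f UTF-8 -t UTF-8 umlaut-utf8.txt  > /dev/null && echo 'valid UTF-8'
iconv -f UTF-8 -t UTF-8 umlaut-mixed.txt > /dev/null 2>&1 || echo 'not valid UTF-8'
```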
One might argue that the heuristics of file are a sort of artificial intelligence. Even if they are, it is a very limited one.
Here is more information about the file command: http://www.linfo.org/file_command.html
You haven't said which cut you're using, but since you've mentioned the GNU long option --characters I'll assume it's GNU cut. In that case, note this passage from info coreutils 'cut invocation':
‘-c character-list’
‘--characters=character-list’
Select for printing only the characters in positions listed in character-list. The same as -b for now, but internationalization will change that.
(emphasis added)
For the moment, GNU cut always works in terms of single-byte "characters", so the behaviour you see is expected.
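You can observe the byte-oriented behaviour directly. Assuming GNU cut, asking for the "first character" of a two-byte UTF-8 ä returns only its first byte:

```shell
# ä is c3 a4 in UTF-8; GNU cut -c1 currently selects the first
# *byte*, so the output is the lone byte c3 plus the newline.
printf '\303\244\n' | cut -c1 | od -An -tx1
# On GNU cut: c3 0a
```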
Supporting both the -b and -c options is required by POSIX. They weren't added to GNU cut because it had multi-byte support and they worked properly, but to avoid giving errors on POSIX-compliant invocations. The same has been done in some other cut implementations, although not in FreeBSD's or OS X's, at least.
This is the historic behaviour of -c; -b was newly added to take over the byte role so that -c could work with multi-byte characters. Maybe in a few years it will work as desired consistently, although progress hasn't exactly been quick (it's been over a decade already). GNU cut doesn't even implement the -n option yet, even though it is orthogonal and intended to help the transition. Potential compatibility problems with old scripts may be a concern, although I don't know definitively what the reason is.
GNU fold and GNU fmt only understand bytes, not characters. To wrap to a certain number of characters, you can use sed.
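For instance, a sketch with GNU sed (the \n in the replacement is a GNU extension, and in a UTF-8 locale sed's . matches whole characters rather than bytes; the width 20 and the sample input are arbitrary):

```shell
# Insert a newline after every run of 20 characters.
printf 'aaaaaaaaaabbbbbbbbbbccccccccccdddddddddd\n' |
  sed 's/.\{20\}/&\n/g'
```

One rough edge: a line whose length is an exact multiple of 20 gains a trailing empty line, since the final inserted newline is followed by the line's own newline.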
If you wanted to break at whitespace (useful for many languages), here's a quick-and-dirty awk script.
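Something along these lines might serve (a hypothetical sketch of a greedy word wrap; gawk's length() counts characters rather than bytes in a UTF-8 locale, and width=15 and the sample sentence are arbitrary):

```shell
# Accumulate words until adding the next one would exceed the
# width, then emit the line and start a new one.
printf 'the quick brown fox jumps over the lazy dog\n' |
  awk -v width=15 '{
    line = ""
    for (i = 1; i <= NF; i++) {
      if (line == "")
        line = $i
      else if (length(line " " $i) <= width)
        line = line " " $i
      else {
        print line
        line = $i
      }
    }
    print line
  }'
# → the quick brown
#   fox jumps over
#   the lazy dog
```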
In any case, make sure that your locale settings are correct. Specifically, LC_CTYPE must designate the right character encoding, e.g. LC_CTYPE=en_US.utf8 or LC_CTYPE=zh_CN.utf8 (any language code that's available on your system will do) for Unicode encoded as UTF-8.
Note that this counts characters, not screen width. Even fixed-width fonts can have double-width characters, and this is typically the case for Chinese characters, so e.g. a character width of 20 for the text above would occupy 40 columns on typical terminals.