Fold and text columns

character encoding, text processing

Can fold be set to recognize characters instead of bytes? Traditional Chinese characters appear to be encoded in three bytes each (in UTF-8 at least), which means that if fold's -w is not a multiple of three, then the following will occur:

$ cat in.txt
【財經中心、政治中心╱台北報導】看不慣政府施政效率緩慢,鴻海集團董事長郭台銘動念選總統!《壹週刊》報導,在川普勝選當晚,郭召集鴻海高層幹部,進行美國總統大選換人後的應變策略演練,讓人驚訝的是,郭詢問在場幹

$ cat in.txt | fold # -w is 80 by default
【財經中心、政治中心╱台北報導】看不慣政府施政效率緩��
�,鴻海集團董事長郭台銘動念選總統!《壹週刊》報導,在�
��普勝選當晚,郭召集鴻海高層幹部,進行美國總統大選換人
後的應變策略演練,讓人驚訝的是,郭詢問在場幹

fold's default width is 80 columns, which here yields 26 2/3 characters per line (26 * 3 + 2 = 80 bytes). So -w must be set to a multiple of three to avoid breaking characters; for fold, at least, columns = bytes. Again, my question is: can fold be set to honor multi-byte characters? The man page doesn't mention anything about this.
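The arithmetic above can be checked with a short Python sketch. The sample string is a hypothetical stand-in for in.txt; any run of three-byte UTF-8 characters should behave the same way.

```python
# Hypothetical sample text: 40 characters, each 3 bytes in UTF-8.
text = "財經中心" * 10
data = text.encode("utf-8")

# An 80-byte line holds 26 whole characters plus 2 stray bytes.
chars_per_line, leftover = divmod(80, 3)

# Cutting at byte 80 therefore ends in the middle of a character,
# so the slice is no longer valid UTF-8 on its own.
first_line = data[:80]
try:
    first_line.decode("utf-8")
    split_mid_character = False
except UnicodeDecodeError:
    split_mid_character = True

print(len(text), len(data), chars_per_line, leftover, split_mid_character)
```

The undecodable bytes at the cut are what a terminal shows as replacement characters at the line boundaries.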

Best Answer

GNU fold and GNU fmt only understand bytes, not characters. To wrap to a certain number of characters, you can use sed (the \n in the replacement is a GNU sed extension).

sed 's/.\{20\}/&\n/g' <in.txt
【財經中心、政治中心╱台北報導】看不慣政
府施政效率緩慢,鴻海集團董事長郭台銘動念
選總統!《壹週刊》報導,在川普勝選當晚,
郭召集鴻海高層幹部,進行美國總統大選換人
後的應變策略演練,讓人驚訝的是,郭詢問在
場幹
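For comparison, the same character-based wrapping can be sketched in Python, where a str is indexed by code point rather than by byte. The function name fold_chars is made up for this illustration, and the sketch assumes no combining sequences, which slicing by code point could split.

```python
def fold_chars(line, width=20):
    """Split line into pieces of at most `width` characters (code points)."""
    return [line[i:i + width] for i in range(0, len(line), width)]


# Short excerpt from the question's text: 11 characters, 33 bytes in UTF-8.
sample = "看不慣政府施政效率緩慢"
for piece in fold_chars(sample, width=4):
    print(piece)
```

Unlike the sed command, this does not emit an empty trailing line when the input length is an exact multiple of the width.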

If you wanted to break at whitespace (useful for many languages), here's a quick-and-dirty awk script.

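awk '
    BEGIN {width = 20}
    # A blank line ends the paragraph: finish the pending line, then emit the blank line.
    NF == 0 {if (column) print ""; column = 0; print; next}
    {
        n = split($0, a);
        # Iterate by index: "for (i in a)" does not guarantee word order.
        for (i = 1; i <= n; i++) {
            w = length(a[i]) + 1;
            column += w;
            if (column > width) {column = w; print ""};
            if (column != w) printf " ";
            printf "%s", a[i];
        }
    }
    END {if (column) print ""}'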

In any case, make sure that your locale settings are correct. Specifically, LC_CTYPE must designate the right character encoding, e.g. LC_CTYPE=en_US.utf8 or LC_CTYPE=zh_CN.utf8 (any language code that's available on your system will do) for Unicode encoded as UTF-8.

Note that this counts characters, not screen width. Even fixed-width fonts can have double-width characters and this is typically done for Chinese characters, so e.g. a character width of 20 for the text above occupies 40 columns on typical terminals.
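If wrapping by screen columns is what you want, one possible approach is to weight each character by its East Asian Width property. The following is only a sketch: the function names are made up here, and it assumes that wide (W) and fullwidth (F) characters take two columns and everything else takes one, which ignores zero-width combining marks and terminal-specific behavior.

```python
import unicodedata


def display_width(ch):
    # Assumption: W (wide) and F (fullwidth) occupy two terminal columns;
    # every other character is counted as one column.
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


def fold_columns(line, width=80):
    """Split line into rows that each fit within `width` display columns."""
    rows, current, used = [], "", 0
    for ch in line:
        w = display_width(ch)
        # Start a new row when this character would overflow the current one.
        if current and used + w > width:
            rows.append(current)
            current, used = "", 0
        current += ch
        used += w
    if current:
        rows.append(current)
    return rows


for row in fold_columns("郭召集鴻海高層幹部", width=8):
    print(row)
```

With this weighting, a width of 40 columns holds 20 Chinese characters per row, matching the 20-character sed output above.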
