Shell – Is it possible to use split to make character chunks out of Chinese unicode bytes

shell, split, text-processing, unicode

For a while, I've been dealing with Chinese unicode text. Of course, the usual rules apply. I can grep for characters the same way I'd do so for words. This is very useful to me.

But there's one thing I haven't figured out yet. And I don't know if it's even possible.

It stands to reason that CJK would not be amenable to all kinds of splitting. But line splitting works, of course, using split -l.

What I want to do is be able to split an arbitrary number of characters, though.

My understanding of Chinese unicode is that every glyph is the same number of bytes in size. As such, there should be some magic number of bytes, a least common multiple, which would allow me to use split -b, right?

I used trial and error once, hoping to arrive at that number, but was not able to do so. Instead, the characters themselves were split apart, for example when splitting a CJK file in two.

For instance, given a file called 'dunting' which contains only the string 洞庭湖, using split ends up yielding what is essentially nonsense. One of the characters even becomes 溭 during the split.

Best Answer

In UTF-8, each of these characters is three bytes wide, as shown in this xxd output:

$ xxd chinese-bytes
0000000: e6b4 9ee5 baad e6b9 96                   .........
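The width can also be checked one character at a time. The following is only a sketch: byte_width is a hypothetical helper name, and it assumes the script is saved as UTF-8 so that the literals reach wc unchanged. It also shows that the width is not the same for every character, since an ASCII letter takes a single byte.

```shell
# byte_width is a hypothetical helper, not a standard tool.
# It prints the number of bytes its argument occupies
# (assuming this script itself is saved as UTF-8).
byte_width() {
    # wc -c counts bytes; tr strips the padding some wc versions add.
    printf '%s' "$1" | wc -c | tr -d ' '
}

# Three CJK characters from the question, plus one ASCII letter for contrast.
for ch in 洞 庭 湖 a; do
    printf '%s: %s byte(s)\n' "$ch" "$(byte_width "$ch")"
done
```

If any character in the file reports something other than 3, a fixed byte count passed to split -b will not line up with character boundaries.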

split -b3 works for me.

$ split -b3 chinese-bytes
$ echo xa?
xaa xab xac
$ cat xaa; echo
洞
$ cat xab; echo
庭
$ cat xac; echo
湖
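
To get chunks of more than one character, the same idea should extend to any multiple of three bytes. The sketch below assumes that every character in the file is exactly three bytes in UTF-8 and that there is no trailing newline, as in the xxd output above; split_chars and the chunk prefixes are made-up names for illustration.

```shell
# split_chars is a hypothetical helper, not a standard tool.
# Assumption: every character in the input is exactly 3 bytes in UTF-8
# (no ASCII, no trailing newline), so N characters are 3*N bytes.
split_chars() {
    file=$1      # input file
    n=$2         # characters per chunk
    prefix=$3    # output file prefix handed to split
    split -b "$((3 * n))" "$file" "$prefix"
}

# Work in a scratch directory so no existing files are overwritten.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Same content as in the answer: 9 bytes, no trailing newline.
printf '洞庭湖' > chinese-bytes

split_chars chinese-bytes 1 one.    # one.aa, one.ab, one.ac
split_chars chinese-bytes 2 two.    # two.aa (2 characters), two.ab (1 character)

cat two.aa; echo
cat two.ab; echo
```

Text that mixes in ASCII, punctuation, or newlines breaks the three-byte assumption, and the chunk boundaries would then fall inside characters again.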