For a while, I've been dealing with Chinese Unicode text. Of course, the usual rules apply: I can grep for characters the same way I'd grep for words, which is very useful to me.
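As a concrete sketch of that (using the string 洞庭湖 and the file name 'dunting' from the example later in this question):

```shell
# grep works on UTF-8 bytes, so a single CJK character is a valid pattern
printf '洞庭湖\n' > dunting
grep 庭 dunting      # prints the matching line: 洞庭湖
grep -c 湖 dunting   # counts matching lines: 1
```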
But there's one thing I haven't figured out yet. And I don't know if it's even possible.
It stands to reason that CJK would not be amenable to all kinds of splitting. But line splitting works, of course, using split -l.
What I want to do, though, is split after an arbitrary number of characters.
My understanding of Chinese Unicode is that every glyph is the same number of bytes in size. As such, there should be some magic number of bytes, a least common multiple, which would allow me to use split -b, right?
I tried to arrive at that number by trial and error, but wasn't able to. Instead, the characters themselves were split, so that cutting a CJK file in two produced broken output.
For instance, given a file called 'dunting' which contains only the string 洞庭湖, using split ends up yielding what is essentially nonsense. One of the characters even becomes 溭 during the split…
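For example, picking a hypothetical byte count like 2 (any count that isn't a multiple of 3 behaves similarly) shows how the cut lands in the middle of a character:

```shell
printf '洞庭湖' > dunting
split -b2 dunting bad_   # 2 bytes per piece: every cut lands mid-character
xxd -p bad_aa            # e6b4 -- only two of 洞's three UTF-8 bytes
xxd -p bad_ab            # 9ee5 -- the tail of 洞 glued to the head of 庭
```

Each piece on its own is invalid UTF-8, and rejoining pieces at the wrong boundaries can form entirely different characters.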
Best Answer
Each character is three bytes wide, as an xxd dump of the file shows, so split -b3 works for me.
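To reconstruct what that looks like (assuming the file is the 'dunting' file from the question, holding the UTF-8 string 洞庭湖):

```shell
printf '洞庭湖' > dunting

# Plain hex dump: three characters, each exactly 3 bytes in UTF-8
xxd -p dunting           # e6b49ee5baade6b996

# A 3-byte split therefore puts one whole character in each output file
split -b3 dunting part_
cat part_aa              # 洞
cat part_ab              # 庭
cat part_ac              # 湖
```

Note that 3 bytes per character holds for these (and most common CJK) characters in UTF-8; it isn't a property of every Unicode character.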