The file command makes "best guesses" about the encoding. Use the -i option to make file print MIME-type information, including the charset.
Demonstration:
$ file -i *
umlaut-iso88591.txt: text/plain; charset=iso-8859-1
umlaut-utf16.txt: text/plain; charset=utf-16le
umlaut-utf8.txt: text/plain; charset=utf-8
Here is how I created the files:
$ echo ä > umlaut-utf8.txt
Nowadays almost everything is UTF-8, but convince yourself:
$ hexdump -C umlaut-utf8.txt
00000000 c3 a4 0a |...|
00000003
Compare with https://en.wikipedia.org/wiki/Ä#Computer_encoding
Convert to the other encodings:
$ iconv -f utf8 -t iso88591 umlaut-utf8.txt > umlaut-iso88591.txt
$ iconv -f utf8 -t utf16 umlaut-utf8.txt > umlaut-utf16.txt
Check the hex dump:
$ hexdump -C umlaut-iso88591.txt
00000000 e4 0a |..|
00000002
$ hexdump -C umlaut-utf16.txt
00000000 ff fe e4 00 0a 00 |......|
00000006
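The leading ff fe bytes in the UTF-16 file are a byte-order mark (BOM). As an aside: with GNU iconv, requesting an explicitly little-endian target such as UTF-16LE typically omits the BOM, which you can verify the same way (other iconv implementations may differ):

```shell
# Convert to explicitly little-endian UTF-16; GNU iconv then writes
# no byte-order mark, so only the character and the newline remain.
# \303\244 is the UTF-8 encoding of "ä", written as octal escapes.
printf '\303\244\n' | iconv -f UTF-8 -t UTF-16LE | hexdump -C
# On GNU iconv: e4 00 0a 00 (no ff fe prefix)
```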
Create something "invalid" by mixing all three:
$ cat umlaut-iso88591.txt umlaut-utf8.txt umlaut-utf16.txt > umlaut-mixed.txt
What file says:
$ file -i *
umlaut-iso88591.txt: text/plain; charset=iso-8859-1
umlaut-mixed.txt: application/octet-stream; charset=binary
umlaut-utf16.txt: text/plain; charset=utf-16le
umlaut-utf8.txt: text/plain; charset=utf-8
Without -i:
$ file *
umlaut-iso88591.txt: ISO-8859 text
umlaut-mixed.txt: data
umlaut-utf16.txt: Little-endian UTF-16 Unicode text, with no line terminators
umlaut-utf8.txt: UTF-8 Unicode text
The file command has no notion of "valid" or "invalid". It just sees some bytes and tries to guess what the encoding might be. As humans we may be able to recognize that a file is a text file with some umlauts in a "wrong" encoding, but a computer would need some sort of artificial intelligence for that.
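While file can only guess, iconv can act as an actual validity check: it exits with an error on input that is not valid in the stated source encoding. A small sketch (it recreates minimal versions of the files from above, using octal escapes so it doesn't depend on this snippet's own encoding):

```shell
# Recreate the sample files from above.
printf '\303\244\n' > umlaut-utf8.txt        # "ä" in UTF-8
printf '\344\n'     > umlaut-iso88591.txt    # "ä" in ISO-8859-1
cat umlaut-iso88591.txt umlaut-utf8.txt > umlaut-mixed.txt

# iconv fails on input that is invalid in the source encoding,
# which is the validity check that file cannot provide.
iconv -f UTF-8 -t UTF-8 umlaut-utf8.txt  > /dev/null && echo 'valid UTF-8'
iconv -f UTF-8 -t UTF-8 umlaut-mixed.txt > /dev/null 2>&1 || echo 'not valid UTF-8'
```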
One might argue that the heuristics of file are a sort of artificial intelligence. Even if they are, it is a very limited one.
Here is more information about the file command: http://www.linfo.org/file_command.html
You haven't said which cut you're using, but since you've mentioned the GNU long option --characters I'll assume it's GNU cut. In that case, note this passage from info coreutils 'cut invocation':
‘-c character-list’
‘--characters=character-list’
Select for printing only the characters in positions listed in character-list. The same as -b for now, but internationalization will change that.
(emphasis added)
For the moment, GNU cut always works in terms of single-byte "characters", so the behaviour you see is expected.
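You can observe the byte-oriented behaviour directly. Assuming GNU cut, asking for the "first character" of a two-byte UTF-8 ä returns only its first byte:

```shell
# ä is c3 a4 in UTF-8; GNU cut -c1 currently selects the first
# *byte*, so the output is the lone byte c3 plus the newline.
printf '\303\244\n' | cut -c1 | od -An -tx1
# On GNU cut: c3 0a
```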
Supporting both the -b and -c options is required by POSIX. They weren't added to GNU cut because it had multi-byte support and they worked properly, but to avoid giving errors on POSIX-compliant invocations. The same has been done in some other cut implementations, although not in FreeBSD's or OS X's, at least.
This is the historic behaviour of -c; -b was newly added to take over the byte role so that -c could work with multi-byte characters. Maybe in a few years it will work as desired consistently, although progress hasn't exactly been quick (it's been over a decade already). GNU cut doesn't even implement the -n option yet, even though it is orthogonal and intended to help the transition. Potential compatibility problems with old scripts may be a concern, although I don't know definitively what the reason is.
GNU fold and GNU fmt only understand bytes, not characters. To wrap to a certain number of characters, you can use sed.
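For instance, a sketch with GNU sed (the \n in the replacement is a GNU extension, and in a UTF-8 locale sed's . matches whole characters rather than bytes; the width 20 and the sample input are arbitrary):

```shell
# Insert a newline after every run of 20 characters.
printf 'aaaaaaaaaabbbbbbbbbbccccccccccdddddddddd\n' |
  sed 's/.\{20\}/&\n/g'
```

One rough edge: a line whose length is an exact multiple of 20 gains a trailing empty line, since the final inserted newline is followed by the line's own newline.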
If you wanted to break at whitespace (useful for many languages), here's a quick-and-dirty awk script.
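Something along these lines might serve (a hypothetical sketch of a greedy word wrap; gawk's length() counts characters rather than bytes in a UTF-8 locale, and width=15 and the sample sentence are arbitrary):

```shell
# Accumulate words until adding the next one would exceed the
# width, then emit the line and start a new one.
printf 'the quick brown fox jumps over the lazy dog\n' |
  awk -v width=15 '{
    line = ""
    for (i = 1; i <= NF; i++) {
      if (line == "")
        line = $i
      else if (length(line " " $i) <= width)
        line = line " " $i
      else {
        print line
        line = $i
      }
    }
    print line
  }'
# → the quick brown
#   fox jumps over
#   the lazy dog
```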
In any case, make sure that your locale settings are correct. Specifically, LC_CTYPE must designate the right character encoding, e.g. LC_CTYPE=en_US.utf8 or LC_CTYPE=zh_CN.utf8 (any language code that's available on your system will do) for Unicode encoded as UTF-8.
Note that this counts characters, not screen width. Even fixed-width fonts can have double-width characters, and this is typically the case for Chinese characters, so e.g. a character width of 20 for the text above would occupy 40 columns on typical terminals.