Why are wc -m and wc -c different?

character encoding, locale, wc

As a C programmer, I was surprised to see that wc -c (which counts the number of bytes) and wc -m (which counts the number of characters) output very different results for a long text file of mine. I had always been told that sizeof(char) is 1 byte.

qdii@nomada ~/Documents $ wc -c sentences.csv
102990983 sentences.csv
qdii@nomada ~/Documents $ wc -m sentences.csv
89023123 sentences.csv

Any explanation?

Best Answer

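A char in C is one byte by definition, but a character of text is not necessarily one char. ASCII characters fit in a single byte, whereas variable-width encodings such as UTF-8 use between one and four bytes per character. wc uses the mbrtowc(3) function to decode multibyte sequences according to the locale's LC_CTYPE category, which is taken from the environment (LC_ALL, if set, overrides LC_CTYPE). Your file evidently contains multibyte characters, so it has fewer characters than bytes. In the C locale every byte counts as one character, so there wc -m reports the same number as wc -c. For example: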

qdii@nomada ~/Documents $ LC_CTYPE="C" wc -m sentences.csv
102990983 sentences.csv
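
To reproduce the effect without the original file, here is a minimal sketch on a short sample string. The sample text and variable names are made up for illustration, and the last command assumes that a locale named C.UTF-8 is installed, which is not the case on every system.

```shell
# "naïve café" plus a newline. ï and é each take two bytes in UTF-8; they are
# written as octal escapes so the script does not depend on its own encoding.
sample='na\303\257ve caf\303\251\n'

# Byte count: independent of the locale.
bytes=$(printf "$sample" | wc -c)

# Character count in the C locale: every byte is treated as one character.
chars_c=$(printf "$sample" | LC_ALL=C wc -m)

# Character count in a UTF-8 locale (assumption: C.UTF-8 exists on this system;
# if it does not, wc falls back to the C locale behaviour).
chars_utf8=$(printf "$sample" | LC_ALL=C.UTF-8 wc -m)

echo "bytes=$bytes  C locale=$chars_c  UTF-8 locale=$chars_utf8"
```

The first two numbers should agree, since the C locale counts bytes. Where the UTF-8 locale is available, the third should be smaller by two, one for each accented letter.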

Related Solutions

SSH – Working with Filenames in Different Encoding

Inside a terminal emulator that supports UTF-8, you can use the luit command to run a subshell (or other program) in a different locale. The locale setting that indicates character sets is LC_CTYPE.

LC_CTYPE=ru_RU.KOI8-R luit ls   # run one command
LC_CTYPE=ru_RU.KOI8-R luit      # start a shell (type Ctrl+D or exit to return to the parent shell)
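
To get a feel for the kind of translation involved, the following sketch converts a few KOI8-R bytes to UTF-8 with iconv. This is not what luit runs internally; iconv is only used here as a stand-in to show that the same text has a different byte representation in each encoding. It assumes an iconv that knows the KOI8-R charset.

```shell
# The Russian word "мир" in KOI8-R is the three bytes CD C9 D2,
# given here in octal so that printf stays portable.
koi8_bytes=$(printf '\315\311\322' | wc -c)

# After conversion, each Cyrillic letter occupies two bytes in UTF-8.
utf8_bytes=$(printf '\315\311\322' | iconv -f KOI8-R -t UTF-8 | wc -c)

echo "KOI8-R: $koi8_bytes bytes, UTF-8: $utf8_bytes bytes"
```

A UTF-8 terminal that received the raw KOI8-R bytes would show garbage, which is the situation luit is meant to handle.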

If you have a whole tree of files in a different encoding, I recommend (if possible) mounting it through convmvfs.

mkdir ~/net/ivan@example.com.KOI8-R ~/net/ivan@example.com.UTF-8
sshfs ivan@example.com: ~/net/ivan@example.com.KOI8-R
convmvfs -o srcdir=~/net/ivan@example.com.KOI8-R,icharset=KOI8-R,ocharset=UTF-8 ~/net/ivan@example.com.UTF-8
ls ~/net/ivan@example.com.UTF-8

Bash – Why is bash extended-globbing variable substitution acting at the byte level

The short answer to your question is that *(pattern-list) matches zero or more occurrences of the given patterns, so it also matches the empty string. There are zero occurrences of the character U+0001 between each pair of input bytes, and the replace operation replaces each of those empty matches with a single space.

Maybe you meant to do this:

$ shopt -s extglob   # +(...) is only recognised with extglob enabled
$ for str in $'\t' "ab" ळ; do
    printf -- '%s' "${str//+($'\x01')/ }" | xxd
  done
0000000: 09                                       .
0000000: 6162                                     ab
0000000: e0a4 b3                                  ...
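
The difference between the two operators can be seen in isolation with a sketch like the one below. It requires bash, and the variable names and the pattern character x are arbitrary choices for illustration.

```shell
shopt -s extglob   # enable the *(...) and +(...) pattern operators

str='ab'

# *(x) means "zero or more x", so it matches the empty string at each
# position and every such empty match is replaced by a space.
star="${str//*(x)/ }"

# +(x) means "one or more x"; the string contains no x, so nothing is replaced.
plus="${str//+(x)/ }"

printf 'star=[%s]\nplus=[%s]\n' "$star" "$plus"
```

The star variant comes back with spaces inserted, while the plus variant leaves the string untouched.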

But the longer answer is that in any case, pathnames aren't text. At least, they're not as far as the (Unix-like) operating system is concerned. They are byte sequences. The problem is that things like this are trivial to do:

$ LC_ALL=latin1
$ mkdir 'áñ' && cd 'áñ'
$ LC_ALL=ga_IE.iso885915@euro
$ mkdir '€25' && cd '€25'
$ LC_ALL=zh_TW
$ pwd
# ... what should the output be?  And what about the output of:
$ /bin/pwd

Each of those locales includes characters that do not exist in the others. This problem affects things like locate -r and find -regex too. The argument of locate -r is a regular expression, so it must support things like character classes, but you don't know which locale to use to classify the characters in the path names, or even whether any single locale can represent all the paths on the system.
