Why are wc -m and wc -c different?

character-encoding locale wc

As a C programmer, I was surprised to see that wc -c (which counts the number of bytes) and wc -m (which counts the number of characters) output very different results for a long text file of mine. I had always been told that sizeof(char) is 1.

qdii@nomada ~/Documents $ wc -c sentences.csv
102990983 sentences.csv
qdii@nomada ~/Documents $ wc -m sentences.csv
89023123 sentences.csv

Any explanation?

Best Answer

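The char type in C is always one byte, but one byte is only enough to hold one character in single-byte encodings such as ASCII. Variable-width encodings like UTF-8 use between one and four bytes per character, so a file containing non-ASCII text has more bytes than characters. wc uses the mbrtowc(3) function to decode multibyte sequences according to the current locale, which is taken from the LC_CTYPE environment variable (LC_ALL, if set, overrides it, and LANG is the fallback). In the C locale every byte is treated as one character, so wc -m reports the same number as wc -c. For example: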

qdii@nomada ~/Documents $ LC_CTYPE="C" wc -m sentences.csv
102990983 sentences.csv
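
To see where the extra bytes come from, here is a minimal sketch in C that counts characters in a UTF-8 buffer. It is not how wc works (wc decodes with mbrtowc(3) under the current locale); it relies on the simplifying assumption that the input is valid UTF-8, where every byte of the form 10xxxxxx is a continuation byte and every other byte starts a new character. The function name utf8_count_chars is made up for this illustration.

```c
#include <stddef.h>

/* Count the characters (code points) in the first len bytes of s.
 * Assumption: s holds valid UTF-8. Continuation bytes match the bit
 * pattern 10xxxxxx, so any byte that does not match starts a new
 * character. Invalid input is not detected. */
size_t utf8_count_chars(const char *s, size_t len)
{
    size_t count = 0;

    for (size_t i = 0; i < len; i++) {
        unsigned char b = (unsigned char)s[i];

        /* Keep the top two bits; 0x80 (binary 10) marks a continuation byte. */
        if ((b & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For the six-byte string "h\xc3\xa9llo" (the word "héllo" encoded as UTF-8), strlen returns 6 while utf8_count_chars returns 5, because é occupies two bytes. If sentences.csv is valid UTF-8, the gap of 13967860 between the two wc outputs in the question would be the number of continuation bytes in the file.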