I'm trying to find duplicate lines in a file by using:
sort myfile | uniq -d
I noticed that uniq seems to dislike Japanese characters for some reason. For example, if I have a file:
あい
いあ
Then
sort myfile | uniq -d
Prints
あい
Why is this? Some kind of locale problem?
Edit: this question was marked as a duplicate. While the underlying problem (strcoll) is the same, this question is fundamentally different. Also, the accepted answer to that question isn't the same as the answer to this question, which is to change locale to C.
Best Answer
Yes, if the locale is en_US.utf8 (as one example), both strings seem equal:
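A minimal sketch of a command that exercises this case, assuming the `en_US.utf8` locale is installed (`locale -a` lists what is available). The variable name `out_en` is only illustrative, and the result depends on the collation data shipped with the C library, so the duplicate may not appear on every system.

```shell
# Feed the two lines from the question through sort and uniq -d under en_US.utf8.
# LC_ALL is set on both commands: a VAR=value prefix only affects the command it precedes.
out_en=$(printf 'あい\nいあ\n' | LC_ALL=en_US.utf8 sort | LC_ALL=en_US.utf8 uniq -d)

# On an affected system this prints the line reported in the question; otherwise it prints an empty line.
printf '%s\n' "$out_en"
```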
If, however, the locale is changed to ja_JP, everything seems to work correctly:
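The same check under a Japanese locale could look like the sketch below. The exact locale name `ja_JP.utf8` is an assumption (the UTF-8 variant has to be installed, which `locale -a | grep ja_JP` can confirm), and `out_ja` is an illustrative variable name.

```shell
# Same input, but sort and uniq now collate with the Japanese locale's rules.
out_ja=$(printf 'あい\nいあ\n' | LC_ALL=ja_JP.utf8 sort | LC_ALL=ja_JP.utf8 uniq -d)

# Report whether uniq -d flagged anything as a duplicate.
if [ -z "$out_ja" ]; then
    echo "no duplicates reported"
else
    echo "reported as duplicate: $out_ja"
fi
```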
It is interesting to note that (in this case) the C locale also works:
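A short sketch of the C-locale variant (the variable name `out_c` is illustrative). In the C locale, lines are compared byte by byte, so two lines made of the same characters in a different order can never compare equal.

```shell
# Byte-wise comparison: the two lines have different byte sequences, so uniq -d reports nothing.
out_c=$(printf 'あい\nいあ\n' | LC_ALL=C sort | LC_ALL=C uniq -d)
[ -z "$out_c" ] && echo "no duplicates reported"

# Applied to the file from the question, with LC_ALL exported for the whole pipeline:
# (export LC_ALL=C; sort myfile | uniq -d)
```

Setting `LC_ALL=C` only in front of `sort` is not enough, because `uniq` does its own comparison and would still run under the original locale.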
This suggests that the en_US collation data does not define an order for some code points, so strcoll treats the affected strings as equal and uniq reports them as duplicates.