Bash – Why does uniq think あい and いあ are the same

bash sort uniq

I'm trying to find duplicates in files, by using:

sort myfile | uniq -d

I noticed that uniq seems to dislike Japanese characters for some reason. For example, if I have a file:

あい
いあ

Then

sort myfile | uniq -d 

Prints

あい

Why is this? Some kind of locale problem?

Edit: this question was marked as a duplicate. While the underlying problem (strcoll) is the same, this question is fundamentally different. Also, the accepted answer to that question isn't the same as the answer to this question, which is to change locale to C.

Best Answer

Yes. If the locale is en_US.utf8 (as one example), the two strings compare as equal, so uniq treats them as duplicates:

$ printf "%s\n" "いあ" "あい" "いあ" "あい"
いあ
あい
いあ
あい

$ LC_COLLATE=en_US.utf8 bash -c '
    printf "%s\n" "いあ" "あい" "いあ" "あい" |
    sort | 
    uniq '
いあ

If, however, the locale is changed to ja_JP.utf8, everything works as expected:

$ LC_COLLATE=ja_JP.utf8 bash -c '
    printf "%s\n" "いあ" "あい" "いあ" "あい" | 
    sort | 
    uniq '
あい
いあ

It is interesting to note that (in this case) the C locale also works:

$ LC_COLLATE=C bash -c '
    printf "%s\n" "いあ" "あい" "いあ" "あい" |
    sort |
    uniq '
あい
いあ

That suggests that en_US is missing the collation order for some code points, so these two strings end up comparing as equal under it.
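Applied to the original duplicate-finding pipeline, a minimal sketch could look like the following. It uses LC_ALL rather than LC_COLLATE, because LC_ALL (when set) takes precedence over every other LC_* variable, so the result does not depend on what the calling environment already exports. The temporary file and the variable names tmpfile and dups are only there for illustration; substitute your own file.

```shell
# Sample data: the two strings from the question plus one genuine duplicate.
# (tmpfile is an illustrative stand-in for "myfile".)
tmpfile=$(mktemp)
printf '%s\n' 'あい' 'いあ' 'あい' > "$tmpfile"

# Force byte-wise comparison in both tools. LC_ALL overrides LC_COLLATE,
# so this works even if the environment already sets LC_ALL to en_US.utf8.
dups=$(LC_ALL=C sort "$tmpfile" | LC_ALL=C uniq -d)

# Only the line that really occurs twice is reported.
printf '%s\n' "$dups"

rm -f "$tmpfile"
```

With this input the pipeline prints only あい, since いあ occurs once. The trade-off is that the C locale orders lines by raw byte value, which is fine for detecting duplicates but is not a linguistically meaningful sort order.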