Bash – Why does uniq think あい and いあ are the same

bash, sort, uniq

I'm trying to find duplicates in files, by using:

sort myfile | uniq -d

I noticed that uniq seems to dislike Japanese characters for some reason. For example, if I have a file:

あい
いあ

Then

sort myfile | uniq -d

Prints

あい

Why is this? Some kind of locale problem?

Edit: this question was marked as a duplicate. While the underlying problem (strcoll) is the same, this question is fundamentally different. Also, the accepted answer to that question isn't the same as the answer to this question, which is to change locale to C.

Best Answer

Yes. In the en_US.utf8 locale (as one example), the two strings compare as equal:

$ printf "%s\n" "いあ" "あい" "いあ" "あい"
いあ
あい
いあ
あい

$ LC_COLLATE=en_US.utf8 bash -c '
    printf "%s\n" "いあ" "あい" "いあ" "あい" |
    sort | 
    uniq '
いあ

If, however, the locale is changed to ja_JP.utf8, everything works correctly:

$ LC_COLLATE=ja_JP.utf8 bash -c '
    printf "%s\n" "いあ" "あい" "いあ" "あい" | 
    sort | 
    uniq '
あい
いあ

It is interesting to note that (in this case) the C locale also works:

$ LC_COLLATE=C bash -c '
    printf "%s\n" "いあ" "あい" "いあ" "あい" |
    sort |
    uniq '
あい
いあ

This suggests that the en_US collation data defines no ordering for these code points, so strcoll reports the two strings as equal and both sort and uniq treat them as the same line.
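For the original goal of finding duplicates, a minimal sketch is to force the C locale for both tools so that lines are compared byte by byte. The temporary file and its contents are made up for illustration and stand in for myfile:

```shell
# Assumption: a throwaway file stands in for "myfile".
tmpfile=$(mktemp)
printf '%s\n' 'あい' 'いあ' 'いあ' > "$tmpfile"

# LC_ALL overrides LC_COLLATE and LANG, so sort and uniq both compare raw bytes.
dups=$(LC_ALL=C sort "$tmpfile" | LC_ALL=C uniq -d)

# Only the line that really occurs twice is reported.
printf '%s\n' "$dups"

rm -f "$tmpfile"
```

The resulting order is byte order rather than a linguistically meaningful one, which is usually fine when the only goal is to detect identical lines.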

Related Solutions

Sort and Uniq in Awk – How to Use

To sort, you can also use a pipe inside an awk command:

awk '{ print ... | "sort ..." }'

With this syntax, every printed line is passed to the same instance of sort.

Of course, you can do the equivalent at the shell level:

awk '{ print ... }' | sort ...
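As a concrete illustration of both forms, here is a small sketch. The sample data and the choice to print the second field are assumptions made only for this example:

```shell
# Hypothetical sample data: "<number> <fruit>" per line.
data=$(printf '%s\n' '3 pear' '1 apple' '2 fig')

# Sort inside awk: every print goes to one shared "sort" process,
# whose output appears once awk exits and the pipe is closed.
inside=$(printf '%s\n' "$data" | awk '{ print $2 | "LC_ALL=C sort" }')

# Equivalent at the shell level.
outside=$(printf '%s\n' "$data" | awk '{ print $2 }' | LC_ALL=C sort)

printf '%s\n' "$inside"
```

Both variables should end up holding the same three lines in sorted order.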

Or you can use GNU awk, which has a couple of built-in sort functions (asort and asorti).

In awk, uniq is typically emulated by storing each unique data element or key in an associative array and checking whether a new line has already been recorded. One example to illustrate:

awk '!a[$0]++'

This means: if the current line is not yet in the array, the condition is true and the default action (printing the line) is triggered. At the same time, the post-increment records the line, so subsequent lines with the same data make the condition false and are not printed.
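Applied to sample strings like the ones in the question, the idiom can be sketched as follows. The array name seen and the second one-liner, which mimics uniq -d, are illustrative additions. As far as I know, awk array keys are matched as exact strings rather than through locale collation, so this should not depend on LC_COLLATE:

```shell
input=$(printf '%s\n' 'いあ' 'あい' 'いあ' 'あい')

# First occurrence of each line, in input order (no sorting needed).
unique=$(printf '%s\n' "$input" | awk '!seen[$0]++')

# Lines that occur more than once, reported once each at their second
# occurrence (similar in spirit to uniq -d).
dups=$(printf '%s\n' "$input" | awk 'seen[$0]++ == 1')

printf 'unique:\n%s\n' "$unique"
printf 'duplicated:\n%s\n' "$dups"
```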

Sort – Fix Unexpected Sort Order in en_US.UTF-8 Locale

Sorting is done in multiple passes. Each character has three (or sometimes more) weights assigned to it. Let's say for this example the weights are

         wt#1 wt#2 wt#3
space = [0000.0020.0002]
A     = [1BC2.0020.0008]

To create the sort key, the nonzero weights of the characters of a string are concatenated, one weight level at a time. That is, if a weight is zero, no corresponding weight is added (as can be seen at the beginning for " A"). So

       wt#1   -- wt#2 ---   -- wt#3 ---
" A" = 1BC2   0020   0020   0002   0008
       A      sp     A      sp     A

       wt#1   wt#2   wt#3
"A"  = 1BC2   0020   0008
       A      A      A

       wt#1   -- wt#2 ---   -- wt#3 ---
"A " = 1BC2   0020   0020   0008   0002
       A      A      sp     A      sp

If you sort these arrays you get the order you see:

       1BC2   0020   0008               => "A"
       1BC2   0020   0020   0002   0008 => " A"
       1BC2   0020   0020   0008   0002 => "A "

This is a simplification of what actually happens; see the Unicode Collation Algorithm for more details. The above example weights are actually from the standard table, with some details omitted.
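The key construction described above can be sketched with awk. This is only a toy model using the two example weight entries from the table, not the real collation implementation. The array name w and the bracketed display of each string are choices made for this example:

```shell
sorted=$(printf '%s\n' 'A ' ' A' 'A' |
awk '
BEGIN {
    # Example weights copied from the table above (only two characters).
    w[" ", 1] = "0000"; w[" ", 2] = "0020"; w[" ", 3] = "0002"
    w["A", 1] = "1BC2"; w["A", 2] = "0020"; w["A", 3] = "0008"
}
{
    key = ""
    # One weight level at a time, left to right through the string.
    for (level = 1; level <= 3; level++)
        for (i = 1; i <= length($0); i++) {
            wt = w[substr($0, i, 1), level]
            if (wt != "0000") key = key " " wt   # zero weights are skipped
        }
    # Emit the key, a tab, then the original string in brackets.
    print key "\t[" $0 "]"
}' | LC_ALL=C sort)

printf '%s\n' "$sorted"
```

Sorting the generated keys bytewise should reproduce the order shown above: "A", then " A", then "A ".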