Text Processing – Why is uniq Ignoring Unicode and Single Letter Lines?

Tags: locale, sort, text-processing, unicode, uniq

I'm trying to combine the American and British dictionaries into one large dictionary and remove all the duplicates from the superset, but it seems that uniq is not outputting words like "épée" or single-letter lines.

This is what I've tried using:

LC_COLLATE=en_US.UTF-8 cat american-english british-english | sort | uniq -u > unique_sorted_combined_dict

If I just do this:

LC_COLLATE=en_US.UTF-8 cat american-english british-english | sort > sorted_combined_dict

then "épée" and other such words do show up, along with the single letters.

Is there something I'm missing here with uniq?

I should note that I'm using uniq from GNU coreutils on Ubuntu 12.10, if that makes any difference.

Best Answer

You're setting LC_COLLATE for the cat command only (which doesn't make use of it), while you need to set it for sort and uniq.

Also, you may need to set LC_CTYPE to a UTF-8 locale, otherwise multi-byte characters may be mishandled. The simplest fix is to set LC_ALL to en_US.UTF-8, which overrides both LC_COLLATE and LC_CTYPE.
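As a minimal sketch (using the file names from your question), that would look like this, with the locale set on the commands that actually consult it:

```shell
# cat ignores the locale entirely; sort and uniq are the commands that
# use it for collation and character handling, so set LC_ALL on those.
cat american-english british-english \
  | LC_ALL=en_US.UTF-8 sort \
  | LC_ALL=en_US.UTF-8 uniq -u > unique_sorted_combined_dict
```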

But more importantly, uniq -u reports only lines that are not repeated at all. So if those single-letter words appear in both dictionaries, it is expected that they don't show up in the output.
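You can see the difference on a toy input (the input must already be sorted for uniq to work):

```shell
# uniq -u discards every line that occurs more than once;
# plain uniq keeps one copy of each line instead.
printf 'a\na\nb\n' | uniq -u    # prints only: b
printf 'a\na\nb\n' | uniq       # prints: a, b
```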

On my system, épée does appear twice:

$ cat american-english british-english | sort | grep -x 'épée'
épée
épée

Maybe you meant sort | uniq or sort -u.
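That collapses duplicates to a single copy instead of discarding them, so "épée" and the single letters survive. A one-line version (again using the file names from your question):

```shell
# sort -u sorts and keeps exactly one copy of every distinct line.
LC_ALL=en_US.UTF-8 sort -u american-english british-english > combined_dict
```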
