What’s the difference between `-C` and `-c` in the `tr` command?

command-line tr

Today I learned a bit about the tr command.

But I got stuck trying to understand the difference between -c and -C.

The manual says:

 -C      Complement the set of characters in string1, that is ``-C ab'' includes every character except for `a' and `b'.

 -c      Same as -C but complement the set of values in string1.

I don't quite understand what "set of values in string1" means for the -c option.
I thought it might treat string1 ("ab") as a whole and skip a lone a or b.
So I ran an experiment:

⇒  echo "ab_a_b" | tr -C 'ba' 'c'
abcacbc%
⇒  echo "ab_a_b" | tr -c 'ba' 'c'
abcacbc%

The results didn't match my expectations!
So, what's the difference between -C and -c in the tr command?


Software version: BSD 2004 on OS X 10.10

Best Answer

The POSIX manual says this:

  • If the -C option is specified, the complements of the characters specified by string1 (the set of all characters in the current character set, as defined by the current setting of LC_CTYPE, except for those actually specified in the string1 operand) shall be placed in the array in ascending collation sequence, as defined by the current setting of LC_COLLATE.

  • If the -c option is specified, the complement of the values specified by string1 shall be placed in the array in ascending order by binary value.

and contains the following note:

The ISO POSIX-2:1993 standard had a -c option that behaved similarly to the -C option, but did not supply functionality equivalent to the -c option specified in POSIX.1-2008. This meant that historical practice of being able to specify tr -cd\000-\177 (which would delete all bytes with the top bit set) would have no effect because, in the C locale, bytes with the values octal 200 to octal 377 are not characters.

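From this it appears that -C complements the set of characters defined by the current locale, while -c complements the set of binary values in string1, which includes byte values that are not valid characters in that locale (such as octal 200 to 377 in the C locale). For plain ASCII input like the one in your experiment, both options select the same set, which is why the two outputs are identical.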
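
The historical usage quoted in the note can be tried directly. The following is a minimal sketch, not a definitive test: it assumes a POSIX-style tr and forces the C locale with LC_ALL=C (an assumption, not part of the question's setup) so that every byte is handled on its own. The sample input and the variable names ascii_only and replaced are made up for illustration.

```shell
# Assumption: POSIX-style tr; LC_ALL=C makes tr work on single bytes.
# The input is "caf" followed by the two bytes octal 303 251 (a UTF-8 "é").
# -c complements the byte values in string1, so together with -d this
# deletes every byte outside octal 000-177, i.e. every byte with the top bit set.
ascii_only=$(printf 'caf\303\251\n' | LC_ALL=C tr -cd '\000-\177')
printf '%s\n' "$ascii_only"    # prints: caf

# For plain ASCII input like the one in the question, the complement of
# {a, b, newline} covers only the underscores, so they are the only
# characters replaced by c.
replaced=$(printf 'ab_a_b\n' | LC_ALL=C tr -c 'ab\n' 'c')
printf '%s\n' "$replaced"      # prints: abcacb
```

In the question's experiment the newline was not part of string1, so it was replaced by c as well, which is why the output ends in an extra c and the shell prints % to mark the missing final newline.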