What’s the difference between `-C` and `-c` in the `tr` command?

Today I learned a bit about the tr command.

But I got stuck trying to understand the difference between -c and -C.

The manual says:

 -C      Complement the set of characters in string1, that is ``-C ab'' includes every character except for `a' and `b'.

 -c      Same as -C but complement the set of values in string1.

I don't quite understand what "set of values in string1" means for the -c option.
I thought it might treat string1 "ab" as a whole, so that a single a or b would not be matched.
So I ran an experiment:

⇒  echo "ab_a_b" | tr -C 'ba' 'c'
abcacbc%
⇒  echo "ab_a_b" | tr -c 'ba' 'c'
abcacbc%

The results didn't match my expectation: both commands print the same thing.
So, what's the difference between -C and -c in the tr command?

Software version: BSD 2004 on OS X 10.10

Best Answer

The POSIX manual says this:

If the -C option is specified, the complements of the characters specified by string1 (the set of all characters in the current character set, as defined by the current setting of LC_CTYPE, except for those actually specified in the string1 operand) shall be placed in the array in ascending collation sequence, as defined by the current setting of LC_COLLATE.

If the -c option is specified, the complement of the values specified by string1 shall be placed in the array in ascending order by binary value.

and contains the following note:

The ISO POSIX-2:1993 standard had a -c option that behaved similarly to the -C option, but did not supply functionality equivalent to the -c option specified in POSIX.1-2008. This meant that historical practice of being able to specify tr -cd\000-\177 (which would delete all bytes with the top bit set) would have no effect because, in the C locale, bytes with the values octal 200 to octal 377 are not characters.

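From this it appears that -c builds the complement from binary values rather than from characters: the complement can include values that are not characters in the current locale (such as the bytes octal 200 to octal 377 in the C locale), and it is ordered by binary value rather than by collation sequence. For plain ASCII input such as the experiment in the question, both options give the same result.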
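
As a rough illustration of the historical usage quoted in the note, the sketch below strips every byte with the top bit set from a short string. It assumes a tr that accepts octal escapes and ranges in string1; the variable name ascii_only is only for illustration. GNU tr treats -c and -C as the same option, so any difference between them would only show up on an implementation that follows the POSIX wording.

```shell
# Input: "caf" followed by the two bytes of UTF-8 "é" (octal 303 251).
# With LC_ALL=C, tr works byte by byte. -c complements the byte values
# 000-177, so -d deletes every byte whose top bit is set.
ascii_only=$(printf 'caf\303\251' | LC_ALL=C tr -cd '\000-\177')

# Only the three ASCII bytes remain.
printf '%s\n' "$ascii_only"
```

Setting LC_ALL=C here is a deliberate choice: it keeps the example about byte values and avoids depending on how a particular tr handles multibyte characters.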

Related Solutions

shell – Difference Between $(stuff) and `stuff`

The old-style backquotes ` ` treat backslashes and nesting a bit differently. The new-style $() interprets everything between ( ) as a command.

echo $(uname | $(echo cat))
Linux

echo `uname | `echo cat``
bash: command substitution: line 2: syntax error: unexpected end of file
echo cat

It works if the nested backquotes are escaped:

echo `uname | \`echo cat\``
Linux

Backslash fun:

echo $(echo '\\')
\\

echo `echo '\\'`
\

The new-style $() applies to all POSIX-conformant shells.
As mouviciel pointed out, old-style ` ` might be necessary for older shells.
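
To make the nesting difference concrete, here is a small sketch with three levels of command substitution in each style. The variable names nested_dollar and nested_backquote are made up for this example; it should behave the same in any POSIX-conformant shell.

```shell
# Three levels with $(): no escaping is needed at any depth.
nested_dollar=$(echo outer $(echo middle $(echo inner)))

# The same nesting with backquotes: the second level needs \` and the
# third level needs \\\`, because each level strips one layer of backslashes.
nested_backquote=`echo outer \`echo middle \\\`echo inner\\\`\``

# Both variables hold the same string.
printf '%s\n' "$nested_dollar" "$nested_backquote"
```

The number of backslashes grows with every additional level of backquotes, which is one practical reason to prefer $() once nesting is involved.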

Apart from the technical point of view, the old-style ` ` also has visual disadvantages:

Hard to notice: I like $(program) better than `program`
Easily confused with a single quote: '`'`''`''`'`''`'
Not so easy to type (maybe not even on the standard layout of the keyboard)

(And SE uses ` ` for its own purposes; it was a pain writing this answer :)

Is the historical Unix V5 tr command padding behavior of set2 different from what we consider today “classic” System V (1983-1988) behavior?

The difference is only in the wording of the padding behavior in the V4-V5 manual; the behavior is the same throughout. As it stands, the results of the V5 implementation are identical to those of the System V one, which are themselves identical to GNU tr's behavior with the --truncate-set1 option. Furthermore, "truncating set1 to the length of set2" gives the same result as "padding string2 with corresponding characters from string1". It means the same thing in practice. Let's demonstrate this.

First, you need not be a developer to try compiling this. Compare the source code with the almost identical PWB/Unix version: the only difference is basically the latter's reliance on the "modern" stdio.h facilities. So I've stripped the source of its references to inbuf, fout, dup and flush and replaced them with what PWB/Unix does; this should in no way alter the behavior, as the algorithms remain untouched. I've annotated the trivial changes I've made to the original:

#include <stdio.h>    /* <------ added */
int dflag = 0;        /* <------ added "=" sign to those */
int sflag = 0;
int cflag = 0;
int save = 0;
char code[256];
char squeez[256];
char vect[256];
struct string { int last, max, rep; char *p; } string1, string2;
FILE *input;          /* <------ part of the stdio framework I guess */

main(argc,argv)
char **argv;
{
    int i, j;
    int c, d;
    char *compl;

    string1.last = string2.last = 0;
    string1.max = string2.max = 0;
    string1.rep = string2.rep = 0;
    string1.p = string2.p = "";

    if(--argc>0) {
        argv++;
        if(*argv[0]=='-'&&argv[0][1]!=0) {
            while(*++argv[0])
                switch(*argv[0]) {
                case 'c':
                    cflag++;
                    continue;
                case 'd':
                    dflag++;
                    continue;
                case 's':
                    sflag++;
                    continue;
                }
            argc--;
            argv++;
        }
    }
    if(argc>0) string1.p = argv[0];
    if(argc>1) string2.p = argv[1];
    for(i=0; i<256; i++)
        code[i] = vect[i] = 0;
    if(cflag) {
        while(c = next(&string1))
            vect[c&0377] = 1;
        j = 0;
        for(i=1; i<256; i++)
            if(vect[i]==0) vect[j++] = i;
        vect[j] = 0;
        compl = vect;
    }
    for(i=0; i<256; i++)
        squeez[i] = 0;
    for(;;){
        if(cflag) c = *compl++;
        else c = next(&string1);
        if(c==0) break;
        d = next(&string2);
        if(d==0) d = c;
        code[c&0377] = d;
        squeez[d&0377] = 1;
    }
    while(d = next(&string2))
        squeez[d&0377] = 1;
    squeez[0] = 1;
    for(i=0;i<256;i++) {
        if(code[i]==0) code[i] = i;
        else if(dflag) code[i] = 0;
    }

    input = stdin;                     /* <------ again stdio */
    while((c=getc(input)) != EOF ) {   /* <------ */
        if(c == 0) continue;
        if(c = code[c&0377]&0377)
            if(!sflag || c!=save || !squeez[c&0377])
                putchar(save = c);
    }

}

next(s)
struct string *s;
{
    int a, b, c, n;
    int base;

    if(--s->rep > 0) return(s->last);
    if(s->last < s->max) return(++s->last);
    if(*s->p=='[') {
        nextc(s);
        s->last = a = nextc(s);
        s->max = 0;
        switch(nextc(s)) {
        case '-':
            b = nextc(s);
            if(b<a || *s->p++!=']')
                goto error;
            s->max = b;
            return(a);
        case '*':
            base = (*s->p=='0')?8:10;
            n = 0;
            while((c = *s->p)>='0' && c<'0'+base) {
                n = base*n + c - '0';
                s->p++;
            }
            if(*s->p++!=']') goto error;
            if(n==0) n = 1000;
            s->rep = n;
            return(a);
        default:
        error:
            write(1,"Bad string\n",11);
            exit(0);     /* <------ original was exit(); */
        }
    }
    return(nextc(s));
}

nextc(s)
struct string *s;
{
    int c, i, n;

    c = *s->p++;
    if(c=='\\') {
        i = n = 0;
        while(i<3 && (c = *s->p)>='0' && c<='7') {
            n = n*8 + c - '0';
            i++;
            s->p++;
        }
        if(i>0) c = n;
        else c = *s->p++;
    }
    if(c==0) *--s->p = 0;
    return(c&0377);
}

So cc tr.c compiles, with one warning:

tr.c: In function ‘next’:
tr.c:118:4: warning: incompatible implicit declaration of built-in function ‘exit’ 
[enabled by default]
exit(0);
^

But a.out is there and works, so let's now compare the padding behavior of the two programs we have:

GNU tr

#tr 0123456789 d     
0123456789 input
dddddddddd output             <----- BSD classic behavior

#tr 0123456789 d123456789     <----- padding set2 with set1 explicitly 
0123456789 i
d123456789 o
01234567890123456789 i
d123456789d123456789 o

#tr -t 0123456789 d           <----- --truncate-set1 i.e. System V behavior
0123456789 i
d123456789 o                  <----- concretely, this is what is meant by a result 
0012 i                               where set2 was padded with set1
dd12 o

#tr -t 0123456789 d123456789  <----- padding set2 with set1 explicitly
0123456789 i                  
d123456789 o                  <----- note this is identical to the last results

Unix V5 tr + stdio mod

#./a.out 0123456789 d         <----- our compiled version with the classic example
0123456789 i
d123456789 o

#./a.out 0123456789 d123456789 <----- padding set2 with set1 explicitly
0123456789 i
d123456789 o

So our V5 version behaves exactly like the System V version in that respect. Furthermore, explicitly padding set2 with set1 yields the same result for all implementations, because it ensures that set1 and set2 have the same number of elements (and it is when they don't that results vary historically).

Finally, explicitly padding or having tr pad set2 with set1 as described in the original V4-V5 manuals means the same thing as truncating set1 to the length of set2 insofar as results are concerned - it IS the classic System V implementation for padding and yields the same results. V5 tr is not a different implementation, despite the difference in the man pages.
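
The equivalence can also be checked with any modern tr, without the V5 binary or the GNU-only -t option. The sketch below is only an illustration: the variable names are made up here, and it uses cut to build the padded and truncated sets by hand, then feeds the 0012 input from the transcripts above through both.

```shell
set1=0123456789
set2=d

# "Pad set2 with set1": keep set2, then append the characters of set1
# from position length(set2)+1 onward.
padded=$set2$(printf '%s' "$set1" | cut -c"$(( ${#set2} + 1 ))"-)

# "Truncate set1 to the length of set2".
truncated=$(printf '%s' "$set1" | cut -c1-"${#set2}")

# Both pairs have equal-length sets, so no implementation-specific
# padding rule is involved in either call.
out_padded=$(printf '0012\n' | tr "$set1" "$padded")
out_truncated=$(printf '0012\n' | tr "$truncated" "$set2")

printf '%s\n' "$padded" "$truncated" "$out_padded" "$out_truncated"
```

Because each call passes two sets of the same length, the comparison does not depend on which historical padding rule the local tr implements.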
