Is the historical Unix V5 tr command padding behavior of set2 different from what we consider today “classic” System V (1983-1988) behavior

historytr

The tr command is almost 40 years old. It seemingly appeared in Unix for the first time in 1973 with Unix V4. The source for this is not available. Here is probably the second oldest available Unix implementation of the command from Unix V5 about 6 months later in June 1974:

int dflag 0;
int sflag 0;
int cflag 0;
int save 0;
char code[256];
char squeez[256];
char vect[256];
struct string { int last, max, rep; char *p; } string1, string2;
int inbuf[259];

main(argc,argv)
char **argv;
{
    int i, j;
    int c, d;
    char *compl;
    extern fout;

    string1.last = string2.last = 0;
    string1.max = string2.max = 0;
    string1.rep = string2.rep = 0;
    string1.p = string2.p = "";

    if(--argc>0) {
        argv++;
        if(*argv[0]=='-'&&argv[0][4]!=0) {
            while(*++argv[0])
                switch(*argv[0]) {
                case 'c':
                    cflag++;
                    continue;
                case 'd':
                    dflag++;
                    continue;
                case 's':
                    sflag++;
                    continue;
                }
            argc--;
            argv++;
        }
    }
    if(argc>0) string1.p = argv[0];
    if(argc>1) string2.p = argv[1];
    for(i=0; i<256; i++)
        code[i] = vect[i] = 0;
    if(cflag) {
        while(c = next(&string1))
            vect[c&0377] = 1;
        j = 0;
        for(i=1; i<256; i++)
            if(vect[i]==0) vect[j++] = i;
        vect[j] = 0;
        compl = vect;
    }
    for(i=0; i<256; i++)
        squeez[i] = 0;
    for(;;){
        if(cflag) c = *compl++;
        else c = next(&string1);
        if(c==0) break;
        d = next(&string2);
        if(d==0) d = c;
        code[c&0377] = d;
        squeez[d&0377] = 1;
    }
    while(d = next(&string2))
        squeez[d&0377] = 1;
    squeez[0] = 1;
    for(i=0;i<256;i++) {
        if(code[i]==0) code[i] = i;
        else if(dflag) code[i] = 0;
    }

    inbuf[0] = 0;
    fout = dup(1);
    close(1);
    while((c=getc(inbuf)) >=0 ) {
        if(c == 0) continue;
        if(c = code[c&0377]&0377)
            if(!sflag || c!=save || !squeez[c&0377])
                putchar(save = c);
    }
    flush();
}

next(s)
struct string *s;
{
    int a, b, c, n;
    int base;

    if(--s->rep > 0) return(s->last);
    if(s->last < s->max) return(++s->last);
    if(*s->p=='[') {
        nextc(s);
        s->last = a = nextc(s);
        s->max = 0;
        switch(nextc(s)) {
        case '-':
            b = nextc(s);
            if(b<a || *s->p++!=']')
                goto error;
            s->max = b;
            return(a);
        case '*':
            base = (*s->p=='0')?8:10;
            n = 0;
            while((c = *s->p)>='0' && c<'0'+base) {
                n = base*n + c - '0';
                s->p++;
            }
            if(*s->p++!=']') goto error;
            if(n==0) n = 1000;
            s->rep = n;
            return(a);
        default:
        error:
            write(1,"Bad string\n",11);
            exit();
        }
    }
    return(nextc(s));
}

nextc(s)
struct string *s;
{
    int c, i, n;

    c = *s->p++;
    if(c=='\\') {
        i = n = 0;
        while(i<3 && (c = *s->p)>='0' && c<='7') {
            n = n*8 + c - '0';
            i++;
            s->p++;
        }
        if(i>0) c = n;
        else c = *s->p++;
    }
    if(c==0) *--s->p = 0;
    return(c&0377);
}

With time this evolved and since the early days, there's been variations with how the command deals with sets of different length, and here my interest is with the case where set2 is shorter than set1.

The GNU Coreutils manual discusses the situation:

When tr is performing translation, set1 and set2 typically have the
same length. If set1 is shorter than set2, the extra characters at the
end of set2 are ignored.

On the other hand, making set1 longer than set2 is not portable; POSIX
says that the result is undefined. In this situation, BSD tr pads set2
to the length of set1 by repeating the last character of set2 as many
times as necessary. System V tr truncates set1 to the length of set2.

By default, GNU tr handles this case like BSD tr. When the
–truncate-set1 (-t) option is given, GNU tr handles this case like the System V tr >instead. This option is ignored for operations other than translation.

Acting like System V tr in this case breaks the relatively common BSD
idiom:

 tr -cs A-Za-z0-9 '\012'

because it converts only zero bytes (the first element in the
complement of set1), rather than all non-alphanumerics, to newlines.

There is also such a discussion in The Open Group Base Specifications Issue 7 IEEE Std 1003.1, 2013 Edition:

When string2 is shorter than string1, a difference results between
historical System V and BSD systems. A BSD system pads string2 with
the last character found in string2. Thus, it is possible to do the
following:

tr 0123456789 d

which would translate all digits to the letter 'd'. Since this area is
specifically unspecified in this volume of POSIX.1-2008, both the BSD
and System V behaviors are allowed, but a conforming application
cannot rely on the BSD behavior. It would have to code the example in
the following way:

tr 0123456789 '[d*]'

Now, if you read the man pages for the tr command in V4 and V5, you see the following reference in both:

If string2 is short, it is padded with corresponding characters from string1.

But that reference is omitted in the V6 manual and later early versions of Unix yet the V6 implementation of the command is line for line identical to V5? So you have a difference in the manuals but not in the code? Also, this implementation seems different from what is considered "classic BSD or System V" behavior i.e. pad from adding from set2 elements or truncate to length of set1.

So is the V4-V5 implementation different from the System V milestone Unix and what is the rationale for that different implementation and ultimately why was it discarded down the road? How can I find out more information about such an early design of the command?

Best Answer

The difference is only in the wording of the padding behavior in the V4-V5 manual - but the behavior is the same throughout. As it stands the results of the V5 implementation is identical to that of the System V one, which is itself identical to the GNU tr behavior with the --truncate-set1 option. Furthermore, "truncating set1 to the lenght of set2" gives the same result as "padding string2 with corresponding characters from string1". It means the same thing in practice. Let's demonstrate this.

First, you need not be a developer to try to compile this. Compare the source code with the almost identical PWB/Unix version. You will see the only difference being reliance on the "modern" stdio.h assets basically, so I've stripped the source of its references to inbuf, fout, dup and flush and replaced it with what PWB/Unix does - but this in no way should alter the behavior as the algorithms remain untouched. I've annotated the trivial changes I've made from the original:

#include <stdio.h>    <------ added
int dflag = 0;        <------ added "=" sign to those
int sflag = 0;
int cflag = 0;
int save = 0;
char code[256];
char squeez[256];
char vect[256];
struct string { int last, max, rep; char *p; } string1, string2;
FILE *input;          <------ part of the stdio framework I guess;

main(argc,argv)
char **argv;
{
    int i, j;
    int c, d;
    char *compl;

    string1.last = string2.last = 0;
    string1.max = string2.max = 0;
    string1.rep = string2.rep = 0;
    string1.p = string2.p = "";

    if(--argc>0) {
        argv++;
        if(*argv[0]=='-'&&argv[0][1]!=0) {
            while(*++argv[0])
                switch(*argv[0]) {
                case 'c':
                    cflag++;
                    continue;
                case 'd':
                    dflag++;
                    continue;
                case 's':
                    sflag++;
                    continue;
                }
            argc--;
            argv++;
        }
    }
    if(argc>0) string1.p = argv[0];
    if(argc>1) string2.p = argv[1];
    for(i=0; i<256; i++)
        code[i] = vect[i] = 0;
    if(cflag) {
        while(c = next(&string1))
            vect[c&0377] = 1;
        j = 0;
        for(i=1; i<256; i++)
            if(vect[i]==0) vect[j++] = i;
        vect[j] = 0;
        compl = vect;
    }
    for(i=0; i<256; i++)
        squeez[i] = 0;
    for(;;){
        if(cflag) c = *compl++;
        else c = next(&string1);
        if(c==0) break;
        d = next(&string2);
        if(d==0) d = c;
        code[c&0377] = d;
        squeez[d&0377] = 1;
    }
    while(d = next(&string2))
        squeez[d&0377] = 1;
    squeez[0] = 1;
    for(i=0;i<256;i++) {
        if(code[i]==0) code[i] = i;
        else if(dflag) code[i] = 0;
    }

    input = stdin;                     <------ again stdio
    while((c=getc(input)) != EOF ) {   <------
        if(c == 0) continue;
        if(c = code[c&0377]&0377)
            if(!sflag || c!=save || !squeez[c&0377])
                putchar(save = c);
    }

}

next(s)
struct string *s;
{
    int a, b, c, n;
    int base;

    if(--s->rep > 0) return(s->last);
    if(s->last < s->max) return(++s->last);
    if(*s->p=='[') {
        nextc(s);
        s->last = a = nextc(s);
        s->max = 0;
        switch(nextc(s)) {
        case '-':
            b = nextc(s);
            if(b<a || *s->p++!=']')
                goto error;
            s->max = b;
            return(a);
        case '*':
            base = (*s->p=='0')?8:10;
            n = 0;
            while((c = *s->p)>='0' && c<'0'+base) {
                n = base*n + c - '0';
                s->p++;
            }
            if(*s->p++!=']') goto error;
            if(n==0) n = 1000;
            s->rep = n;
            return(a);
        default:
        error:
            write(1,"Bad string\n",11);
            exit(0);     <------original was exit();
        }
    }
    return(nextc(s));
}

nextc(s)
struct string *s;
{
    int c, i, n;

    c = *s->p++;
    if(c=='\\') {
        i = n = 0;
        while(i<3 && (c = *s->p)>='0' && c<='7') {
            n = n*8 + c - '0';
            i++;
            s->p++;
        }
        if(i>0) c = n;
        else c = *s->p++;
    }
    if(c==0) *--s->p = 0;
    return(c&0377);
}

So cc tr.c compiles:

tr.c: In function ‘next’:
tr.c:118:4: warning: incompatible implicit declaration of built-in function ‘exit’ 
[enabled by default]
exit(0);
^

But a.out is there and works, so let's now compare the padding behavior of the two programs we have:

GNU tr

#tr 0123456789 d     
0123456789 input
dddddddddd output             <----- BSD classic behavior

#tr 0123456789 d123456789     <----- padding set2 with set1 explicitly 
0123456789 i
d123456789 o
01234567890123456789 i
d123456789d123456789 o

#tr -t 0123456789 d           <----- --truncate-set1 i.e. System V behavior
0123456789 i
d123456789 o                  <----- concretely, this is what is meant by a result 
0012 i                               where set2 was padded with set1
dd12 o

#tr -t 0123456789 d123456789  <----- padding set2 with set1 explicitly
0123456789 i                  
d123456789 o                  <----- note this is identical to the last results

Unix V5 tr + stdio mod

#./a.out 0123456789 d         <----- our compiled version with the classic example
0123456789 i
d123456789 o

./a.out 0123456789 d123456789 <----- padding set2 with set1 explicitly
0123456789 i
d123456789 o

So our V5 version behaves exactly like the System V version in that respect. Furthermore explicitly padding set2 with set1 yields the same result for all implementations because it insures that set1 and set2 have the same number of elements (and it's when you don't have this that results vary historically).

Finally, explicitly padding or having tr pad set2 with set1 as described in the original V4-V5 manuals means the same thing as truncating set1 to the length of set2 insofar as results are concerned - it IS the classic System V implementation for padding and yields the same results. V5 tr is not a different implementation, despite the difference in the man pages.

Related Question