Checksum – What Checksum Algorithm is Used by Version 5 Unix?

checksumhistorical-unix

I can't figure out, what algorithm was used in Unix V5 and V6 for the sum command.

At first, I thought that it is a simple sum of bytes modulus 2^16. However for a string "1111111111\n" repeated 320 times, it computes a checksum of 28930 (using Julius Schmidt's PDP-11 emulator for JavaScript). Whereas a simple sum of bytes for it is two bytes smaller:

$ python -c 'print(sum(bytearray(b"1111111111\n"*320)) & 0xFFFF)'
28928

Later, from the MacOS man pages, I found out that the sum and cksum commands had a long history of inconsistencies. However, even the "historic" algorithm versions provided on MacOS don't agree with the Unix V5's checksum. The closest match is the default sum command of UNIX System V (invoked on Mac like cksum -o 2), which returns the same checksum for this string, but disagrees for others:

$ python -c 'print("1111111111\n"*320, end="")' | cksum -o 2
28930 7

To be more specific, the cksum -o 2 and Unix V5's sum produce different output for most binary files in the emulator (for example, in folder /bin), though they agree on most text files.

Is this a genuine behavior or a bug in the emulator? If it is genuine, what algorithm is it?

P.S. Here is the source code, if anyone can read assembly code from 1974.

Best Answer

At first, I thought that it is a simple sum of bytes modulus 2^16

It is a sum mod 2^16, only that each time it overflows, a 1 will be added to it. Also, the bytes will be sign extended before being added to the sum. Here is a "commented" snippet from the assembly:

# r2 is the pointer into the data
# r0 is the length of the data
# r5 is the sum
2:
        movb    (r2)+,r4    # r4 = sign_extend(*r2++)
        add     r4,r5       # r5 += r4
        adc     r5          # if(r5 overflowed) r5++
        sob     r0,2b       # if(--r0) goto 2 above

the same put into a small C program (use as ./v5sum < file):

#include <stdio.h>
int main(void){
        int c, s = 0;
        while((c = getchar()) != EOF){
                s += c & 0x80 ? c | 0xff00 : c; // alternative: s += (unsigned short)(signed char)c
                if(s & 0x10000){ s++; s &= 0xffff; };
        }
        printf("%d\n", s);
        return 0;
}

To be more specific, the cksum -o 2 and Unix V5's sum produce different output for most binary files in the emulator (for example, in folder /bin), though they agree on most text files.

That's because the original unix v5 sum will sign extend the characters, and only the binaries contain bytes >= 0x80. Otherwise, the algorithms should be similar, and differ only with very big files (where the sum of the characters will overflow a 32bit unsigned int).

Related Question