I can't figure out what algorithm was used in Unix V5 and V6 for the sum command.
At first, I thought it was a simple sum of bytes modulo 2^16. However, for the string "1111111111\n" repeated 320 times, it computes a checksum of 28930 (using Julius Schmidt's PDP-11 emulator for JavaScript), whereas a simple sum of bytes for it is smaller by two:
$ python -c 'print(sum(bytearray(b"1111111111\n"*320)) & 0xFFFF)'
28928
Later, from the macOS man pages, I found out that the sum and cksum commands have a long history of inconsistencies. However, even the "historic" algorithm variants provided on macOS don't agree with Unix V5's checksum. The closest match is the default sum command of UNIX System V (invoked on macOS as cksum -o 2), which returns the same checksum for this string but disagrees for others:
$ python -c 'print("1111111111\n"*320, end="")' | cksum -o 2
28930 7
To be more specific, cksum -o 2 and Unix V5's sum produce different output for most binary files in the emulator (for example, those in /bin), though they agree on most text files.
Is this genuine behavior or a bug in the emulator? If it is genuine, what algorithm is it?
P.S. Here is the source code, if anyone can read assembly code from 1974.
Best Answer
It is a sum mod 2^16, except that each time the sum overflows, 1 is added to it (an end-around carry). Also, each byte is sign-extended before being added to the sum.
As for why cksum -o 2 and Unix V5's sum agree on text files but not on binaries: the original Unix V5 sum sign-extends the bytes, and only the binaries contain bytes >= 0x80. Otherwise, the two algorithms are similar and would differ only for very big files, where the sum of the bytes overflows a 32-bit unsigned int.