Pdf – FTP Upload Corrupting PDF

ftp pdf

I have a 100kb PDF file that we'll call Test.pdf. I'm using FTP to put Test.pdf on my website. However, the PDF is corrupted when it arrives on the website. So as a diagnostic test, I ran:

$ md5sum Test.pdf
[md5sum a]
$ [ftp upload Test.pdf]
$ [ftp download Test.pdf]
$ md5sum Test.pdf
[md5sum b]

So at some point in the uploading process, the file is being corrupted! This is baffling me. I've never had this problem with any other filetype. I also tried using my website provider's manual upload client, but ran into the same problem. What's going on here?

Best Answer

You already answered your own question, but I think I can do better than "Apparently certain types of files need to be uploaded in binary."

First some small background information:

1: Computers, bits and bytes.

The smallest unit of information in a computer is a bit. A bit is either true or false, 0 or 1, high voltage or ground, ...

The bits are grouped into small sets. On almost all modern computers they come in groups of eight. We call such a group a byte.

A set of 8 bits (1 byte) can have 256 different values, starting at
00000000 meaning 0
00000001 meaning 1
00000010 meaning 2
00000011 meaning 3 (the 2 bit and the 1 bit are both set)
00000100 meaning 4
...
11111111 meaning 255
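
The table above can be reproduced with a few lines of Python. This is only an illustrative sketch; the helper name bit_pattern is made up for this example.

```python
def bit_pattern(value: int) -> str:
    """Return the 8-bit pattern of a byte value (0 to 255) as a string."""
    return format(value, "08b")

# Print a few byte values next to their bit patterns.
for value in (0, 1, 2, 3, 4, 255):
    print(bit_pattern(value), "meaning", value)

# Eight bits give 2**8 distinct values.
print(2 ** 8, "different values")
```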

2: ASCII.

ASCII is a set of 128 characters, numbered 0 to 127. You only need 7 bits for this. In the old days this was all you needed for communication: just the regular 26 letters of the Western alphabet, the numbers 0 to 9 and some special codes such as 7: ring the bell or beep.

These days we define many more characters. We use Unicode (with encodings such as UTF-8 and UTF-16), allowing Chinese, Japanese, right-to-left languages and so on. Back in the old days there was no common support for this.

3: Lastly: Bandwidth is/was expensive.

Why send all 8 bits of a byte to a destination when you know that you only need 7 of them to represent the text? If you do things in a smart way you can save 1/8th of the bandwidth.

That might not sound like much to us today, but in the era when the Europe-to-US connection was a 1200 baud dial-in line (that is about 0.1 KB/sec!), every little bit helped.

So suppose I want to write "Hello".

I can look that up in the ASCII table and I will discover that your computer would store that in five bytes containing this:

H        e        l        l        o
01001000 01100101 01101100 01101100 01101111  

Note that the first bit of every letter is 0. I might just as well keep only the remaining seven bits:

H        e        l        l        o
 1001000  1100101  1101100  1101100  1101111

The first example has 40 bits (5 bytes, each 8 bits of information).
The second example only has 35 bits. It is more efficient.
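
As a quick check, the two representations and their bit counts for the five letters of "Hello" can be computed like this. It is a small sketch for illustration only; the variable names are arbitrary.

```python
text = "Hello"
codes = [ord(ch) for ch in text]                # ASCII code of each letter

eight_bit = [format(c, "08b") for c in codes]   # full bytes
seven_bit = [format(c, "07b") for c in codes]   # leading 0 dropped

print(" ".join(eight_bit))
print(" ".join(seven_bit))

# Every 8-bit pattern of an ASCII letter starts with a 0.
print(all(bits[0] == "0" for bits in eight_bit))

print(len(text) * 8, "bits versus", len(text) * 7, "bits")
```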

This made it the preferred method of transferring text. However, leaving out the first bit will break anything that is not text. Thus the FTP protocol was designed with two options: ASCII mode (efficient for text) and BINary mode (transfer the file as it is).


OK, with all that known:

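You transferred binary files (e.g. PDFs) in ASCII mode, which did not transmit all the information. Thus the resulting files arrived mangled at the destination. ASCII mode may also rewrite line endings (CR/LF) to suit the receiving system, which damages binary data just as badly.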
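
To see why the checksums in the question differ, here is a rough simulation of a transfer that keeps only the lower 7 bits of every byte. It is a sketch, not what any particular FTP server does: strip_high_bit is a hypothetical helper, and the sample bytes are just "%PDF-" followed by a few arbitrary values above 127.

```python
import hashlib

def strip_high_bit(data: bytes) -> bytes:
    """Clear the top bit of every byte, as a 7-bit channel would."""
    return bytes(b & 0x7F for b in data)

text = b"Hello"
# "%PDF-" followed by arbitrary bytes above 127 (made-up sample data).
binary = bytes([0x25, 0x50, 0x44, 0x46, 0x2D, 0xE2, 0xE3, 0xCF, 0xD3])

# Plain ASCII text survives unchanged.
print(strip_high_bit(text) == text)

# Binary data does not, so the checksum changes.
before = hashlib.md5(binary).hexdigest()
after = hashlib.md5(strip_high_bit(binary)).hexdigest()
print(before == after)
```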

To transfer anything but plain old text, use the 'bin' command at the FTP prompt or tick the 'bin' option if you use a GUI.
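
If you script your uploads rather than typing at the FTP prompt, the same idea applies. Below is a minimal sketch using Python's standard ftplib, whose storbinary method switches the connection to binary mode before sending. The function name upload_binary, the host and the credentials are placeholders, not anything from the question.

```python
from ftplib import FTP

def upload_binary(ftp, local_path, remote_name):
    """Upload a file in binary mode so that no byte is altered."""
    with open(local_path, "rb") as fh:        # "rb": read the raw bytes
        ftp.storbinary(f"STOR {remote_name}", fh)

# Usage (hypothetical host and login):
# with FTP("ftp.example.com") as ftp:
#     ftp.login("user", "password")
#     upload_binary(ftp, "Test.pdf", "Test.pdf")
```

After the upload, downloading the file again and comparing the md5sum values as in the question should now give identical checksums.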

I hope that answers the "What's going on here?" :)
