Does copying text between Notepad++ files create files with different bytes?

character-encoding encoding notepad pdf

I've created a simple PDF [hi.pdf] containing the word "hi". When I open it in Notepad++, its encoding is shown as ANSI, which I assume is Notepad++'s best guess. If I use Save As to write it out as hiSaveAs.pdf, the copy opens successfully.

However, when I copy the contents of hi.pdf in Notepad++, paste them into a new file, and save it as hiANSI.pdf with an encoding of ANSI, the file is corrupted and can't be opened:

Error, failed to load pdf document.
  • When I re-open hiANSI.pdf in Notepad++, UTF8 is listed as the encoding, and when I compare it to hi.pdf, I notice it has spaces where hi.pdf has the NUL character:
    • hi.pdf: Screenshot1
    • hiANSI.pdf: Screenshot2
  • If I change the encoding of hiANSI.pdf to ANSI instead of UTF8, the text differs from hi.pdf even more: Screenshot3

Can someone explain what is happening here?

  • Why does Save As work, while copying the exact same text into a new Notepad++ file results in a space instead of the NUL char?
  • Why does Notepad++ think hiANSI.pdf is UTF8, but hi.pdf ANSI?
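One way to see how the two detected encodings can diverge is to push a few high bytes through a text round trip. This is only a sketch under assumptions: the sample bytes are hypothetical (not taken from hi.pdf), "ANSI" is assumed to mean Windows-1252, and it is assumed that the pasted text was written back out as UTF-8.

```python
# Hypothetical sample: "%" followed by four bytes >= 0x80, similar to the
# binary marker line many PDFs carry. These bytes are NOT taken from hi.pdf.
raw = bytes([0x25, 0xE2, 0xE3, 0xCF, 0xD3])

# What a text editor holds after reading the bytes as ANSI
# (Windows-1252 is assumed here; the real code page depends on the locale).
text = raw.decode("cp1252")

# Writing the same text back as ANSI reproduces the original bytes ...
as_ansi = text.encode("cp1252")
# ... but writing it as UTF-8 turns every byte >= 0x80 into two bytes.
as_utf8 = text.encode("utf-8")

print(as_ansi.hex(" "))  # 25 e2 e3 cf d3
print(as_utf8.hex(" "))  # 25 c3 a2 c3 a3 c3 8f c3 93

# Viewing the UTF-8 bytes as ANSI again shows extra characters.
print(as_utf8[:3].decode("cp1252"))  # %Ã¢
```

If the pasted text really was stored as UTF-8, this would be consistent with both observations: the file is detected as UTF8 on re-opening, and forcing the view to ANSI makes it look even less like hi.pdf.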

The linked question does not answer this one.

The MSB is not being stripped. Have a look at the hex comparison:

[Image: hex comparison of hi.pdf and hiANSI.pdf]

For example, why is 0A being added between 0D and 25 (first row, 10th byte)?
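A comparison like this can also be reproduced without screenshots. The following is a minimal sketch using Python's standard `difflib`. The function name `byte_changes` and the sample bytes are made up for illustration, and `SequenceMatcher` gets slow on large files, so it only suits small ones like these.

```python
import difflib

def byte_changes(original: bytes, modified: bytes):
    """List the inserted, deleted, and replaced byte runs between two buffers.

    Each entry is (operation, offset in original, original bytes as hex,
    modified bytes as hex).
    """
    matcher = difflib.SequenceMatcher(None, original, modified, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append(
                (tag, i1, original[i1:i2].hex(" "), modified[j1:j2].hex(" "))
            )
    return changes

# Made-up sample: a lone CR gains an LF, and a NUL becomes a space.
for change in byte_changes(b"\r%PDF\x00", b"\r\n%PDF "):
    print(change)
# ('insert', 1, '', '0a')
# ('replace', 5, '00', '20')
```

Unlike a position-by-position comparison, this keeps the two buffers aligned after an inserted byte, so one extra 0A does not make every later byte look different.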

UPDATE:

I noticed Notepad did much less than Notepad++ in terms of "helping". For example when I saved hi.pdf as hiANSI.pdf using Notepad instead of Notepad++, the only thing Notepad did to help was add 0x0A (line feed) after 0x0D (carriage return), and replaced 0x00 (NUL) with 0x20 (space):

[Image: hex comparison after saving as hiANSI.pdf with Notepad]

If I saved hi.pdf as hiANSI.bin, it did even less. It just replaced 0x00 with 0x20:

[Image: hex comparison after saving as hiANSI.bin with Notepad]

In the above two cases, it produced a valid PDF but with "hi" replaced with "IJ":

[Image: the resulting PDF displaying "IJ"]
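The two changes observed with Notepad can be modelled in a few lines. This is a sketch of the observed effect only, not Notepad's actual logic. The function name `simulate_notepad_save` is hypothetical, and it assumes that only a CR not already followed by LF gains an LF.

```python
import re

def simulate_notepad_save(data: bytes, fix_line_endings: bool = True) -> bytes:
    """Model the byte changes described above (an assumption, not Notepad's code).

    - every 0x00 (NUL) becomes 0x20 (space)
    - optionally, every 0x0D (CR) not followed by 0x0A (LF) gains an LF
    """
    out = data.replace(b"\x00", b" ")
    if fix_line_endings:
        out = re.sub(rb"\r(?!\n)", b"\r\n", out)
    return out

sample = b"%PDF\r\x00h\x00i\r\n"  # made-up bytes, not the real hi.pdf

print(simulate_notepad_save(sample))                          # b'%PDF\r\n h i\r\n'
print(simulate_notepad_save(sample, fix_line_endings=False))  # b'%PDF\r h i\r\n'
```

Both changes lose information. Once a NUL has become a space, it cannot be told apart from a space that was already in the file, so the original bytes cannot be recovered from the saved copy alone.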

UPDATE:

If I replace the following 0x20 bytes in hiANSI.pdf with 0x00 to match hi.pdf, it displays "hi" instead of "IJ" but with a different font:

[Image: left is hi.pdf, right is hiANSI.pdf]

Here are the two bytes I changed (highlighted in yellow):

[Image: hex view with the two changed bytes highlighted in yellow]

Why does changing these two bytes have this effect?

Best Answer

Notepad++ is a text editor, not a binary editor, so it "corrected" the text when pasting.

In your example, the 0D was interpreted as a carriage return, which is the first half of the Windows end-of-line sequence (0D 0A). Since the 0A (line feed) was missing, Notepad++ thoughtfully "corrected" your text by adding it.

For more information see Wikipedia:

For a freeware hex editor, see for example HxD.
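If installing a hex editor is not an option, the raw bytes can also be inspected with a short script. This is a minimal sketch, with `hexdump` as a made-up helper name. The important part is reading the file in binary mode, so that nothing gets "corrected" on the way in.

```python
def hexdump(data: bytes, width: int = 16) -> str:
    """Format bytes as offset, hex values, and printable ASCII, one row per line."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02X}" for b in chunk)
        # Non-printable bytes (such as NUL, CR, LF) are shown as "."
        ascii_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08X}  {hex_part:<{width * 3 - 1}}  {ascii_part}")
    return "\n".join(lines)

# Made-up sample bytes; for a real file, read it in binary mode first:
#     with open("hi.pdf", "rb") as f:
#         data = f.read()
print(hexdump(b"%PDF\r\x00hi", width=8))
# 00000000  25 50 44 46 0D 00 68 69  %PDF..hi
```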
