Difference between data and text

text editing

This might sound like a very silly question, and perhaps it is, but I have confused myself trying to pin down the exact difference between data and text.

Here is my observation. Suppose we have two files, text.txt and data.png. As you might have guessed, the former is a simple text file: it can be opened in a simple text editor, and its contents are what we call text, right?

The latter is a picture, whose contents are called data, right? When you open it, it displays an image on your computer. But if you change its extension to something like .txt, or open it in a text editor like Notepad (with UTF-8 encoding), you see text, although extremely obscured. This at least proves that an image file also contains text, so what is data? Where is the data? Is that text data? In computer terms, how would I distinguish between text and data?

One more observation I would like to share: I was practicing steganography, and I managed to add some text at the end of the image file without corrupting it! So was the text that I added not data?

Thanks

PS : I don't know what tag to select for such a question.

Best Answer

Firstly, both forms are 'data' in a sense. Getting down to basics, both are stored in exactly the same way at the lowest level: in binary. Whether it's text, numbers, an executable, or anything else, it is all stored as a combination of 0s and 1s on the storage medium you are using.

So, why does what you refer to as text display the way it does?

All text is stored, again, as a combination of 0s and 1s. But that in itself is fairly useless to an end user who wants to see the value stored on the drive. This is where character encoding comes into play.

You may have heard of some character encodings before, such as ASCII and the UTF family (UTF-8, UTF-16). These are used to map stored binary to a character you recognise (which will then be displayed using a certain font, but that's slightly outside of this scope).
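As a small illustration of that mapping, the sketch below decodes the same single byte under two different encodings. The choice of `latin-1` and `cp1251` is only an example; any two single-byte encodings that disagree on this value would show the same thing.

```python
# One stored byte, value 0xE9 (binary 11101001).
raw = bytes([0xE9])

# The byte itself never changes; only the character map used to read it does.
as_latin1 = raw.decode("latin-1")  # Western European map
as_cp1251 = raw.decode("cp1251")   # Cyrillic map

print(repr(as_latin1), repr(as_cp1251))
```

The stored bits are identical in both cases, which is why a text editor has to know (or guess) the encoding before it can show anything meaningful.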

Using ASCII as an example, each character is defined by 7 bits, from 0000000 to 1111111 (in practice each one is stored in a byte, which consists of 8 bits). You can see how each character is mapped here:

(Image: ASCII table, from http://www.asciitable.com/)

Each character, whether uppercase, lowercase, symbol, or "special character", is represented by a certain value. Using Hello as an example:

`H` -> Decimal 72 -> Binary 01001000
`e` -> Decimal 101 -> Binary 01100101
`l` -> Decimal 108 -> Binary 01101100
`l` -> Decimal 108 -> Binary 01101100
`o` -> Decimal 111 -> Binary 01101111
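The table above can be reproduced with a few lines of Python. This is just a sketch; `char_table` is a helper name made up for this example.

```python
def char_table(text):
    """Return (character, decimal code, 8-bit binary string) for each character."""
    return [(ch, ord(ch), format(ord(ch), "08b")) for ch in text]


for ch, dec, bits in char_table("Hello"):
    print(f"{ch} -> Decimal {dec} -> Binary {bits}")
```

Note that the leading bit is 0 for every row, since ASCII only needs 7 of the 8 bits in a byte.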

Other character maps will use all 8 bits, or even more than 1 byte, to store a character, allowing for larger alphabets or multiple alphabets and more symbols to be stored in the same file, using the same encoding.
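To see the "more than 1 byte" case, the sketch below counts how many bytes a few characters take when encoded. UTF-8 is used here as one example of a variable-width encoding, and `encoded_length` is a hypothetical helper for illustration.

```python
def encoded_length(ch, encoding="utf-8"):
    """Number of bytes used to store `ch` in the given encoding."""
    return len(ch.encode(encoding))


for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:  # A, e-acute, euro sign, an emoji
    print(repr(ch), encoded_length(ch), ch.encode("utf-8"))
```

The plain ASCII letter still takes a single byte, while the others take two, three, and four bytes respectively.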

So we can see how binary can now be converted into what we consider "text".

But what happens when you open another file type, not considered text?

Every file on your machine, being stored as binary, can be opened by a text editor, which will attempt to read the file using some form of character encoding. Of course, what is displayed will be mostly gibberish, as the file wasn't written to be read through a character map, but to be interpreted in a different way. A lot of the bytes will coincidentally match a character in the map being used, which means you will occasionally see the odd character you recognise. The rest either won't be mapped at all, and show up as an odd or missing character, or will match parts of the map that make no sense in context. There's nothing stopping the editor from trying, though.
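As a rough illustration, the sketch below takes the first eight bytes of any PNG file (its fixed signature) and decodes them as if they were text. The `errors="replace"` option stands in for what an editor does when it hits a byte it cannot map.

```python
# The 8-byte signature that every PNG file starts with.
PNG_SIGNATURE = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A])

# 0x89 is not a valid start byte in UTF-8, so it becomes the replacement character.
as_utf8 = PNG_SIGNATURE.decode("utf-8", errors="replace")

# latin-1 maps every possible byte to some character, so nothing is "missing" here.
as_latin1 = PNG_SIGNATURE.decode("latin-1")

print(repr(as_utf8))
print(repr(as_latin1))
```

The bytes 0x50, 0x4E, and 0x47 happen to be the ASCII codes for P, N, and G, which is why "PNG" is readable near the start of the file in a text editor while most of the rest is not.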

I edited the data for a PNG file, but it still opened and didn't corrupt. Why?

Looking at the PNG specification, you can see the structure of a PNG file. Specifically:

Chunks can appear in any order, subject to the restrictions placed on each chunk type. (One notable restriction is that IHDR must appear first and IEND must appear last; thus the IEND chunk serves as an end-of-file marker.) Multiple chunks of the same type can appear, but only if specifically permitted for that type.

This particular file type provides an end-of-file marker (the IEND chunk), which tells the reader that anything beyond this point is not part of the image itself. As such, you can add data after it, and it may not cause a problem if the reader handles the file correctly. That said, adding another end-of-file marker could cause confusion.

Another thing to note is that the file is made up of chunks, each with a CRC. The CRC lets the reader check that the chunk is valid and hasn't been altered, and it should always be present. A reader may be written to disregard data that doesn't form a valid chunk and CRC combination, although I would expect this to raise an error of some form.
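The two points above can be tried out with a minimal sketch. It builds a tiny 1x1 greyscale PNG in memory, appends some text after IEND, and then walks the chunks the way a simple reader might. The helper names (`make_chunk`, `build_tiny_png`, `read_chunks`) are invented for this example, and real image viewers may behave differently.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def make_chunk(chunk_type, data):
    """Chunk layout: 4-byte length, 4-byte type, data, 4-byte CRC over type + data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


def build_tiny_png():
    """Build a 1x1, 8-bit greyscale PNG entirely in memory."""
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
    idat = zlib.compress(b"\x00\x00")  # one filter byte plus one black pixel
    return (PNG_SIGNATURE + make_chunk(b"IHDR", ihdr)
            + make_chunk(b"IDAT", idat) + make_chunk(b"IEND", b""))


def read_chunks(blob):
    """Return ([(chunk type, crc_ok), ...], bytes left over after IEND)."""
    if blob[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    pos, chunks = 8, []
    while pos < len(blob):
        (length,) = struct.unpack(">I", blob[pos:pos + 4])
        chunk_type = blob[pos + 4:pos + 8]
        data = blob[pos + 8:pos + 8 + length]
        (stored_crc,) = struct.unpack(">I", blob[pos + 8 + length:pos + 12 + length])
        crc_ok = (zlib.crc32(chunk_type + data) & 0xFFFFFFFF) == stored_crc
        chunks.append((chunk_type.decode("ascii"), crc_ok))
        pos += 12 + length
        if chunk_type == b"IEND":  # end-of-file marker: stop reading here
            break
    return chunks, blob[pos:]


png = build_tiny_png()
chunks, trailing = read_chunks(png + b"hidden message")
print(chunks)    # every chunk is still intact
print(trailing)  # the appended text sits after IEND, outside the image

# Changing a byte inside a chunk is different: that chunk's CRC no longer matches.
corrupted = bytearray(png)
corrupted[41] ^= 0xFF  # offset 41 is the first data byte of the IDAT chunk here
print(read_chunks(bytes(corrupted))[0])
```

Appending after IEND leaves every chunk and CRC untouched, which fits the observation in the question, whereas editing bytes inside a chunk is something a CRC-checking reader can detect.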


Further reading:

ASCII

Binary File

Character Encoding
