This might sound like a very silly question to you all, and in fact it is, but I really got myself confused trying to figure out the exact difference between data and text.
Here is my observation: suppose we have 2 files, text.txt and data.png. As you might have already guessed, the former is a simple text file which contains text; it can be opened by a simple text editor, and its contents are what we call text, right?
Now, the latter is a picture, whose contents are called data, right? It's an image, and when you open it, it displays an image on your computer. But if you change its extension to something like .txt, or open it with a text editor like Notepad (with UTF-8 encoding), we see text, though extremely obscured. This at least proves that an image file also contains text, so then what's data? Where's data? Is that text data? In computer terms, how would I distinguish between text and data?
Yet another observation that I would like to share: I was practicing steganography, and I was successful in adding some text at the end of the image file, and it didn't even corrupt it! So, the text that I added wasn't data?
Thanks
PS : I don't know what tag to select for such a question.
Best Answer
Firstly, both forms are 'data' in a sense, and getting down to basics, these are both stored in exactly the same way at the base level, in a binary format. Whether it's text, numerical, executable, anything, it is all stored in binary, a combination of 0's and 1's, on the storage medium you are using.
So, why does what you refer to as text display the way it does?
All text is, again, stored as a combination of 0's and 1's. But that in itself is fairly useless to an end user who wants to see the value stored on the drive. This is where character encoding comes into play.
You may have heard of some different types of character encoding before, such as ASCII and UTF. These are used to map stored binary to a character you recognise (which will then be displayed using a certain font, but that's slightly outside of this scope).
Using ASCII as an example, characters are stored in 7 bits (where a byte consists of 8 bits), from 0000000 to 1111111. You can see how each character is mapped in the table at http://www.asciitable.com/
Each character, that is, uppercase, lowercase, symbol and "special character", is represented by a certain value. Using Hello as an example: H is stored as 1001000, e as 1100101, each l as 1101100 and o as 1101111.
Other character maps will use all 8 bits, or even more than 1 byte, to store a character, allowing for larger alphabets or multiple alphabets and more symbols to be stored in the same file, using the same encoding.
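As a rough illustration of the mapping described above, here is a minimal Python sketch. The helper name `to_bits` is my own, not something from a library. It prints the bit pattern behind each character of Hello in ASCII, then shows a character that needs more than one byte in UTF-8.

```python
def to_bits(text, encoding):
    """Return the bytes of `text` in `encoding` as 8-bit binary strings.

    `to_bits` is a hypothetical helper written for this illustration.
    """
    return [format(byte, "08b") for byte in text.encode(encoding)]


# ASCII: one byte per character; the top bit is always 0,
# so only 7 bits carry information.
for char, bits in zip("Hello", to_bits("Hello", "ascii")):
    print(char, bits)

# UTF-8: characters outside the ASCII range take more than one byte.
e_acute = "\u00e9"  # the letter e with an acute accent
print(e_acute, to_bits(e_acute, "utf-8"))
```

The same bytes decoded with a different character map would give different characters, which is why the encoding has to be known (or guessed) before bytes can be shown as text.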
So we can see how binary can now be converted into what we consider "text".
But what happens when you open another file type, not considered text?
Every file on your machine, being stored as binary, can be opened by a text editor, which will attempt to read the file using some form of character encoding. Of course, what is displayed will be mostly gibberish, as the file wasn't written to be read through a character map, but instead to be interpreted in a different way. A lot of the bytes will coincidentally match a character from the map the editor is using, which means you will occasionally see the odd character you recognise. The rest either won't be mapped at all, and will show up as an odd or missing character, or will match parts of the map that make no sense in context. There's nothing stopping the editor from trying, though.
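To make this concrete, here is a small Python sketch that treats the first 8 bytes of a PNG file (the fixed PNG signature) as if they were UTF-8 text. The function name `as_editor_text` is an assumption of mine, and real editors may handle invalid bytes differently; this only mimics a lenient one that substitutes a replacement character.

```python
# The fixed 8-byte signature that starts every PNG file.
png_start = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A])


def as_editor_text(raw, encoding="utf-8"):
    """Decode bytes roughly the way a lenient text editor might.

    Hypothetical helper: bytes that are invalid in `encoding` become
    the Unicode replacement character instead of raising an error.
    """
    return raw.decode(encoding, errors="replace")


shown = as_editor_text(png_start)
print(repr(shown))  # the letters P, N, G survive; the first byte does not
```

The bytes 0x50, 0x4E and 0x47 happen to be the ASCII codes for P, N and G, so they show up as readable letters, while 0x89 is not valid on its own in UTF-8 and gets replaced.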
I edited the data for a PNG file, but it still opened and didn't corrupt. Why?
Looking here you can see the structure of a PNG file. Specifically:
This particular file type provides an end-of-file marker (the IEND chunk), which tells the reader that anything beyond this point is not part of the image itself. As such, you could add data beyond this and it may not cause a problem if the reader is handling the file correctly. That said, if you add another EoF marker it could cause confusion.
Another thing to note is that the file type is made of chunks, each with a CRC check. The CRC tells the reader whether the chunk is valid and hasn't been altered, and should always be present. The reader may be written to disregard data that does not form a valid chunk and CRC combination, although I would expect this to throw an error of some form.
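The chunk layout and the end marker can be checked with a short Python sketch. This is only an illustration under my own assumptions: `make_chunk` and `scan_png` are hypothetical helpers, and the sample image is a 1x1 grayscale PNG built in memory rather than a real file. Each chunk is a 4-byte length, a 4-byte type, the data, and a 4-byte CRC computed over the type and data.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def make_chunk(chunk_type, data=b""):
    """Build one PNG chunk: length, type, data, CRC over type + data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


def scan_png(raw):
    """Walk the chunks up to IEND.

    Returns (list of (chunk type, CRC matches?), bytes found after IEND).
    """
    if not raw.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    pos = len(PNG_SIGNATURE)
    chunks = []
    while pos + 12 <= len(raw):
        (length,) = struct.unpack(">I", raw[pos:pos + 4])
        end = pos + 12 + length
        if end > len(raw):
            raise ValueError("truncated chunk")
        chunk_type = raw[pos + 4:pos + 8]
        data = raw[pos + 8:pos + 8 + length]
        (stored_crc,) = struct.unpack(">I", raw[end - 4:end])
        crc_ok = (zlib.crc32(chunk_type + data) & 0xFFFFFFFF) == stored_crc
        chunks.append((chunk_type.decode("ascii", errors="replace"), crc_ok))
        pos = end
        if chunk_type == b"IEND":
            break  # everything after this is not part of the image
    return chunks, raw[pos:]


# A minimal 1x1 grayscale image: header, one pixel of image data, end marker.
ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
sample = (
    PNG_SIGNATURE
    + make_chunk(b"IHDR", ihdr)
    + make_chunk(b"IDAT", zlib.compress(b"\x00\x00"))
    + make_chunk(b"IEND")
)

# Append text after the end marker, as in the steganography experiment.
chunks, trailing = scan_png(sample + b"hidden message")
print(chunks)
print(trailing)
```

In this sketch the appended bytes come back separately from the chunk list and every CRC still matches, which is consistent with the image opening normally. Changing a byte inside a chunk, by contrast, makes that chunk's CRC check fail.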
Further reading:
ASCII
Binary File
Character Encoding