I would like to know how file types are determined when filenames don't have suffixes.
For example, a file named myfile could be either binary or text. How does the system know which one it is?
Best Answer
The file utility determines the file type in three ways:
First come the filesystem tests. Here one of the stat family of system calls is invoked on the file. This returns the Unix file type: regular file, directory, symbolic link, character device, block device, named pipe, or socket. Depending on the result, the magic tests follow.
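The filesystem test can be approximated in a few lines of Python. This is only a sketch of the idea, not how the file utility itself is implemented. The function name unix_file_type is made up for illustration, and os.lstat is used so that a symbolic link is reported as such instead of being followed.

```python
import os
import stat


def unix_file_type(path):
    """Classify a path by its Unix file type (hypothetical helper)."""
    # lstat does not follow symbolic links, so links are reported as links.
    mode = os.lstat(path).st_mode
    if stat.S_ISREG(mode):
        return "regular file"
    if stat.S_ISDIR(mode):
        return "directory"
    if stat.S_ISLNK(mode):
        return "symbolic link"
    if stat.S_ISCHR(mode):
        return "character device"
    if stat.S_ISBLK(mode):
        return "block device"
    if stat.S_ISFIFO(mode):
        return "named pipe"
    if stat.S_ISSOCK(mode):
        return "socket"
    return "unknown"
```

Only when the result is a regular file does it make sense to open the file and look at its contents, which is what the following tests do.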
The magic tests are a bit more complex. File types are guessed using a database of patterns called the magic file. Some file types can be determined by reading a few bytes or a number at a particular offset within the file (binaries, for example). The magic file contains "magic numbers" to test for, along with the text that should be printed when they match. These "magic numbers" can be 1- to 4-byte values, strings, dates, or even regular expressions. Further tests can reveal additional information. For an executable, that would be whether it is dynamically linked, whether it is stripped, and which architecture it targets. Sometimes multiple tests must pass before the file type can be identified. However many tests are performed, the result is always just a good guess.
Here are the first 8 bytes of some common file types, which give a feel for what these magic numbers can look like:
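As a rough illustration, the sketch below hard-codes a handful of widely documented signatures and matches them against the start of a byte string. The rule table, its descriptions, and the function names are hypothetical. The real magic database is far larger and has its own file format.

```python
# Hypothetical, hand-written rule table: (offset, expected bytes, description).
# These signatures are well known, but this is not the real magic file.
MAGIC_RULES = [
    (0, b"\x89PNG\r\n\x1a\n", "PNG image data"),
    (0, b"\x7fELF", "ELF"),
    (0, b"%PDF-", "PDF document"),
    (0, b"GIF87a", "GIF image data"),
    (0, b"GIF89a", "GIF image data"),
    (0, b"\x1f\x8b", "gzip compressed data"),
    (0, b"PK\x03\x04", "Zip archive data"),
    (257, b"ustar", "POSIX tar archive"),  # a signature that is not at offset 0
]


def guess_by_magic(data):
    """Return the description of the first matching rule, or None."""
    for offset, expected, description in MAGIC_RULES:
        if data[offset:offset + len(expected)] == expected:
            if expected == b"\x7fELF" and len(data) > 4:
                # Follow-up test: byte 4 of an ELF header encodes the word size.
                bits = {1: "32-bit", 2: "64-bit"}.get(data[4], "unknown class")
                return f"{description} {bits}"
            return description
    return None


def first_bytes_hex(data, count=8):
    """Format the first `count` bytes as space-separated hex pairs."""
    return " ".join(f"{b:02X}" for b in data[:count])
```

For example, first_bytes_hex(b"\x89PNG\r\n\x1a\n") gives "89 50 4E 47 0D 0A 1A 0A", and guess_by_magic reports that same input as "PNG image data". The ELF branch shows how a second test refines the first match.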
If the magic tests can't identify the file type, the file is assumed to be a text file, and the file utility tries to determine the encoding of its contents. Encodings are distinguished by the different ranges and sequences of bytes that constitute printable text in each character set.
The line breaks are also examined, based on their hex values:
- 0A (\n) indicates a file with Un*x/Linux/BSD/OS X line endings
- 0D 0A (\r\n) indicates a file from Microsoft operating systems
- 0D (\r) would be Mac OS up to version 9
- 15 (\025) would be IBM systems using EBCDIC
Now the language tests start. If the file appears to be text, it is searched for particular strings to find out which language it contains (C, Perl, Bash). Some scripting languages can also be identified by the hashbang (#!/bin/interpreter) in the first line of the script.
If nothing applies, the file type can't be determined and the file utility just prints "data".
So, as you can see, there is no need for a suffix. A wrongly set suffix could even cause confusion.
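The text-related checks can be sketched in Python as well. This is a heavily simplified approximation under stated assumptions: it only distinguishes ASCII from UTF-8, ignores the EBCDIC case, and treats anything containing a NUL byte or invalid UTF-8 as "data". All function names and the output wording are made up for illustration.

```python
def line_terminators(data):
    """Count CRLF, lone CR, and lone LF terminators in a byte string."""
    crlf = data.count(b"\r\n")
    return {
        "CRLF": crlf,
        "CR": data.count(b"\r") - crlf,
        "LF": data.count(b"\n") - crlf,
    }


def interpreter_from_shebang(data):
    """Return the interpreter path from a leading '#!' line, or None."""
    if not data.startswith(b"#!"):
        return None
    first_line = data.splitlines()[0]
    parts = first_line[2:].split()
    return parts[0].decode("ascii", "replace") if parts else None


def describe_text(data):
    """Very rough text classification (assumption: ASCII or UTF-8 only)."""
    if not data:
        return "empty"
    if b"\x00" in data:
        return "data"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "data"
    encoding = "ASCII" if data.isascii() else "UTF-8"
    interpreter = interpreter_from_shebang(data)
    kind = f"script for {interpreter}" if interpreter else "text"
    counts = line_terminators(data)
    unusual = [name for name in ("CRLF", "CR") if counts[name]]
    suffix = f", with {', '.join(unusual)} line terminators" if unusual else ""
    return f"{encoding} {kind}{suffix}"
```

With this sketch, describe_text(b"a\r\nb\r\n") returns "ASCII text, with CRLF line terminators", and a byte string that is neither valid text nor matched by anything else falls through to "data", mirroring the fallback described above.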