Filesystems – How Are File Types Known If Not from File Suffix

Tags: file opening, files, filesystems, mime-types

I would like to know how file types are known if filenames don't have suffixes.

For example, a file named myfile could be binary or text to start with. How does the system know whether the file is binary or text?

Best Answer

The file utility determines the file type in three stages: filesystem tests, magic tests, and language tests.

First come the filesystem tests. Here file invokes one of the stat family of system calls on the file, which reports the Unix file type: regular file, directory, symbolic link, character device, block device, named pipe, or socket. Depending on the result (typically only for regular files), the magic tests are run next.
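
To make the filesystem test concrete, here is a minimal Python sketch of the same idea. It is not how file itself is implemented. The function name unix_file_type is made up for illustration, and using os.lstat (so that symbolic links are reported rather than followed) is an assumption.

```python
import os
import stat

def unix_file_type(path):
    """Classify a path by its inode type, without reading its contents."""
    mode = os.lstat(path).st_mode  # lstat: report symlinks instead of following them
    if stat.S_ISREG(mode):
        return "regular file"
    if stat.S_ISDIR(mode):
        return "directory"
    if stat.S_ISLNK(mode):
        return "symbolic link"
    if stat.S_ISCHR(mode):
        return "character device"
    if stat.S_ISBLK(mode):
        return "block device"
    if stat.S_ISFIFO(mode):
        return "named pipe"
    if stat.S_ISSOCK(mode):
        return "socket"
    return "unknown"
```

Only when this step reports a regular file is there any point in opening the file and looking at its bytes.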

The magic tests are a bit more complex. File types are guessed using a database of patterns called the magic file. Some file types can be determined by reading a byte or number at a particular offset within the file (binaries, for example). The magic file lists the "magic numbers" to look for in the file and the text to print when one of them matches. Those "magic numbers" can be short numeric values (typically 1 to 4 bytes), strings, dates, or even regular expressions. Further tests can yield additional information. For an executable, that would be whether it is dynamically linked, whether it is stripped, and which architecture it targets. Sometimes several tests must pass before the file type can be identified. However many tests are performed, though, the result is always just a good guess.

Here are the first 8 bytes of some common file types, to give a feeling for what these magic numbers can look like:

             Hexadecimal          ASCII
PNG   89 50 4E 47|0D 0A 1A 0A   ‰PNG|....
JPG   FF D8 FF E1|1D 16 45 78   ÿØÿá|..Ex
JPG   FF D8 FF E0|00 10 4A 46   ÿØÿà|..JF
ZIP   50 4B 03 04|0A 00 00 00   PK..|....
PDF   25 50 44 46|2D 31 2E 35   %PDF|-1.5
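
A magic test on the leading bytes can be sketched in a few lines of Python. The signatures below are taken from the table above. The names SIGNATURES, guess_by_magic and guess_file are hypothetical, and the real magic file is far richer (offsets, masks, nested tests), so treat this as an illustration only.

```python
# Leading-byte signatures from the table above. JPEG is matched on its first
# three bytes only, since the fourth byte differs between variants.
SIGNATURES = [
    (bytes.fromhex("89504E470D0A1A0A"), "PNG image"),
    (bytes.fromhex("FFD8FF"), "JPEG image"),
    (bytes.fromhex("504B0304"), "ZIP archive"),
    (b"%PDF-", "PDF document"),
]

def guess_by_magic(header):
    """Return a description if the header starts with a known signature."""
    for magic, description in SIGNATURES:
        if header.startswith(magic):
            return description
    return None  # no signature matched; later tests would take over

def guess_file(path):
    """Read the first 8 bytes of a file and run the signature check on them."""
    with open(path, "rb") as f:
        return guess_by_magic(f.read(8))
```

Calling guess_file("myfile") would then classify the file by its contents alone, regardless of its name.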

If the file type can't be found by the magic tests, file examines the contents to see whether they look like text and, if so, which encoding they use. The encodings are distinguished by the different ranges and sequences of bytes that constitute printable text in each character set.
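
The byte-range idea can be illustrated with a rough Python sketch. The set of allowed control bytes, the labels returned, and the function name guess_text_encoding are all assumptions chosen for the example. The real utility checks more encodings (UTF-16 and EBCDIC, for instance).

```python
# Control bytes that commonly occur in text: tab, LF, FF, CR, ESC.
TEXT_CONTROLS = {0x09, 0x0A, 0x0C, 0x0D, 0x1B}

def guess_text_encoding(data):
    """Return a rough encoding label, or None if data does not look like text."""
    # Any other control byte (NUL in particular) suggests binary content.
    if any((b < 0x20 and b not in TEXT_CONTROLS) or b == 0x7F for b in data):
        return None
    if all(b < 0x80 for b in data):
        return "ASCII"
    try:
        data.decode("utf-8")
        return "UTF-8"
    except UnicodeDecodeError:
        # Bytes 0x80-0x9F are control codes in ISO-8859-x, so finding them
        # points at some other 8-bit code page instead.
        if any(0x80 <= b <= 0x9F for b in data):
            return "non-ISO extended ASCII"
        return "ISO-8859"
```

The order matters: pure ASCII is also valid UTF-8 and valid ISO-8859, so the narrowest range is tested first.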

The line breaks are also examined, based on their hex values:

  • 0A (\n) indicates a file with Un*x/Linux/BSD/OS X line endings
  • 0D 0A (\r\n) indicates a file from a Microsoft operating system
  • 0D (\r) indicates classic Mac OS (up to version 9)
  • 15 (\025) is the EBCDIC newline (NEL), as used on IBM mainframe systems
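
Counting the terminators listed above is straightforward. Here is a small Python sketch that handles the three ASCII-based cases. The function name line_terminators is made up, and the 0x15 case is left out as a simplification.

```python
def line_terminators(data):
    """List which line terminator styles occur in a bytes object."""
    crlf = data.count(b"\r\n")
    # Subtract CRLF pairs so their bytes are not counted again as lone CR or LF.
    cr = data.count(b"\r") - crlf
    lf = data.count(b"\n") - crlf
    found = []
    if crlf:
        found.append("CRLF")
    if cr:
        found.append("CR")
    if lf:
        found.append("LF")
    return found
```

A file that mixes conventions returns more than one entry, which is worth knowing before converting it.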

Finally come the language tests. If the file appears to be text, file searches it for particular strings to work out which language it is written in (C, Perl, Bash). Some scripting languages can also be identified by the hashbang (#!/bin/interpreter) on the first line of the script.
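
The hashbang check is the easiest of these to sketch. The Python function below, with the made-up name interpreter_from_shebang, extracts the interpreter name from a script's first line. Treating "env" specially is an assumption about common usage, not something taken from the file utility.

```python
import posixpath

def interpreter_from_shebang(first_line):
    """Return the interpreter name from a '#!' line, or None if there is none."""
    if not first_line.startswith(b"#!"):
        return None
    parts = first_line[2:].split()
    if not parts:
        return None
    name = posixpath.basename(parts[0].decode("ascii", "replace"))
    # "#!/usr/bin/env python3" names the real interpreter as env's first argument.
    if name == "env" and len(parts) > 1:
        name = parts[1].decode("ascii", "replace")
    return name
```

For a script starting with #!/bin/bash this returns "bash", which is enough to label the file as a shell script without looking at its name.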

If none of these tests match, the file type can't be determined and file just prints "data".

So, as you can see, there is no need for a suffix. A suffix can even be misleading if it is set incorrectly.
