Filesystems – How Are File Types Known If Not from File Suffix

I would like to know how file types are known if filenames don't have suffixes.

For example, a file named myfile could be binary or text to start with. How does the system know whether the file is binary or text?

Best Answer

The file utility determines the file type in three stages: filesystem tests, magic tests and language tests.

First come the filesystem tests. Here one of the stat family of system calls is invoked on the file, which reports the Unix file type: regular file, directory, symbolic link, character device, block device, named pipe or socket. Depending on the result, the magic tests follow.
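The filesystem stage can be approximated in the shell. The sketch below does not call stat directly; it assumes the POSIX test operators, which report the same file types, are a fair stand-in. The function name file_kind is made up for illustration.

```shell
# Classify a path by its Unix file type, in the spirit of the
# filesystem stage of file(1). file_kind is a hypothetical helper.
file_kind() (
    path=$1
    # -L must come first: the other operators follow symbolic links.
    if   [ -L "$path" ]; then echo "symbolic link"
    elif [ -d "$path" ]; then echo "directory"
    elif [ -c "$path" ]; then echo "character device"
    elif [ -b "$path" ]; then echo "block device"
    elif [ -p "$path" ]; then echo "named pipe"
    elif [ -S "$path" ]; then echo "socket"
    elif [ -f "$path" ]; then echo "regular file"
    else echo "cannot stat"
    fi
)
```

For example, file_kind /dev/null reports a character device on a typical Linux system.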

The magic tests are a bit more complex. File types are guessed using a database of patterns called the magic file. Some file types can be determined by reading a few bits or a number at a particular offset within the file (binaries, for example). The magic file lists "magic numbers" to look for in the file, together with the text to print when one is found. Those "magic numbers" can be 1- to 4-byte values, strings, dates or even regular expressions. Further tests can then add detail: for an executable, whether it is dynamically linked, whether it is stripped, and which architecture it targets. Sometimes multiple tests must pass before the file type can be identified. However many tests are performed, the result is still only a good guess.

Here are the first 8 bytes of some common file types, to give a feeling for what these magic numbers look like:

             Hexadecimal          ASCII
PNG   89 50 4E 47|0D 0A 1A 0A   ‰PNG|....
JPG   FF D8 FF E1|1D 16 45 78   ÿØÿá|..Ex
JPG   FF D8 FF E0|00 10 4A 46   ÿØÿà|..JF
ZIP   50 4B 03 04|0A 00 00 00   PK..|....
PDF   25 50 44 46|2D 31 2E 35   %PDF|-1.5
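A minimal sketch of a magic test, using only the signatures from the table above. It is far simpler than the real magic file: it checks just the first 4 bytes, and the function name magic_guess is made up for illustration.

```shell
# Guess a file type from its leading bytes. Covers only the four
# signatures in the table; magic_guess is a hypothetical helper.
magic_guess() (
    # Dump the first 4 bytes as one lowercase hex string, e.g. "89504e47".
    sig=$(od -An -tx1 -N4 "$1" | tr -d ' \t\n')
    case $sig in
        89504e47) echo "PNG" ;;
        ffd8ff*)  echo "JPG" ;;   # the 4th byte varies (E0, E1, ...)
        504b0304) echo "ZIP" ;;
        25504446) echo "PDF" ;;
        *)        echo "unknown" ;;
    esac
)
```

Calling magic_guess on a real PNG image prints PNG regardless of what the file is named.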

If the magic tests do not identify the file, file checks whether it appears to be a text file and, if so, which encoding it uses. Encodings are told apart by the ranges and sequences of bytes that constitute printable text in each character set.
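The idea of a byte-range check can be sketched for the simplest case, plain ASCII. This is an assumption-laden simplification: it accepts only tab, LF, CR and the printable range 0x20 to 0x7E, whereas file itself knows several encodings and more control characters. The name looks_ascii is made up for illustration.

```shell
# Succeed if every byte is tab, LF, CR or printable ASCII.
# looks_ascii is a hypothetical helper, not part of file(1).
looks_ascii() (
    LC_ALL=C; export LC_ALL
    # Delete all bytes allowed in ASCII text, then count what is left.
    rest=$(( $(tr -d '\11\12\15\40-\176' < "$1" | wc -c) ))
    [ "$rest" -eq 0 ]
)
```

It is meant to be used as a condition, for example: if looks_ascii myfile; then echo "ASCII text"; fi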

The line breaks are examined too, based on their hex values:

0A (\n) marks a Unix-style file (Linux, BSD, OS X)
0D 0A (\r\n) marks a file from a Microsoft operating system
0D (\r) marks a file from Mac OS up to version 9
15 (\025) is the EBCDIC newline (NEL) used on IBM mainframes
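The line-ending check can be sketched by counting CR and LF bytes. This is a heuristic under stated assumptions: equal counts are taken to mean CRLF pairs, and the 0x15 case is not handled. The name line_endings is made up for illustration.

```shell
# Report the line terminator style of a file by counting CR and LF bytes.
# line_endings is a hypothetical helper; equal counts are assumed to be CRLF.
line_endings() (
    LC_ALL=C; export LC_ALL
    lf=$(( $(tr -cd '\n' < "$1" | wc -c) ))
    cr=$(( $(tr -cd '\r' < "$1" | wc -c) ))
    if   [ "$cr" -eq 0 ] && [ "$lf" -eq 0 ]; then echo "none"
    elif [ "$cr" -eq 0 ]; then echo "LF"
    elif [ "$lf" -eq 0 ]; then echo "CR"
    elif [ "$cr" -eq "$lf" ]; then echo "CRLF"
    else echo "mixed"
    fi
)
```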

Finally come the language tests. If the file appears to be text, it is searched for particular strings to work out which language it is written in (C, Perl, Bash). Some scripting languages can also be identified by the hashbang (#!/bin/interpreter) on the first line of the script.
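The hashbang check is easy to sketch: read the first line and see whether it starts with #!. The function name shebang_interpreter is made up for illustration, and it returns only the first word after #!, so a script using /usr/bin/env reports /usr/bin/env rather than the final interpreter.

```shell
# Print the interpreter path from a script's first line, or fail if the
# file has no hashbang. shebang_interpreter is a hypothetical helper.
shebang_interpreter() (
    # Read only the first line; accept a last line without a newline.
    IFS= read -r first < "$1" || [ -n "$first" ] || exit 1
    case $first in
        '#!'*) ;;
        *) exit 1 ;;
    esac
    rest=${first#??}   # drop the leading "#!"
    set -f             # no globbing during word splitting
    set -- $rest       # split on blanks; $1 is the interpreter
    [ $# -gt 0 ] || exit 1
    printf '%s\n' "$1"
)
```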

If nothing applies to the file, the file type can't be determined and file just prints "data".

So, as you can see, there is no need for a suffix. A suffix can even mislead if it is set wrongly.

Related Solutions

Ubuntu – How to open a .bak file on Linux

.bak generally designates that the file is a backup copy of something, but beyond that it gives precious little information about the actual file type.

Try looking at the output of the file command, which examines the first few bytes of the file to see whether it recognizes a known file type:

caleburn: ~/ >file image001.jpg 
image001.jpg: JPEG image data, JFIF standard 1.01
caleburn: ~/ >file oops.png 
oops.png: PNG image data, 935 x 546, 16-bit/color RGB, non-interlaced
caleburn: ~/ >file zones.zip 
zones.zip: Zip archive data, at least v2.0 to extract
caleburn: ~/ >file eth2.pcap 
eth2.pcap: tcpdump capture file (little-endian) - version 2.4 (Ethernet, capture length 96)

And so on. Once you know what type of file Linux thinks it is, Google should be able to suggest how to open it.
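When the result is going to be fed to a script rather than read by a person, the MIME type is often easier to work with than the free-text description. A small sketch, assuming a file implementation that supports the -b (brief) and --mime-type options; the wrapper name mime_of is made up for illustration, and the exact strings printed depend on the installed magic database.

```shell
# Print only the MIME type of a file, without the filename prefix.
# mime_of is a hypothetical wrapper around file(1).
mime_of() {
    file -b --mime-type -- "$1"
}
```

For example, mime_of image001.bak could be used to check whether a renamed backup is really an image before choosing a program to open it.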

... Alternately, you can ask whoever sent it to you what the original filename was supposed to be and find out that way. :)

How to remove multiple files with a common prefix and suffix

rm sequence_1*.hmf

removes files beginning with sequence_1 and ending with .hmf.

Globbing is the process in which your shell takes a pattern and expands it into a list of filenames matching that pattern. Do not confuse it with regular expressions, which are different. If you spend most of your time in bash, the Wooledge Wiki has a good page on globbing (pathname expansion). If you want maximum portability, you'll want to read the POSIX spec on pattern matching as well or instead.

In the unlikely case you run into an "Argument list too long" error, you can take a look at BashFAQ 95, which addresses this. The simplest workaround is to break up the glob pattern into multiple smaller chunks, until the error goes away. In your case, you could probably get away with splitting the match by prefix digits 0 through 9, as follows:

for c in {0..9}; do rm sequence_1_"$c"*.hmf; done
rm sequence_1*.hmf  # catch-all case
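Another way around the limit is to let find do the matching, so the shell never expands the pattern onto one command line. A sketch, assuming a find that supports -maxdepth (GNU and BSD find do); the function name remove_matching is made up for illustration.

```shell
# Remove regular files in one directory whose names match a glob pattern,
# without expanding the pattern in the shell.
# remove_matching is a hypothetical helper; -maxdepth is a GNU/BSD extension.
remove_matching() (
    dir=$1
    pattern=$2
    # "-exec ... {} +" batches filenames so each rm call stays under the limit.
    find "$dir" -maxdepth 1 -type f -name "$pattern" -exec rm -f -- {} +
)
```

For the case above it would be called as remove_matching . 'sequence_1*.hmf', with the pattern quoted so that the shell passes it to find unexpanded.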
