File Command – Classifying Files in Linux


I need to recognize the type of data contained in random files. I am new to Linux.

I am planning to use the file command to understand what type of data a file has. I tried that command and got the output below.

Someone suggested to me that the file command looks at the initial bytes of a file to determine the data type, and that it doesn't look at the file extension at all. Is that correct? I looked at the man page but found it too technical. I would appreciate a link to a simpler explanation of how the file command works.

What are the different possible answers that I could get after running the file command? For example, in the transcript below I get JPEG, ISO media, ASCII, etc.

The screen output is as follows:

m7% file date-file.csv
date-file.csv: ASCII text, with CRLF line terminators
m7% file image-file.JPG
image-file.JPG: JPEG image data, EXIF standard
m7% file music-file.m4a
music-file.m4a: ISO Media, MPEG v4 system, iTunes AAC-LC
m7% file numbers-file.txt
numbers-file.txt: ASCII text
m7% file pdf-file.pdf
pdf-file.pdf: PDF document, version 1.4
m7% file text-file.txt
text-file.txt: ASCII text
m7% file video-file.MOV
video-file.MOV: data


Update 1

Thanks for the answers; they clarified a couple of things for me.

So, if I understand correctly, /usr/share/mime/magic holds a database that lists the file formats currently recognized (the outputs that I can get when I run the file command on a file). Is that correct? Is it true that whenever the file command's output contains the word "text", it refers to something that you can read with a text viewer, and anything without "text" is some kind of binary?

Best Answer

file uses several kinds of test:

1: If file does not exist, cannot be read, or its file status could not be determined, the output shall indicate that the file was processed, but that its type could not be determined.

This will be output like cannot open file: No such file or directory.

2: If the file is not a regular file, its file type shall be identified. The file types directory, FIFO, socket, block special, and character special shall be identified as such. Other implementation-defined file types may also be identified. If file is a symbolic link, by default the link shall be resolved and file shall test the type of file referenced by the symbolic link. (See the -h and -i options below.)

This will be output like .: directory and /dev/sda: block special. The format for this and the previous point is partially defined by POSIX, so you can rely on certain strings being in the output.

3: If the length of file is zero, it shall be identified as an empty file.

This is foo: empty.
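As a rough illustration of tests 1 to 3, here is a minimal Python sketch that classifies a path from its status information alone, before any content is read. The function name describe_non_regular and the exact label strings are my own choices for the example; this is not the source of the real file utility.

```python
import os
import stat

def describe_non_regular(path):
    """Return a file-style label for missing, special, or empty files, else None."""
    try:
        # os.stat follows symbolic links, mirroring the default behaviour quoted above.
        st = os.stat(path)
    except OSError as exc:
        return f"cannot open ({exc.strerror})"
    mode = st.st_mode
    if stat.S_ISDIR(mode):
        return "directory"
    if stat.S_ISFIFO(mode):
        return "fifo"
    if stat.S_ISSOCK(mode):
        return "socket"
    if stat.S_ISBLK(mode):
        return "block special"
    if stat.S_ISCHR(mode):
        return "character special"
    if stat.S_ISREG(mode) and st.st_size == 0:
        return "empty"
    # A non-empty regular file: the content-based tests would run next.
    return None
```

Only when this returns None would the content-based tests described next come into play.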

4: The file utility shall examine an initial segment of file and shall make a guess at identifying its contents based on position-sensitive tests. (The answer is not guaranteed to be correct; see the -d, -M, and -m options below.)

5: The file utility shall examine file and make a guess at identifying its contents based on context-sensitive default system tests. (The answer is not guaranteed to be correct.)

These two use magic number identification and are the most interesting part of the command. A magic number is a special sequence of bytes that's in a known place in a file that identifies its type. Traditionally that place is the first two bytes, but the term has been extended further to include longer strings and other locations. See this other question for more detail about magic numbers in the file command.
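To make the idea of a magic number concrete, here is a small Python sketch of position-sensitive matching. The SIGNATURES table is a tiny hand-picked list of well-known signatures, and sniff and sniff_path are hypothetical names for this example; the real database is far larger and its rules are more elaborate.

```python
# Hypothetical signature table: (offset, expected bytes, label).
# These are well-known signatures, but this is not libmagic's actual rule set.
SIGNATURES = [
    (0, b"%PDF-", "PDF document"),
    (0, b"\xff\xd8\xff", "JPEG image data"),
    (0, b"\x89PNG\r\n\x1a\n", "PNG image data"),
    (4, b"ftyp", "ISO Media"),  # the marker sits at offset 4, not at the start
]

def sniff(data):
    """Guess a label from the leading bytes of a file's content."""
    if not data:
        return "empty"
    for offset, magic, label in SIGNATURES:
        if data[offset:offset + len(magic)] == magic:
            return label
    # Nothing matched: fall back to the catch-all answer.
    return "data"

def sniff_path(path, nbytes=64):
    """Read only an initial segment of the file and classify it."""
    with open(path, "rb") as fh:
        return sniff(fh.read(nbytes))
```

Note that the file name never reaches sniff, so in this sketch renaming image-file.JPG to image-file.txt would not change the answer, which is the behaviour you asked about.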

The file command has a database of these numbers and the types they correspond to. That database is often given as /usr/share/mime/magic, which maps file contents to MIME types, although on many Linux systems the file command itself reads its own magic file, typically /usr/share/misc/magic (file --version reports which one is in use). The output there (often part of file -i if you don't get it by default) will be a defined media type or an extension. "Context-sensitive tests" use the same sort of approach, but are a bit fuzzier. None of these are guaranteed to be right, but they're intended to be good guesses.
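A fuzzier, content-wide guess could look something like the Python sketch below, which decides whether a buffer looks like plain ASCII text and notes CRLF line endings, as in the date-file.csv line of your transcript. The function name describe_text and the set of allowed control characters are assumptions made for the example; the real implementation also handles other encodings and many more cases.

```python
def describe_text(data):
    """Return an 'ASCII text' style label, or None if data does not look like ASCII text."""
    if not data:
        return None
    # Assumed set of control characters tolerated in text: tab, LF, form feed, CR.
    allowed_controls = {9, 10, 12, 13}
    for byte in data:
        if byte >= 127:
            return None  # DEL or a non-ASCII byte
        if byte < 32 and byte not in allowed_controls:
            return None  # a control character that plain text rarely contains
    label = "ASCII text"
    if b"\r\n" in data:
        label += ", with CRLF line terminators"
    return label
```

Unlike a magic number check, this looks at every byte it is given rather than at one fixed position, which is why such tests are only a good guess.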

file also has a database mapping those types to names, by which it will know that a file it has identified as application/pdf can be described as a PDF document. Those human-readable names may be localised to another language too. These will always be a high-level description of the file type, written for a person to understand rather than for a machine to parse.

The majority of different outputs you can get will come from these stages. You can look at the magic file for a list of supported types and how they're identified - my system knows 376 different types. The names given and the types supported are determined by your system packaging and configuration, and so your system may support more or fewer than mine, but there are generally a lot of them. libmagic also includes additional hard-coded tests in it.

6: The file shall be identified as a data file.

This is foo: data, when it failed to figure out anything at all about the file.

There are also other little tags that can appear. An executable (+x) file will include "executable" in the output, usually comma-separated. The file implementation may also know extra things about some file formats to be able to describe additional points about them, as in your "PDF document, version 1.4".
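The "version 1.4" detail is possible because a PDF file starts with a header such as %PDF-1.4, so the version can be read straight after the magic number. Here is a minimal Python sketch of that idea; describe_pdf is a hypothetical helper for illustration, not how the file implementation is actually written.

```python
import re

def describe_pdf(data):
    """Return a 'PDF document, version X.Y' label, or None if data is not a PDF header."""
    # re.match anchors at the start of the buffer, where the PDF header lives.
    match = re.match(rb"%PDF-(\d+\.\d+)", data)
    if match is None:
        return None
    version = match.group(1).decode("ascii")
    return f"PDF document, version {version}"
```

For example, describe_pdf(b"%PDF-1.4\n") gives the same description as the pdf-file.pdf line in your transcript.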
