I need to recognize the type of data contained in arbitrary files. I am new to Linux.
I am planning to use the file command to understand what type of data a file has. I tried that command and got the output below.
Someone suggested to me that the file command looks at the initial bytes of a file to determine the data type, and that it doesn't look at the file extension at all. Is that correct? I looked at the man page but felt that it was too technical. I would appreciate it if anyone could provide a link to a much simpler explanation of how the file command works.
What are the different possible answers that I could get after running the file command? For example, in the transcript below I get JPEG, ISO Media, ASCII, etc.
The screen output is as follows:
m7% file date-file.csv
date-file.csv: ASCII text, with CRLF line terminators
m7% file image-file.JPG
image-file.JPG: JPEG image data, EXIF standard
m7% file music-file.m4a
music-file.m4a: ISO Media, MPEG v4 system, iTunes AAC-LC
m7% file numbers-file.txt
numbers-file.txt: ASCII text
m7% file pdf-file.pdf
pdf-file.pdf: PDF document, version 1.4
m7% file text-file.txt
text-file.txt: ASCII text
m7% file video-file.MOV
video-file.MOV: data
Update 1
Thanks for the answers; they clarified a couple of things for me.
So if I understand correctly, /usr/share/mime/magic has a database that tells me which file formats are currently possible (the outputs that I can get when I type the file command followed by a file name). Is that correct? Is it true that whenever the file command's output contains the word "text", it refers to something that you can read with a text viewer, and anything without "text" is some kind of binary?
Best Answer
file uses several kinds of test, tried in order:

1. If the file does not exist or cannot be read, its type cannot be determined. This will be output like "cannot open file: No such file or directory".
2. If the file is not a regular file, its file type is reported. This will be output like ".: directory" and "/dev/sda: block special". Much of the format for this and the previous point is partially defined by POSIX - you can rely on certain strings being in the output.
3. If the file is a regular file of zero length, it is reported as empty. This is "foo: empty".
4. Position-sensitive tests on the initial part of the file.
5. Context-sensitive tests on the initial part of the file.

   These two use magic number identification and are the most interesting part of the command. A magic number is a special sequence of bytes in a known place in a file that identifies its type. Traditionally that place is the first two bytes, but the term has been extended to include longer strings and other locations. See this other question for more detail about magic numbers in the file command.

   The file command has a database of these numbers and what type they correspond to; that database is usually in /usr/share/mime/magic (the exact path varies by system; /usr/share/misc/magic is another common location), and maps file contents to MIME types. The output there (often part of file -i if you don't get it by default) will be a defined media type or an extension. "Context-sensitive tests" use the same sort of approach, but are a bit fuzzier. None of these are guaranteed to be right, but they're intended to be good guesses.

   file also has a database mapping those types to names, by which it will know that a file it has identified as application/pdf can be described as a PDF document. Those human-readable names may be localised to another language too. These will always be some high-level description of the file type in a way a person will understand, rather than a machine.

   The majority of different outputs you can get will come from these stages. You can look at the magic file for a list of supported types and how they're identified - my system knows 376 different types. The names given and the types supported are determined by your system packaging and configuration, so your system may support more or fewer than mine, but there are generally a lot of them. libmagic also includes additional hard-coded tests.
6. If nothing else matches, the file is identified as data. This is "foo: data", when it failed to figure out anything at all about the file.

There are also other little tags that can appear. An executable (+x) file will include "executable" in the output, usually comma-separated. The file implementation may also know extra things about some file formats so that it can describe additional points about them, as in your "PDF document, version 1.4".