Why do tar and gzip files usually have a file extension?

filenames files

File extensions are not necessary on Unix systems, yet every tarred, gzipped or bzipped file I encounter has an extension such as .tar, .tar.gz or .tgz.

Is there any special reason for that or is that just convention?

Best Answer

Originally, on unix systems, the extensions on file names were a matter of convention. They allowed a human being to choose the right program to open a file. The modern convention is to use extensions in most cases; common exceptions are:

  • Only regular files have an extension, not directories or device names. The mere fact of being a directory or device is enough indication of the file type.
  • Executables that are meant to be invoked directly don't have an extension. The mere fact of being executable is enough information for the user, and the kernel doesn't care about file names.
  • Files beginning with a word in all caps are often text files, e.g. README, TODO. Sometimes there is an additional part that indicates a subcategory, e.g. INSTALL.linux, INSTALL.solaris.
  • Files whose name begins with a dot are configuration or state files of a particular application, and often don't have an extension, e.g. .bashrc, .profile, .emacs.
  • There are a few traditional cases, e.g. Makefile.

(These are common cases, not hard-and-fast rules.)

Most binary file formats also contain some kind of header that describes properties of the file, and typically allows the file format to be identified through magic numbers. The file command looks at this information and shows you its guesses.
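To make the idea of magic numbers concrete, here is a minimal Python sketch of content-based detection. It is not how the file command is implemented; the MAGIC table and the sniff and make_tar names are made up for this illustration, and the table covers only four formats.

```python
import bz2
import gzip
import io
import tarfile

# (offset, magic bytes, label): a tiny illustrative table, nowhere near
# the size of the database that a real tool like file(1) consults.
MAGIC = [
    (0, b"\x1f\x8b", "gzip"),
    (0, b"BZh", "bzip2"),
    (0, b"PK\x03\x04", "zip"),
    (257, b"ustar", "tar"),  # the tar magic sits inside the first header block
]


def sniff(data):
    """Guess the format of `data` from its magic bytes (hypothetical helper)."""
    for offset, magic, label in MAGIC:
        if data[offset:offset + len(magic)] == magic:
            return label
    return "unknown"


def make_tar():
    """Build a tiny tar archive in memory, just to have something to sniff."""
    payload = b"hello\n"
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        info = tarfile.TarInfo(name="hello.txt")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


if __name__ == "__main__":
    print(sniff(gzip.compress(b"hello")))  # gzip
    print(sniff(bz2.compress(b"hello")))   # bzip2
    print(sniff(make_tar()))               # tar
    print(sniff(b"plain text"))            # unknown
```

Note that the file name never enters into it: the guess comes entirely from the bytes.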

Sometimes the file extension gives more information than the file format, and sometimes it's the other way round. For example, many file formats consist of a zip archive: Java libraries (.jar), OpenOffice documents (.odt, …), Microsoft Office documents (.docx, …), etc. Another example is source code files, where the extension indicates the programming language, which can be difficult for a computer to guess automatically from the file contents. Conversely, some extensions are wildly ambiguous: .o, for example, is used for compiled code files (object files) of every kind, but inspecting the file contents usually reveals easily which machine type and operating system the object file is for.
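The zip case can be sketched in a few lines of Python. The contents only say "this is a zip archive"; it is the extension that narrows that down. The ZIP_BASED table and the describe and make_zip names are hypothetical and exist only for this example.

```python
import io
import os
import zipfile

# Hypothetical table: what a zip container "really" is, judged by extension.
ZIP_BASED = {
    ".jar": "Java library",
    ".odt": "OpenOffice text document",
    ".docx": "Microsoft Office document",
}


def describe(name, data):
    """Combine a content check (is it a zip?) with the extension."""
    if not zipfile.is_zipfile(io.BytesIO(data)):
        return "not a zip archive"
    ext = os.path.splitext(name)[1].lower()
    # Without a known extension, the contents alone only say "zip".
    return ZIP_BASED.get(ext, "zip archive")


def make_zip():
    """Build a minimal zip archive in memory for the demonstration."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("content.txt", "hello")
    return buf.getvalue()


if __name__ == "__main__":
    data = make_zip()
    # The same bytes get three different descriptions depending on the name.
    for name in ("lib.jar", "letter.odt", "backup.zip"):
        print(name, "->", describe(name, data))
```

A real .jar or .odt has a specific internal layout that a thorough tool could inspect, so this sketch deliberately stops at the container level.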

An advantage of the extension is that it's a lot faster to recognize it than to open the file and look for magic sequences. For example, file name completion in shells is almost always based on the name (mainly the extension), because reading every file in a large directory can take a long time, whereas just reading the file names is fast enough for a Tab press.
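As a rough illustration of name-based completion (not the implementation of any particular shell), the following Python sketch filters candidates using nothing but the names. The complete function and its parameters are invented for this example.

```python
def complete(names, prefix, suffixes):
    """Return the names that start with `prefix` and end in one of `suffixes`.

    Only the strings are examined; no file is ever opened, which is why
    this stays fast even for a very large directory.
    """
    wanted = tuple(suffixes)
    return sorted(n for n in names if n.startswith(prefix) and n.endswith(wanted))


if __name__ == "__main__":
    # In practice the names would come from something like os.listdir(".").
    names = ["archive.tar.gz", "archive.txt", "backup.tgz", "notes.md"]
    print(complete(names, "a", [".tar.gz", ".tgz"]))
```

A content-based version would have to open and read the start of every candidate file before it could offer anything.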

Sometimes changing a file's extension lets you say how the file is to be interpreted, when two file formats are almost, but not wholly, identical. For example, a web server might treat .shtml and .html differently, the former undergoing some server-side preprocessing, the latter being served as-is.

In the case of gzip, the program won't recompress files whose names end in .gz, .tgz or a few other extensions. That way you can run gzip * to compress every file in a directory, and already compressed files are not modified.
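The skipping behaviour can be imitated with a short Python sketch. This is not gzip's own code: gzip_all is a hypothetical helper, and SKIP_SUFFIXES lists only the two extensions named above, whereas the real program recognizes a few more.

```python
import gzip
import os
import shutil

# Assumption for this sketch: only the two suffixes mentioned above.
SKIP_SUFFIXES = (".gz", ".tgz")


def gzip_all(directory):
    """Compress every regular file in `directory`, roughly like `gzip *`.

    Files whose names already carry a compressed suffix are left alone.
    Returns the names of the files that were compressed.
    """
    compressed = []
    # Take the listing once, so freshly written .gz files are not revisited.
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path) or name.endswith(SKIP_SUFFIXES):
            continue
        with open(path, "rb") as src, gzip.open(path + ".gz", "wb") as dst:
            shutil.copyfileobj(src, dst)
        os.remove(path)  # like gzip, replace the original with the .gz file
        compressed.append(name)
    return compressed
```

The decision is made on the name alone, so a gzip-compressed file that has been renamed to lose its suffix would be compressed a second time by this sketch.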