Why do tar and gzip files usually have a file extension?

File extensions are not necessary on unices, yet every tarred, gzipped or bzipped file I encounter has an extension such as .tar, .tar.gz or .tgz.

Is there any special reason for that or is that just convention?

Best Answer

Originally, on unix systems, the extensions on file names were a matter of convention. They allowed a human being to choose the right program to open a file. The modern convention is to use extensions in most cases; common exceptions are:

Only regular files have an extension, not directories or device names. The mere fact of being a directory or device is enough file type indication.
Executables that are meant to be invoked directly don't have an extension. The mere fact of being executable is enough information for the user, and the kernel doesn't care about file names.
Files beginning with a word in all caps are often text files, e.g. README, TODO. Sometimes there is an additional part that indicates a subcategory, e.g. INSTALL.linux, INSTALL.solaris.
Files whose name begins with a dot are configuration or state files of a particular application, and often don't have an extension, e.g. .bashrc, .profile, .emacs.
There are a few traditional cases, e.g. Makefile.

(These are common cases, not hard-and-fast rules.)

Most binary file formats also contain some kind of header that describes properties of the file, and typically allows the file format to be identified through magic numbers. The file command looks at this information and shows you its guesses.
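
As a rough illustration of magic numbers: a gzip stream starts with the two bytes 1f 8b, so it can be recognized even when the name carries no extension. The file name mystery and the variable names below are made up for this sketch, and file itself checks far more than two bytes.

```shell
# Create a gzip-compressed file whose name gives no hint about its format.
tmpdir=$(mktemp -d)
printf 'hello\n' | gzip -c > "$tmpdir/mystery"

# Read the first two bytes as hex; od is asked for no offsets (-An),
# one-byte hex units (-tx1) and only two bytes of input (-N2).
magic=$(od -An -tx1 -N2 "$tmpdir/mystery" | tr -d ' \n')

# Classify by magic number instead of by name.
case $magic in
  1f8b) kind=gzip;;
  *)    kind=unknown;;
esac
echo "$kind"

rm -rf "$tmpdir"
```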

Sometimes the file extension gives more information than the file format, sometimes it's the other way round. For example, many file formats are zip archives underneath: Java libraries (.jar), OpenOffice documents (.odt, …), Microsoft Office documents (.docx, …), etc. Another example is source code files, where the extension indicates the programming language, which can be difficult for a computer to guess automatically from the file contents. Conversely, some extensions are wildly ambiguous: for example, .o is used for compiled code files (object files), but inspection of the file contents usually easily reveals what machine type and operating system the object file is for.

An advantage of the extension is that it's a lot faster to recognize it than to open the file and look for magic sequences. For example completion of file names in shells is almost always based on the name (mainly the extension), because reading every file in a large directory can take a long time whereas just reading the file names is fast enough for a Tab press.

Sometimes changing a file's extension can allow you to say how a file is to be interpreted, when two file formats are almost, but not wholly identical. For example a web server might treat .shtml and .html differently, the former undergoing some server-side preprocessing, the latter being served as-is.

In the case of gzip archives, gzip won't recompress files whose name ends in .gz, .tgz and a few other extensions. That way you can run gzip * to compress every file in a directory, and already compressed files are not modified.
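
A small sketch of that behaviour, assuming GNU gzip (other implementations may differ). The file names notes.txt and old.gz are invented for the example. gzip prints a warning for the skipped file and exits with a non-zero status, which is why the warning is discarded and the status ignored here.

```shell
tmpdir=$(mktemp -d)
printf 'plain text\n' > "$tmpdir/notes.txt"
printf 'already compressed\n' | gzip -c > "$tmpdir/old.gz"

# Checksum the existing archive so we can tell whether gzip touches it.
before=$(cksum < "$tmpdir/old.gz")

# Compress everything in the directory; old.gz is skipped because of its name.
gzip "$tmpdir"/* 2>/dev/null || true

after=$(cksum < "$tmpdir/old.gz")
listing=$(ls "$tmpdir")
echo "$listing"

rm -rf "$tmpdir"
```

Only notes.txt is replaced by notes.txt.gz; old.gz keeps the same checksum.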

Related Solutions

How to Grab File Extension in Bash

If the file name is file-1.0.tar.bz2, the extension is bz2. The method you're using to extract the extension (fileext=${filename##*.}) is perfectly valid¹.

How do you decide that you want the extension to be tar.bz2 and not bz2 or 0.tar.bz2? You need to answer this question first. Then you can figure out what shell command matches your specification.
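
To make the choice concrete, here is what the basic POSIX parameter expansions give for that name. The variable names last, all and stem are only for illustration. Note that none of them yields tar.bz2 directly, which is why a specification is needed.

```shell
filename=file-1.0.tar.bz2

last=${filename##*.}   # strip the longest prefix matching "*."  -> bz2
all=${filename#*.}     # strip the shortest prefix matching "*." -> 0.tar.bz2
stem=${filename%.*}    # strip the shortest suffix matching ".*" -> file-1.0.tar

echo "$last $all $stem"
```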

One possible specification is that extensions must begin with a letter. This heuristic fails for a few common extensions like 7z, which might be best treated as a special case. Here's a bash/ksh/zsh implementation:

```
basename=$filename; fileext=
while [[ $basename = ?*.* &&
         ( ${basename##*.} = [A-Za-z]* || ${basename##*.} = 7z ) ]]
do
  fileext=${basename##*.}.$fileext
  basename=${basename%.*}
done
fileext=${fileext%.}
```

For POSIX portability, you need to use a case statement for pattern matching.

```
while case $basename in
        ?*.*) case ${basename##*.} in [A-Za-z]*|7z) true;; *) false;; esac;;
        *) false;;
      esac
do …
```
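
Put together as a function, the POSIX variant might look like the sketch below. It assumes the loop body is the same as in the bash/ksh/zsh version above; the function name get_ext is hypothetical. POSIX sh has no local variables, so basename and fileext are plain globals here.

```shell
# Hypothetical helper: print the (possibly multi-part) extension of $1,
# where each part must begin with a letter or be exactly "7z".
get_ext() {
  basename=$1 fileext=
  while case $basename in
          ?*.*) case ${basename##*.} in [A-Za-z]*|7z) true;; *) false;; esac;;
          *) false;;
        esac
  do
    # Prepend the last component to the extension, then drop it from the name.
    fileext=${basename##*.}.$fileext
    basename=${basename%.*}
  done
  # Remove the trailing dot left by the loop.
  printf '%s\n' "${fileext%.}"
}

get_ext file-1.0.tar.bz2
get_ext archive.7z
```

Names without a qualifying extension, such as README or .bashrc, produce an empty line.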

Another possible specification is that some extensions denote encodings and indicate that further stripping is needed. Here's a bash/ksh/zsh implementation (requiring shopt -s extglob under bash and setopt ksh_glob under zsh):
```
basename=$filename
fileext=
while [[ $basename = ?*.@(bz2|gz|lzma) ]]; do
  fileext=${basename##*.}.$fileext
  basename=${basename%.*}
done
if [[ $basename = ?*.* ]]; then
  fileext=${basename##*.}.$fileext
  basename=${basename%.*}
fi
fileext=${fileext%.}
```
Note that this considers 0 to be an extension in file-1.0.gz.
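
For shells without extended globs, the same specification can be sketched with case patterns. This is an untested-in-the-original adaptation, and the function name get_ext_enc is hypothetical. It also shows the file-1.0.gz behaviour mentioned above.

```shell
# Hypothetical helper: strip encoding suffixes (bz2, gz, lzma) first,
# then take one more extension if there is one.
get_ext_enc() {
  basename=$1 fileext=
  while case $basename in ?*.bz2|?*.gz|?*.lzma) true;; *) false;; esac
  do
    fileext=${basename##*.}.$fileext
    basename=${basename%.*}
  done
  case $basename in
    ?*.*) fileext=${basename##*.}.$fileext
          basename=${basename%.*};;
  esac
  # Remove the trailing dot left by the steps above.
  printf '%s\n' "${fileext%.}"
}

get_ext_enc file-1.0.tar.bz2
get_ext_enc file-1.0.gz
```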

¹ ${VARIABLE##PATTERN} and related constructs are in POSIX, so they work in any non-antique Bourne-style shell such as ash, bash, ksh or zsh.

Which extension to use for text files? (Unix/Linux)

UNIX/Linux does not have the same early DOS / CP/M heritage that Windows does. So extensions are generally less significant to most UNIX utilities and tools.

I usually use a command-line only environment. Extensions in such an environment under Linux aren't really significant except as a convenience to the operator or user. (I don't have enough experience with KDE or GNOME to know how their filemanagers deal with extensions.)

But such convenience is usually important. If config.ini is really in Microsoft-standard ".ini" format, I'd let the extension stand. Plain old text files usually carry no extension in Linux, but this isn't universal for all programs' configuration files. The programmer usually gets to decide that.

I think ".txt" is useful under Linux if you want to emphasize that it's NOT a configuration file or other machine-readable document. However, in source distributions, the convention is to name such files in all caps without an extension (e.g. README, INSTALL, COPYING).

There are some standards and conventions but nothing stopping you from naming anything whatever you want, unless you are sharing things with others.

In Windows, naming a file .exe indicates to the shell (usually explorer.exe) that it's an executable file. UNIX builds this knowledge into the file system's permissions. If the proper x bits (see man chmod) are set, the file is recognized as executable by shells and kernel functions (I believe). Beyond this, Linux doesn't care, most shells won't care, and most programs look in the file to find its "type."
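
A quick way to see this for yourself is the sketch below. The script name greet is made up, and it assumes the temporary directory is not mounted noexec.

```shell
tmpdir=$(mktemp -d)

# A script with no extension at all.
printf '#!/bin/sh\necho hello\n' > "$tmpdir/greet"

# Freshly created files normally lack the x bits.
if [ -x "$tmpdir/greet" ]; then before=yes; else before=no; fi

# Setting the x bits is what makes it runnable; the name plays no part.
chmod +x "$tmpdir/greet"
out=$("$tmpdir/greet")
echo "executable before chmod: $before, output after chmod: $out"

rm -rf "$tmpdir"
```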

Of course, there's the nice command file, which can analyze the file and tell you what it is with a degree of certainty. I believe that if it can't match the data in the file with any known type, and the file contains only printable ASCII/Unicode characters, it assumes it's a text file.

@Bruce Ediger below is absolutely correct. There is nothing at the kernel or filesystem level, i.e. in Linux itself, enforcing or caring that the contents of a file need to match up with its name, or with the program that is supposed to understand it. This doesn't mean it's not possible to create a shell or launcher utility that does things based on the filename.
