If the file name is file-1.0.tar.bz2, the extension is bz2. The method you're using to extract the extension (fileext=${filename##*.}) is perfectly valid¹.
How do you decide that you want the extension to be tar.bz2 and not bz2 or 0.tar.bz2? You need to answer this question first. Then you can figure out what shell command matches your specification.
One possible specification is that extensions must begin with a letter. This heuristic fails for a few common extensions like 7z, which might be best treated as a special case. Here's a bash/ksh/zsh implementation:
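basename=$filename; fileext=
# Peel off trailing components while they look like extensions:
# they begin with a letter, or are the special case 7z.
while [[ $basename = ?*.* &&
         ( ${basename##*.} = [A-Za-z]* || ${basename##*.} = 7z ) ]]
do
    fileext=${basename##*.}.$fileext
    basename=${basename%.*}
done
fileext=${fileext%.}    # drop the trailing dot left by the loop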
For POSIX portability, you need to use a case statement for pattern matching.
while case $basename in
?*.*) case ${basename##*.} in [A-Za-z]*|7z) true;; *) false;; esac;;
*) false;;
esac
do …
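As a rough sketch, here is the POSIX loop wrapped in a function, assuming the loop body is the same as in the bash/ksh/zsh version above. The name split_ext is made up for this example and is not part of the original answer.

```shell
# split_ext is a hypothetical helper name used only for illustration.
# It prints the extension of $1 under the rule "an extension begins with
# a letter, plus the 7z special case", using only POSIX sh constructs.
split_ext() {
    basename=$1
    fileext=
    while case $basename in
              ?*.*) case ${basename##*.} in
                        [A-Za-z]*|7z) true ;;
                        *) false ;;
                    esac ;;
              *) false ;;
          esac
    do
        fileext=${basename##*.}.$fileext
        basename=${basename%.*}
    done
    fileext=${fileext%.}    # drop the trailing dot left by the loop
    printf '%s\n' "$fileext"
}

split_ext file-1.0.tar.bz2   # prints tar.bz2
split_ext archive.7z         # prints 7z
split_ext README             # prints an empty line
```

Note that basename and fileext are global here, since plain POSIX sh has no local variables.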
Another possible specification is that some extensions denote encodings and indicate that further stripping is needed. Here's a bash/ksh/zsh implementation (requiring shopt -s extglob under bash and setopt ksh_glob under zsh):
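basename=$filename
fileext=
# First strip any number of trailing encoding extensions.
while [[ $basename = ?*.@(bz2|gz|lzma) ]]; do
    fileext=${basename##*.}.$fileext
    basename=${basename%.*}
done
# Then take one more extension, whatever it looks like.
if [[ $basename = ?*.* ]]; then
    fileext=${basename##*.}.$fileext
    basename=${basename%.*}
fi
fileext=${fileext%.}    # drop the trailing dot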
Note that this considers 0 to be an extension in file-1.0.gz.
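If extended globs are not available, the same "encoding extensions" rule can probably be written with case alone. This is only a sketch: the function name split_ext_enc is invented for the example, and the list of encodings is the same bz2, gz, lzma set as above.

```shell
# split_ext_enc is a hypothetical name. POSIX sh only, no extglob needed.
split_ext_enc() {
    basename=$1
    fileext=
    # Strip any number of trailing encoding extensions.
    while case $basename in
              ?*.bz2|?*.gz|?*.lzma) true ;;
              *) false ;;
          esac
    do
        fileext=${basename##*.}.$fileext
        basename=${basename%.*}
    done
    # Then take one more extension, whatever it is.
    case $basename in
        ?*.*)
            fileext=${basename##*.}.$fileext
            basename=${basename%.*}
            ;;
    esac
    fileext=${fileext%.}
    printf '%s\n' "$fileext"
}

split_ext_enc file-1.0.tar.bz2   # prints tar.bz2
split_ext_enc file-1.0.gz        # prints 0.gz
```

The second call shows the caveat from the note above: under this rule, 0 counts as an extension.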
¹ ${VARIABLE##PATTERN} and related constructs are in POSIX, so they work in any non-antique Bourne-style shell such as ash, bash, ksh or zsh.
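For reference, a small illustration of the four related expansions applied to the example name from above. The variable name filename is just the one used earlier; nothing here goes beyond POSIX parameter expansion.

```shell
filename=file-1.0.tar.bz2

# ## removes the longest prefix matching the pattern, # the shortest.
printf '%s\n' "${filename##*.}"   # prints bz2
printf '%s\n' "${filename#*.}"    # prints 0.tar.bz2

# % removes the shortest suffix matching the pattern, %% the longest.
printf '%s\n' "${filename%.*}"    # prints file-1.0.tar
printf '%s\n' "${filename%%.*}"   # prints file-1
```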
UNIX/Linux does not have the same early DOS / CP/M heritage that Windows does. So extensions are generally less significant to most UNIX utilities and tools.
I usually use a command-line only environment. Extensions in such an environment under Linux aren't really significant except as a convenience to the operator or user. (I don't have enough experience with KDE or GNOME to know how their file managers deal with extensions.)
But such convenience is usually important. If config.ini is really in Microsoft-standard ".ini" format, I'd let the extension stand. Plain old text files usually carry no extension in Linux, but this isn't universal for all programs' configuration files. The programmer usually gets to decide that.
I think ".txt" is useful under Linux if you want to emphasize that it's NOT a configuration file or other machine-readable document. However, in source distributions, the convention is to name such files in all caps without an extension (e.g. README, INSTALL, COPYING).
There are some standards and conventions but nothing stopping you from naming anything whatever you want, unless you are sharing things with others.
In Windows, naming a file .exe indicates to the shell (usually explorer.exe) that it's an executable file. UNIX builds this knowledge into the file system's permissions. If the proper x bits (see man chmod) are set, it is recognized as executable by shells and kernel functions (I believe). Beyond this, Linux doesn't care, most shells won't care, and most programs look in the file to find its "type."
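A quick way to see this for yourself, as a sketch: the script below is given a deliberately misleading .txt name (hello.txt is made up for the demo) and becomes runnable only once its x bit is set. It assumes mktemp is available and that the temporary directory is not mounted noexec.

```shell
dir=$(mktemp -d)
script=$dir/hello.txt            # misleading name on purpose
printf '#!/bin/sh\necho hello\n' > "$script"

# Freshly created files carry no x bits, so the shell refuses to treat
# the file as a command regardless of its name.
if [ -x "$script" ]; then before=yes; else before=no; fi

chmod +x "$script"
if [ -x "$script" ]; then after=yes; else after=no; fi

# The name still ends in .txt, but the file now runs.
output=$("$script")
printf 'executable before chmod: %s, after chmod: %s, output: %s\n' \
    "$before" "$after" "$output"
rm -r "$dir"
```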
Of course, there's the nice command file, which can analyze the file and tell you what it is with a degree of certainty. I believe if it can't match the data in the file with any known type, and if it contains only printable ASCII/Unicode characters, then it assumes it's a text file.
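For instance, here is a small sketch that hands file a plain-text file with a misleading name. The name notes.exe is invented for the demo, and the exact wording file prints varies between versions, so no particular output is promised here. The -b option (brief output) is assumed to be supported, as it is in the common implementation.

```shell
dir=$(mktemp -d)
printf 'just some plain words\n' > "$dir/notes.exe"   # misleading name on purpose

# file looks at the contents, not at the name.
if command -v file >/dev/null 2>&1; then
    kind=$(file -b "$dir/notes.exe")
else
    kind='file(1) is not installed'
fi
printf '%s\n' "$kind"
rm -r "$dir"
```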
@Bruce Ediger below is absolutely correct. There is nothing at the kernel or filesystem level, i.e. Linux itself, enforcing or caring that the contents of a file need to match up with its name, or with the program that is supposed to understand it. This doesn't mean it's not possible to create a shell or launcher utility to do things based on filename.
Best Answer
Originally, on unix systems, the extensions on file names were a matter of convention. They allowed a human being to choose the right program to open a file. The modern convention is to use extensions in most cases; common exceptions are:
- README, TODO. Sometimes there is an additional part that indicates a subcategory, e.g. INSTALL.linux, INSTALL.solaris.
- .bashrc, .profile, .emacs.
- Makefile.
(These are common cases, not hard-and-fast rules.)
Most binary file formats also contain some kind of header that describes properties of the file, and typically allows the file format to be identified through magic numbers. The file command looks at this information and shows you its guesses.
Sometimes the file extension gives more information than the file format, sometimes it's the other way round. For example, many file formats consist of a zip archive: Java libraries (.jar), OpenOffice documents (.odt, …), Microsoft Office documents (.docx, …), etc. Another example is source code files, where the extension indicates the programming language, which can be difficult for a computer to guess automatically from the file contents. Conversely, some extensions are wildly ambiguous, for example .o is used for compiled code files (object files), but inspection of the file contents usually easily reveals what machine type and operating system the object file is for.
An advantage of the extension is that it's a lot faster to recognize it than to open the file and look for magic sequences. For example, completion of file names in shells is almost always based on the name (mainly the extension), because reading every file in a large directory can take a long time whereas just reading the file names is fast enough for a Tab press.
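To make the contrast between the two approaches concrete, here is a minimal sketch that guesses "is this gzip data?" once from the name and once from the magic number. The helper names by_name and by_magic and the file name data.bin are made up for this example; the only fact relied on is that gzip data starts with the bytes 1f 8b.

```shell
dir=$(mktemp -d)
printf 'hello\n' | gzip -c > "$dir/data.bin"    # gzip data under a non-gzip name

# Name-based guess: cheap, it only looks at the string.
by_name() {
    case $1 in
        *.gz|*.tgz) echo gzip ;;
        *) echo unknown ;;
    esac
}

# Content-based guess: opens the file and compares its first two bytes
# with the gzip magic number 1f 8b.
by_magic() {
    magic=$(od -An -tx1 -N2 "$1" | tr -d ' \n')
    case $magic in
        1f8b) echo gzip ;;
        *) echo unknown ;;
    esac
}

by_name "$dir/data.bin"     # prints unknown
by_magic "$dir/data.bin"    # prints gzip
rm -r "$dir"
```

The name check never touches the disk contents, which is why it is the one that scales to completing names in a large directory.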
Sometimes changing a file's extension can allow you to say how a file is to be interpreted, when two file formats are almost, but not wholly identical. For example, a web server might treat .shtml and .html differently, the former undergoing some server-side preprocessing, the latter being served as-is.
In the case of gzip archives, gzip won't recompress files whose name ends in .gz, .tgz and a few other extensions. That way you can run gzip * to compress every file in a directory, and already compressed files are not modified.
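A small demonstration of that behaviour, as a sketch assuming GNU gzip (other implementations may differ). The directory and the names a.txt and b.gz are made up for the demo.

```shell
dir=$(mktemp -d)
printf 'hello\n' > "$dir/a.txt"
printf 'hello\n' | gzip -c > "$dir/b.gz"
before=$(cksum < "$dir/b.gz")

# gzip warns about b.gz and returns a non-zero status, hence the || true.
gzip "$dir"/* 2>/dev/null || true

after=$(cksum < "$dir/b.gz")
listing=$(ls "$dir")
printf '%s\n' "$listing"
[ "$before" = "$after" ] && echo "b.gz was left alone"
rm -r "$dir"
```

After the run, a.txt has become a.txt.gz while b.gz keeps both its name and its checksum.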