Most shells used in modern UNIX environments are meant to conform to the POSIX sh specification. POSIX sh is derived from the original Korn shell (ksh88), which is in turn derived from the earlier Bourne shell, but POSIX sh only specifies a small subset of even ksh88's functionality. A shell that only implements the minimum requirement is missing many features required for writing all but the most trivial of scripts in a safe and reasonable manner. For example, local variables and arrays are non-standard extras.
Therefore, the first reason is to extend the shell with extra features. Different shells choose to focus on different things: for example, Zsh focuses on advanced interactive features, while ksh93 (the current "original" Korn shell) focuses on powerful programming features and performance. Even very minimal shells like Dash add at least a few non-standard extras, such as local variables.
Extra features are rarely widely interoperable, if at all. Most of the ksh88 feature set, such as the extended globbing syntax, is fairly well supported across shells, but beyond that there are no guarantees, and you must really know what you're doing to use non-standard features in a portable way.
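As a concrete illustration, here's a sketch (in bash, with made-up variable and function names) of two such non-standard extras mentioned above, arrays and local variables:

```shell
#!/usr/bin/env bash
# Arrays are a non-POSIX extension; a minimal POSIX sh has no array syntax.
fruits=(apple banana cherry)
echo "${#fruits[@]} fruits, second is ${fruits[1]}"   # prints: 3 fruits, second is banana

# 'local' is another widely implemented but non-standard extra.
count_args() {
    local n=$#    # n is confined to this function in shells that support 'local'
    echo "$n"
}
count_args a b c    # prints: 3
```

Both snippets work in bash, ksh93 (with typeset instead of local), and Zsh, but a strict POSIX sh is not required to accept either.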
The second reason is legacy. There are still a lot of proprietary Unixes out there that use ancient non-standard implementations for their /bin/sh. Until recently, Solaris still used Bourne as its default and chose to maintain the Heirloom shell rather than upgrade to something modern. These systems usually come with different shells you can switch to, for instance by changing your PATH variable or altering shebangs within individual scripts.
So, to summarize, there are multiple shells, often installed by default, for several reasons:
- For extra features, especially for dealing with non-portable extras.
- To handle legacy scripts which are often unmaintained.
- Size and performance. Embedded systems often require small shells like mksh or BusyBox sh.
- Licensing reasons. AT&T ksh was proprietary software until around 2000 or so. This is largely what gave rise to all the ksh-like clones such as Zsh and Bash.
- Other historical reasons. Though not very popular today, there have been radical attempts at redesigning the language, such as scsh and es. The process substitution feature of many shells originally comes from rc (with slightly different syntax), and brace expansion from csh. Different shells have different combinations of such features available, usually with some subtle or not-so-subtle differences.
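To make that last point concrete, here's a bash sketch of the two borrowed features mentioned, brace expansion (from csh) and process substitution (descended from an rc feature, with different syntax):

```shell
#!/usr/bin/env bash
# Brace expansion, originally from csh:
echo file.{txt,log}    # prints: file.txt file.log

# Process substitution: feed two command outputs to comm as if they
# were files, without creating temporary files.
comm -12 <(printf 'a\nb\n') <(printf 'b\nc\n')    # prints: b (the common line)
```

Neither construct is POSIX: Dash, for instance, passes file.{txt,log} through literally and rejects <(...) as a syntax error.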
According to the bash manpage, the LC_COLLATE environment variable affects character ranges, exactly as per Hauke Laging's answer:

    LC_COLLATE
        This variable determines the collation order used when sorting the results of pathname expansion, and determines the behavior of range expressions, equivalence classes, and collating sequences within pathname expansion and pattern matching.

On the other hand, LC_CTYPE affects character classes:

    LC_CTYPE
        This variable determines the interpretation of characters and the behavior of character classes within pathname expansion and pattern matching.
What this means is that both cases are potentially problematic if you're thinking in an English, left-to-right, Latin-alphabet, Arabic-digit context.
If you're really proper, and/or are scripting for a multi-locale environment, it's probably best to make sure you know what your locale variables are when you're matching files, or to be sure that you're coding in a completely generic way.
It's very difficult to foresee some situations though, unless you've studied linguistics.
However, I don't know of a Latin-using locale that changes the order of letters, so [a-z] would work. There are extensions to the Latin alphabet that collate ligatures and diacriticals differently, though. In any case, here's a little experiment:
mkdir /tmp/test
cd /tmp/test
export LC_CTYPE=de_DE.UTF-8
export LC_COLLATE=de_DE.UTF-8
touch Grüßen
ls G* # This says ‘Grüßen’
ls *[a-z]en # This says nothing!
ls *[a-zß]en # This says ‘Grüßen’
ls Gr[a-z]*en # This says nothing!
This is interesting: at least for German, neither diacriticals like ü nor ligatures like ß are folded into plain Latin characters for range matching. (Either that, or I messed up the locale change!)
This may be bad for you, of course, if you use [a-z]* to find filenames that start with a letter and apply it to a file whose name starts with ‘Ä’.
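If you need matching that behaves the same regardless of locale, two common defensive options are sketched below (the filename is the one from the experiment above):

```shell
#!/usr/bin/env bash
# Option 1: POSIX character classes follow LC_CTYPE, so they also catch
# letters outside a-z instead of silently missing them.
case "Grüßen" in
    [[:alpha:]]*) echo "starts with a letter" ;;    # prints: starts with a letter
esac

# Option 2: pin the locale to C in a subshell, so [a-z] means exactly a..z:
( export LC_ALL=C
  case "b" in
      [a-z]) echo "plain ASCII lower-case" ;;       # prints: plain ASCII lower-case
  esac )
```

Which option is right depends on intent: use [[:alpha:]] when you mean "any letter in the user's locale", and LC_ALL=C when you mean "exactly the ASCII range".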
This is not a very good explanation. A token is a sequence of characters that forms a word or a punctuation mark. Characters like < and | are part of tokens too. You may call them metacharacters, but this is not useful terminology. The basic rules are: a token is a maximal sequence of characters that are either all ordinary characters or all operator characters from the set ()<>&|;, but not both. For example, foo<@a&>b consists of the tokens foo (ordinary), < (operator), @a (ordinary), &> (operator) and b (ordinary). Then there are additional rules about quoting: special characters lose their meaning if they're quoted, with different rules depending on the type of quote. For example, foo'&&'bar\|qux is a single token with the character sequence foo&&bar|qux.
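You can check the quoting behavior directly. Here's a sketch using printf to show that the shell delivers the quoted string as one single word:

```shell
#!/usr/bin/env bash
# The quoted && and the backslashed | lose their operator meaning,
# so the shell passes a single argument to printf:
printf '<%s>\n' foo'&&'bar\|qux    # prints: <foo&&bar|qux>
```

Unquoted, the same characters would instead be operator tokens, splitting the line into separate commands and a pipeline.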