According to the bash manpage, the LC_COLLATE environment variable affects character ranges, exactly as per Hauke Laging's answer:
LC_COLLATE
This variable determines the collation order used when sorting the results of pathname expansion, and determines the behavior of
range expressions, equivalence classes, and collating sequences within pathname expansion and pattern matching.
On the other hand, LC_CTYPE affects character classes:
LC_CTYPE
This variable determines the interpretation of characters and the behavior of character classes within pathname expansion and pattern matching.
What this means is that both cases are potentially problematic if you're thinking in an English, left-to-right, Latin-alphabet, Arabic-digit context.
If you're really proper, and/or are scripting for a multi-locale environment, it's probably best to make sure you know what your locale variables are when you're matching files, or to be sure that you're coding in a completely generic way.
It's very difficult to foresee some situations though, unless you've studied linguistics.
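When only ASCII letters matter, one way to avoid guessing what the locale does is to pin it to C for the duration of the match. The following is a minimal sketch under that assumption, not a general recipe; the function name list_ascii_lower and its directory argument are made up for illustration.

```bash
# Hypothetical helper: list entries of directory "$1" whose names start
# with an ASCII lowercase letter, regardless of the caller's locale.
list_ascii_lower() {
    (
        # The subshell keeps the locale change from leaking to the caller.
        LC_ALL=C
        export LC_ALL
        cd -- "$1" || exit
        # In the C locale, [a-z] covers only the 26 ASCII lowercase letters.
        for f in [a-z]*; do
            # An unmatched pattern is left as-is, so check the entry exists.
            if [ -e "$f" ]; then
                printf '%s\n' "$f"
            fi
        done
    )
}
```

Calling list_ascii_lower /tmp/test would then skip names that start with an uppercase or accented letter.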
That said, I don't know of a Latin-using locale that changes the order of the basic letters, so [a-z] would work for those. There are extensions to the Latin alphabet that collate ligatures and diacriticals differently, though. Here's a little experiment:
mkdir /tmp/test
cd /tmp/test
export LC_CTYPE=de_DE.UTF-8
export LC_COLLATE=de_DE.UTF-8
touch Grüßen
ls G* # This says ‘Grüßen’
ls *[a-z]en # This says nothing!
ls *[a-zß]en # This says ‘Grüßen’
ls Gr[a-z]*en # This says nothing!
This is interesting: at least for German, neither diacriticals like ü nor ligatures like ß are folded into Latin characters. (Either that, or I messed up the locale change!)
Of course, this may be bad for you if you're trying to find filenames that start with a letter, use [a-z]*, and apply it to a file that starts with ‘Ä’.
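Since character classes are governed by LC_CTYPE rather than by collation order, a class such as [[:alpha:]] is one candidate for "starts with a letter". The sketch below is only an illustration: the function name list_alpha_start is hypothetical, and whether a name starting with ‘Ä’ matches depends on LC_CTYPE naming a locale that classifies it as alphabetic and on the filename's encoding matching that locale.

```bash
# Hypothetical helper: list entries of directory "$1" whose names start
# with a character the current locale considers alphabetic.
list_alpha_start() {
    (
        cd -- "$1" || exit
        # [[:alpha:]] is a character class, so it follows LC_CTYPE,
        # unlike the range [a-z], which follows LC_COLLATE.
        for f in [[:alpha:]]*; do
            if [ -e "$f" ]; then
                printf '%s\n' "$f"
            fi
        done
    )
}
```

For plain ASCII names, this matches both upper- and lowercase initials and skips names that start with a digit or underscore.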
You're assigning files as a scalar variable instead of an array variable.
In
files=$HOME/print/*.pdf
you're assigning some string like /home/highsciguy/print/*.pdf to the $files scalar (aka string) variable.
Use:
files=(~/print/*.pdf)
or
files=("$HOME"/print/*.pdf)
instead. The shell will expand that globbing pattern into a list of file paths, and assign each of them to elements of the $files array.
The expansion of the glob is done at the time of the assignment.
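To see the array form in action, here is a small bash sketch. The function name print_pdfs and its directory argument are hypothetical, and the nullglob handling is an addition of this sketch (so that an unmatched pattern yields an empty array), not something the advice above requires.

```bash
# Hypothetical helper (bash only): collect the *.pdf files of directory
# "$1" in an array, then report the count and each path.
print_pdfs() {
    local dir=$1
    local files file
    shopt -s nullglob        # unmatched glob expands to nothing
    files=("$dir"/*.pdf)     # the glob is expanded here, at assignment time
    shopt -u nullglob        # note: this turns nullglob off unconditionally
    printf '%s file(s)\n' "${#files[@]}"
    # Quoting "${files[@]}" keeps each path intact, spaces included.
    for file in "${files[@]}"; do
        printf '%s\n' "$file"
    done
}
```

Files created after the assignment will not appear in the array, since nothing is re-expanded when the array is later used.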
You don't have to use non-standard sh features here: you could use your system's sh instead of bash by writing it as:
#!/bin/sh -
[ "$#" -gt 0 ] || set -- ~/print/*.pdf
for file do
ls -d -- "$file"
done
Here, set assigns to the "$@" array of positional parameters.
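One thing to keep in mind with the plain sh approach is that a pattern matching nothing is left as a literal string. The sketch below wraps the same idea in a function and skips entries that don't exist; the name list_args_or_pdfs and the directory argument are hypothetical.

```bash
# Hypothetical helper: the first argument is a directory, the rest are
# optional file paths. With no paths given, default to the directory's PDFs.
list_args_or_pdfs() {
    dir=$1
    shift
    # set -- replaces the positional parameters with the glob's expansion.
    [ "$#" -gt 0 ] || set -- "$dir"/*.pdf
    for file do
        # If nothing matched, "$file" is the literal pattern; skip it.
        [ -e "$file" ] || continue
        printf '%s\n' "$file"
    done
}
```

Inside a function, set -- only changes that function's own positional parameters, so the caller's "$@" is untouched.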
Another approach could have been to store the globbing pattern in a scalar variable:
files=$HOME/print/*.pdf
and have the shell expand the glob at the time the $files variable is expanded:
IFS= # disable word splitting
for file in $files; do ...
Here, because $files is left unquoted (something you shouldn't usually do), its expansion is subject to word splitting (which we've disabled here) and globbing/filename generation.
So the *.pdf will be expanded to the list of matching files. However, if $HOME contained wildcard characters, they could be expanded too, which is why it's still preferable to use an array variable.
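The scalar-pattern approach can be tried out with a sketch like the following. The function name expand_pattern is hypothetical, and the subshell is there only so that the IFS change stays local.

```bash
# Hypothetical helper: expand the glob pattern stored in "$1" and print
# one resulting path per line.
expand_pattern() {
    (
        pattern=$1
        IFS=               # disable word splitting; globbing still applies
        # The unquoted expansion is deliberate: it triggers filename generation.
        set -- $pattern
        printf '%s\n' "$@"
    )
}
```

For example, expand_pattern "$HOME/print/*.pdf" prints the matching paths, and a directory name containing spaces survives because splitting is disabled. Wildcard characters anywhere in the stored string are expanded as well, which is the caveat mentioned above.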
sort "$f1"
fails for values of $f1 that start with - or, here in the case of sort, some that start with + (which can have severe consequences for a file called -o/etc/passwd, for instance).
sort -- "$f1"
(where -- signals the end of options) addresses most of those issues but still fails for the file called - (which sort interprets as meaning its stdin instead).
sort < "$f1"
doesn't have those issues.
Here, it's the shell that opens the file. It also means that if the file can't be opened, you'll also get a potentially more useful error message (for instance, most shells will indicate the line number in the script), and the error message will be consistent if you use redirections wherever possible to open files.
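As a rough check of the point about option-like names, the sketch below passes the file through a redirection, so sort never sees the name at all. The helper name sort_file is hypothetical.

```bash
# Hypothetical helper: sort the contents of the file named by "$1".
sort_file() {
    # The shell opens "$1" on stdin. If that fails, the shell reports
    # the error itself and sort is not started.
    sort < "$1"
}
```

In a directory containing files literally named -r and -, sort_file -r and sort_file - sort the contents of those files instead of treating the names as an option or as "read stdin".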
And in
sort < "$f1" > out
(contrary to sort -- "$f1" > out), if "$f1" can't be opened, out won't be created/truncated and sort won't even be run.
To clear up some possible confusion (following comments below), that does not prevent the command from mmap()ing the file or lseek()ing inside it (not that sort does either), provided the file itself is seekable. The only difference is that the file is opened earlier, and on file descriptor 0, by the shell, as opposed to later by the command, possibly on a different file descriptor. The command can still seek/mmap that fd 0 as it pleases. That is not to be confused with cat file | cmd, where this time cmd's stdin is a pipe that cannot be mmaped/seeked.
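The difference between a redirected file and a pipe on stdin can be observed from within the shell. This is a sketch that assumes /dev/stdin can be tested (bash special-cases that path in test expressions, and Linux provides it as a real path); the function name stdin_kind is hypothetical.

```bash
# Hypothetical helper: report what kind of object is attached to stdin.
stdin_kind() {
    if [ -p /dev/stdin ]; then
        echo pipe            # e.g. cat file | stdin_kind
    elif [ -f /dev/stdin ]; then
        echo 'regular file'  # e.g. stdin_kind < file
    else
        echo other           # terminal, device, socket, ...
    fi
}
```

With stdin_kind < file the function sees a regular file, which a command could seek in or mmap; with cat file | stdin_kind it sees a pipe.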