I am not at all convinced of this, but let's suppose for the sake of argument that you could, if you're prepared to put in enough effort, parse the output of ls reliably, even in the face of an "adversary": someone who knows the code you wrote and is deliberately choosing filenames designed to break it.
Even if you could do that, it would still be a bad idea.
Bourne shell is not a good language. It should not be used for anything complicated, unless extreme portability is more important than any other factor (e.g. autoconf).
I claim that if you're faced with a problem where parsing the output of ls seems like the path of least resistance for a shell script, that's a strong indication that whatever you are doing is too complicated for shell and you should rewrite the entire thing in Perl or Python. Here's your last program in Python:
    import os, sys
    for subdir, dirs, files in os.walk("."):
        for f in dirs + files:
            ino = os.lstat(os.path.join(subdir, f)).st_ino
            sys.stdout.write("%d %s %s\n" % (ino, subdir, f))
This has no issues whatsoever with unusual characters in filenames. The output is ambiguous in the same way the output of ls is ambiguous, but that wouldn't matter in a "real" program (as opposed to a demo like this), which would use the result of os.path.join(subdir, f) directly.
Equally important, and in stark contrast to the thing you wrote, it will still make sense six months from now, and it will be easy to modify when you need it to do something slightly different. By way of illustration, suppose you discover a need to exclude dotfiles and editor backups, and to process everything in alphabetical order by basename:
    import os, sys
    filelist = []
    for subdir, dirs, files in os.walk("."):
        for f in dirs + files:
            if f[0] == '.' or f[-1] == '~': continue
            lstat = os.lstat(os.path.join(subdir, f))
            filelist.append((f, subdir, lstat.st_ino))
    filelist.sort(key = lambda x: x[0])
    for f, subdir, ino in filelist:
        sys.stdout.write("%d %s %s\n" % (ino, subdir, f))
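To illustrate the earlier point that a "real" program would use the result of os.path.join(subdir, f) directly instead of printing it, here is a minimal sketch. The function name list_entries and its top parameter are made up for this example; the point is only that each path stays a single string from start to finish, so nothing ever has to be parsed back out of text.

```python
import os

def list_entries(top="."):
    """Return (inode, path) pairs for everything below top.

    Hypothetical helper for illustration: each path is kept as one
    string, so spaces, newlines and other odd characters in filenames
    never need quoting or re-parsing.
    """
    entries = []
    for subdir, dirs, files in os.walk(top):
        for name in dirs + files:
            path = os.path.join(subdir, name)
            # lstat does not follow symlinks, matching the programs above
            entries.append((os.lstat(path).st_ino, path))
    return entries
```

A caller can then hand each path straight to os.rename, open, or whatever it needs, with no intermediate text format.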
Since the POSIX documentation allows it as an extension, nothing prevents an implementation from behaving that way.
A simple check (run in zsh):
    $ for shell in /bin/*sh 'busybox sh'; do
        printf '[%s]\n' $shell
        $=shell -c 'á() { :; }'
      done
    [/bin/ash]
    /bin/ash: 1: Syntax error: Bad function name
    [/bin/bash]
    [/bin/dash]
    /bin/dash: 1: Syntax error: Bad function name
    [/bin/ksh]
    [/bin/lksh]
    [/bin/mksh]
    [/bin/pdksh]
    [/bin/posh]
    /bin/posh: á: invalid function name
    [/bin/yash]
    [/bin/zsh]
    [busybox sh]
    sh: syntax error: bad function name
shows that bash, zsh, yash, ksh93 (which ksh is linked to on my system), pdksh and its derivatives allow multibyte characters in function names.
yash was designed to support multibyte characters from the beginning, so it's no surprise that it worked.
The other documentation you can refer to is that of ksh93:
    A blank is a tab or a space. An identifier is a sequence of letters, digits, or underscores starting with a letter or underscore. Identifiers are used as components of variable names. A vname is a sequence of one or more identifiers separated by a . and optionally preceded by a .. Vnames are used as function and variable names. A word is a sequence of characters from the character set defined by the current locale, excluding non-quoted metacharacters.
So setting the locale to C:
    $ export LC_ALL=C
    $ á() { echo 1; }
    ksh: á: invalid function name
makes it fail.
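If you want to repeat this kind of check from a script instead of an interactive zsh session, something along these lines might do. This is only a sketch: accepts_function_name and its parameters are names invented here, and it assumes that a shell which rejects the definition exits with a nonzero status, which matches the error messages above but is not guaranteed for every shell.

```python
import os
import subprocess

def accepts_function_name(shell_argv, name, locale=None):
    """Return True if the shell exits 0 after defining a function called name.

    shell_argv is the command as a list, e.g. ["busybox", "sh"].
    locale, if given, is exported as LC_ALL for the child process only.
    Assumption: a rejected definition produces a nonzero exit status.
    """
    env = dict(os.environ)
    if locale is not None:
        env["LC_ALL"] = locale
    script = "%s() { :; }" % name
    result = subprocess.run(
        list(shell_argv) + ["-c", script],
        env=env,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0
```

For example, accepts_function_name(["ksh"], "á") and accepts_function_name(["ksh"], "á", locale="C") would compare the two ksh93 cases above. The call raises FileNotFoundError if the shell is not installed, so the results depend entirely on which shells your system has.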
POSIX doesn't envision the standard utilities dealing with text that contains embedded null characters. The -print0 option you use with find is itself a GNU extension unsupported by POSIX.
One way to deal with a stream of data containing nulls in POSIX shell scripting would be to first convert it to real text with od and process that text instead.
In any case, if you have GNU find, you likely have other GNU utilities that don't have that limitation in the first place.
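The od idea is easier to see when written out. The sketch below mimics it in Python rather than shell: every byte becomes a three-digit octal token (roughly what od -An -v -to1 prints, minus its line layout), and the records are then recovered by splitting on the 000 tokens. The function names are invented for this illustration and it is not a drop-in replacement for any particular pipeline.

```python
def to_octal_text(data):
    """Render each byte as a 3-digit octal token separated by spaces.

    The result is ordinary text with no NUL bytes in it, which is the
    property the od conversion is meant to provide.
    """
    return " ".join("%03o" % b for b in data)

def split_records(octal_text):
    """Split octal text on 000 tokens (NUL) and rebuild each record as bytes."""
    records, current = [], []
    for token in octal_text.split():
        if token == "000":
            records.append(bytes(current))
            current = []
        else:
            current.append(int(token, 8))
    if current:
        # trailing data without a terminating NUL
        records.append(bytes(current))
    return records
```

Because the intermediate form is plain text, newlines and spaces inside a filename survive as 012 and 040 tokens instead of being mistaken for separators.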