Shell – Handling NULL characters in shell

binary shell

Is there a portable way of handling NULL characters in shell?

A typical example would be splitting the output of find ... -print0 with shell (and shell only) either in a pipe or in a command substitution result. By portable I mean ideally something that shells not as powerful as e.g. bash or zsh wouldn't choke on. Is this possible in a "bare POSIX shell" (any POSIX version)?

Best Answer

POSIX does not expect the standard utilities to handle text containing null characters. The -print0 option you use with find is itself a GNU extension unsupported by POSIX.

One way to deal with a stream of data containing nulls in a POSIX shell script is to first convert it to real text with od and then process that text instead.
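As a rough illustration of the od approach, here is a minimal sketch. The function name nul_each is hypothetical, and the code assumes that od prints space-separated three-digit octal bytes for -An -v -to1 and that printf accepts \ooo escapes in its format string. It reads NUL-delimited records on standard input and runs a command once per record.

```shell
# nul_each is a hypothetical helper name, not a standard utility.
# It reads NUL-delimited records on stdin and runs "$@" once per record,
# passing the record as the last argument.
nul_each() {
    # od -An -v -to1 prints every input byte as a three-digit octal number;
    # tr then puts each number on its own line.
    od -An -v -to1 | tr -s ' ' '\n' | {
        rec=
        while read -r byte; do
            case $byte in
                '') ;;                       # blank line produced by tr
                000)                         # NUL byte: the record is complete
                    # Rebuild the record from its \ooo escapes. The trailing
                    # "." keeps $(...) from stripping trailing newlines.
                    val=$(printf "$rec."); val=${val%.}
                    # Keep the command from reading the od output.
                    "$@" "$val" </dev/null
                    rec= ;;
                *)  rec="$rec\\$byte" ;;     # append the byte as an octal escape
            esac
        done
    }
}
```

With a find that supports -print0, something like find . -print0 | nul_each printf '%s\n' would then print one name per call. The loop handles one byte per iteration, so it is slow on large inputs, the command runs in a subshell, and data after the last NUL is ignored.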

In any case, if you have GNU find, you likely have other GNU utilities that don't have that limitation in the first place.

Related Solutions

Why Not to Parse ls Command and What to Use Instead

I am not at all convinced of this, but let's suppose for the sake of argument that you could, if you're prepared to put in enough effort, parse the output of ls reliably, even in the face of an "adversary" — someone who knows the code you wrote and is deliberately choosing filenames designed to break it.

Even if you could do that, it would still be a bad idea.

Bourne shell is not a good language. It should not be used for anything complicated, unless extreme portability is more important than any other factor (e.g. autoconf).

I claim that if you're faced with a problem where parsing the output of ls seems like the path of least resistance for a shell script, that's a strong indication that whatever you are doing is too complicated for shell and you should rewrite the entire thing in Perl or Python. Here's your last program in Python:

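import os
import sys

# Walk the tree rooted at the current directory.
for subdir, dirs, files in os.walk("."):
    for f in dirs + files:
        # lstat() reports on the entry itself and does not follow symlinks.
        ino = os.lstat(os.path.join(subdir, f)).st_ino
        sys.stdout.write("%d %s %s\n" % (ino, subdir, f))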

This has no issues whatsoever with unusual characters in filenames -- the output is ambiguous in the same way the output of ls is ambiguous, but that wouldn't matter in a "real" program (as opposed to a demo like this), which would use the result of os.path.join(subdir, f) directly.

Equally important, and in stark contrast to the thing you wrote, it will still make sense six months from now, and it will be easy to modify when you need it to do something slightly different. By way of illustration, suppose you discover a need to exclude dotfiles and editor backups, and to process everything in alphabetical order by basename:

import os
import sys

filelist = []
for subdir, dirs, files in os.walk("."):
    for f in dirs + files:
        # Skip dotfiles and editor backups (names ending in "~").
        if f.startswith('.') or f.endswith('~'):
            continue
        lstat = os.lstat(os.path.join(subdir, f))
        filelist.append((f, subdir, lstat.st_ino))

# Sort alphabetically by basename, the first element of each tuple.
filelist.sort(key=lambda x: x[0])
for f, subdir, ino in filelist:
    sys.stdout.write("%d %s %s\n" % (ino, subdir, f))

Bash Shell Zsh Function – Valid Function Name Characters in Shell

Since POSIX allows it as an extension, nothing prevents an implementation from behaving that way.

A simple check (run in zsh):

$ for shell in /bin/*sh 'busybox sh'; do
    printf '[%s]\n' $shell
    $=shell -c 'á() { :; }'
  done
[/bin/ash]
/bin/ash: 1: Syntax error: Bad function name
[/bin/bash]
[/bin/dash]
/bin/dash: 1: Syntax error: Bad function name
[/bin/ksh]
[/bin/lksh]
[/bin/mksh]
[/bin/pdksh]
[/bin/posh]
/bin/posh: á: invalid function name
[/bin/yash]
[/bin/zsh]
[busybox sh]
sh: syntax error: bad function name

This shows that bash, zsh, yash, ksh93 (which ksh is linked to on my system), pdksh and its derivatives allow multibyte characters in function names.

yash was designed to support multibyte characters from the beginning, so it is no surprise that it worked.
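The check above relies on zsh's $=shell expansion to split 'busybox sh' into two words. A rough POSIX sh equivalent might look like the sketch below. The helper name accepts_function_name is hypothetical, and the result depends on which shells are installed and on the current locale.

```shell
# accepts_function_name is a hypothetical helper name.
# Usage: accepts_function_name SHELL NAME
# Succeeds if SHELL parses a definition of a function called NAME.
accepts_function_name() {
    # $1 is deliberately unquoted so that a value such as "busybox sh"
    # is split into a command and its argument.
    $1 -c "$2() { :; }" 2>/dev/null
}
```

For example, accepts_function_name 'busybox sh' á || echo rejected reports whether that shell accepts the name, without printing the shell's own error message.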

Another piece of documentation you can refer to is the ksh93 manual:

A blank is a tab or a space. An identifier is a sequence of letters, digits, or underscores starting with a letter or underscore. Identifiers are used as components of variable names. A vname is a sequence of one or more identifiers separated by a . and optionally preceded by a .. Vnames are used as function and variable names. A word is a sequence of characters from the character set defined by the current locale, excluding non-quoted metacharacters.

So setting the locale to C:

$ export LC_ALL=C
$ á() { echo 1; }
ksh: á: invalid function name

makes it fail.
