Short answer: the restrictions are imposed in the Unix/Linux/BSD kernel, in the namei() function. Encoding is handled by user-level programs like xterm, firefox or ls.
I think you're starting from incorrect premises. A file name in Unix is a string of bytes with arbitrary values. Only two values, 0x00 (ASCII NUL) and 0x2f (ASCII '/'), are disallowed, and not just as part of a multi-byte character encoding: they may not appear at all. A byte can hold a number representing a character (in ASCII and some other encodings), but a character can require more than one byte (for example, code points above 0x7f in the UTF-8 representation of Unicode).
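To see this concretely, here is a small demonstration (assuming a userland with od available) that creates a file whose name contains a raw 0xE9 byte, which is not a valid UTF-8 sequence on its own; the kernel accepts it anyway:

```shell
# Work in a throwaway directory so the odd filename doesn't linger.
dir=$(mktemp -d)
cd "$dir"

# \351 is octal for 0xE9: a raw byte, not valid UTF-8 by itself.
# The kernel accepts it; any byte except 0x00 and 0x2f is legal in a name.
touch "$(printf 'caf\351')"

# od shows the raw byte; ls in a UTF-8 locale would print a replacement glyph.
ls | od -c

cd / && rm -rf "$dir"
```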
These restrictions arise from file name printing conventions and the ASCII character set. The original Unixes used ASCII '/' (numerically 0x2f) bytes to separate the pieces of a partially or fully qualified path (for example, '/usr/bin/cat' has the pieces "usr", "bin" and "cat"), and used ASCII NUL to terminate strings. Other than those two values, bytes in file names may take any value. You can see an echo of this in the UTF-8 encoding of Unicode: printable ASCII characters, including '/', take only one byte in UTF-8, and the UTF-8 encoding of any code point above 0x7f contains no zero-valued byte and no 0x2f byte; only the NUL control character itself encodes to 0x00. UTF-8 was invented for Plan 9, the Pretender to the Throne of Unix.
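You can verify that byte layout yourself. As a sketch, U+00E9 ("é") encodes to two bytes, neither of which could be mistaken for '/' or NUL:

```shell
# Dump the UTF-8 encoding of U+00E9 ("é") as hex bytes.
# \303\251 is the octal spelling of the two bytes 0xC3 0xA9.
printf '\303\251' | od -An -tx1
# → c3 a9
# Both bytes are >= 0x80, so they never collide with 0x2f or 0x00.
```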
Older Unixes (and it looks like Linux too) had a namei() function that just looks at a path one byte at a time, breaking it into pieces at 0x2f-valued bytes and stopping at a zero-valued byte. namei() is part of the Unix/Linux/BSD kernel, so that's where the exceptional byte values get enforced.
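A rough user-space sketch of that splitting, in bash (the real namei() is C code inside the kernel, so this is only an illustration of the byte-level idea):

```shell
# Split a path into its pieces at '/' (0x2f) bytes, the way namei() does.
path=/usr/bin/cat
IFS=/ read -r -a parts <<< "$path"

# parts[0] is the empty string before the leading '/'; the rest are the pieces.
for p in "${parts[@]}"; do
    [ -n "$p" ] && printf '%s\n' "$p"
done
# prints: usr, bin, cat (one per line)
```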
Notice that so far I've talked about byte values, not characters. namei() does not enforce any character semantics on the bytes. That's up to user-level programs, like ls, which might sort file names based on byte values or on character values. xterm decides which pixels to light up for file names based on the character encoding. If you don't tell xterm you've got UTF-8-encoded filenames, you'll see a lot of gibberish when you invoke it. If vim isn't compiled to detect UTF-8 (or UTF-16, UTF-32, and so on) encodings, you'll see a lot of gibberish when you open a "text file" containing UTF-8-encoded characters.
You can install the Perl script rename. Then try this:

$ rename -n 's/[A-Z]/lc($&)/ge; s/\s/_/g' files*

(remove the -n switch once your tests look OK)
There are two utilities called rename. The one in Fedora can't do this; some other distributions ship the Perl one by default. If you run the following command (on a GNU system)

$ file "$(readlink -f "$(type -p rename)")"

and you get a result like

.../rename: Perl script, ASCII text executable

that does not contain "ELF", then this is the right tool =)
If not, such as on Fedora, install it manually.
Last but not least, this tool was originally written by Larry Wall, Perl's dad.
Best Answer
Spaces, and indeed every character except / and NUL, are allowed in filenames. The recommendation not to use spaces in filenames comes from the danger that they might be misinterpreted by software that supports them poorly. Arguably, such software is buggy. But also arguably, programming languages like shell make it all too easy to write software that breaks when presented with filenames containing spaces, and these bugs tend to slip through because developers rarely test their shell scripts with such filenames.

Replacing spaces with %20 is not often seen in filenames; that is mostly used in (web) URLs. Though it's true that %-encoding from URLs sometimes makes its way into filenames, often by accident.
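Both points are easy to demonstrate in bash (the filenames here are invented for illustration):

```shell
dir=$(mktemp -d); cd "$dir"

# The classic spaces bug: an unquoted expansion is word-split,
# so ls receives two arguments, "my" and "file.txt", and fails on both.
f='my file.txt'
touch "$f"
ls "$f"                     # correct: quoted, one argument
ls $f 2>/dev/null || true   # buggy: unquoted, word-split into two names

# Undoing accidental %-encoding (bash-only trick; assumes every % starts
# a valid %XX sequence): turn each % into \x, then let printf %b decode it.
n='report%20final.txt'
printf '%b\n' "${n//\%/\\x}"    # → report final.txt

cd / && rm -rf "$dir"
```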