I have reformulated your questions a bit, for reasons that should
appear evident when you read them in sequence.
1. Is it possible to configure a Linux filesystem to use a fixed character encoding to store file names, regardless of the LANG/LC_ALL environment?
No, this is not possible: as you mention in your question, a UNIX file
name is just a sequence of bytes; the kernel knows nothing about
the encoding, which is entirely a user-space (i.e., application-level)
concept.
In other words, the kernel knows nothing about LANG/LC_*, so it cannot
translate.
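To see the "just a sequence of bytes" point in practice, here is a small sketch that stores the same two characters under two different encodings. It assumes a typical Linux filesystem such as ext4 or tmpfs (some filesystems do enforce an encoding and would reject the first name); the scratch directory and variable names are only for illustration.

```shell
# Work in a scratch directory so nothing real is touched.
dir=$(mktemp -d)

# The same two characters as raw bytes in two encodings (octal escapes).
gbk_name=$(printf '\326\320\316\304')            # GBK bytes:   d6 d0 ce c4
utf8_name=$(printf '\344\270\255\346\226\207')   # UTF-8 bytes: e4 b8 ad e6 96 87

# The kernel stores both byte strings as given: two unrelated entries.
touch "$dir/$gbk_name" "$dir/$utf8_name"

# Count the entries, then dump the raw bytes of the stored names.
count=$(ls "$dir" | wc -l | tr -d ' ')
echo "entries: $count"
ls "$dir" | od -An -tx1

rm -rf "$dir"
```

Whichever locale is active when you run this, the bytes shown by od stay the same; only the way a terminal renders them changes.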
2. Is it possible to let different file names refer to the same file?
You can have multiple directory entries referring to the same file;
you can create them with hard links or symbolic links.
Be aware, however, that the file names that are not valid in the
current encoding (e.g., your GBK character string when you're working
in a UTF-8 locale) will display badly, if at all.
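As a rough illustration, the sketch below gives one file a UTF-8 name, a GBK-encoded hard link, and a symbolic link. The names (including "alias") are made up for the example, and it assumes a filesystem that accepts arbitrary bytes in names.

```shell
dir=$(mktemp -d)
utf8_name=$(printf '\344\270\255\346\226\207')   # UTF-8 encoded name
gbk_name=$(printf '\326\320\316\304')            # same characters in GBK

echo "hello" > "$dir/$utf8_name"
ln "$dir/$utf8_name" "$dir/$gbk_name"    # hard link: second entry, same inode
ln -s "$utf8_name" "$dir/alias"          # symlink: small file that stores a path

# Hard links share one inode number.
inode_a=$(ls -i "$dir/$utf8_name" | awk '{ print $1 }')
inode_b=$(ls -i "$dir/$gbk_name" | awk '{ print $1 }')
echo "inodes: $inode_a $inode_b"

# The symlink resolves to the same content.
via_symlink=$(cat "$dir/alias")
echo "via symlink: $via_symlink"

rm -rf "$dir"
```

This only aliases names you create by hand; nothing here converts names automatically.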
3. Is it possible to patch the kernel to translate character encoding between file-system and current environment?
You cannot patch the kernel to do this (see 1.), but you could, in
theory, patch the C library (e.g., glibc) to perform this translation:
always convert file names to UTF-8 when it calls the kernel, and
convert them back to the current encoding when it reads a file name
from the kernel.
A simpler approach could be to write an overlay filesystem with FUSE,
that just redirects any filesystem request to another location after
converting the file name to/from UTF-8. Ideally you could mount this
filesystem in ~/trans, and when an access is made to
~/trans/a/GBK/encoded/path then the FUSE filesystem really accesses
/a/UTF-8/encoded/path.
However, the problem with these approaches is: what do you do with
files that already exist on your filesystem and are not UTF-8 encoded?
You cannot just simply pass them untranslated, because then you don't
know how to convert them; you cannot mangle them by translating
invalid character sequences to ?, because that could create
conflicts...
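The name translation itself, and the failure case just described, can be sketched with iconv. This is not a FUSE filesystem, only the conversion step such a layer would have to perform on each path component; it assumes an iconv that knows the GBK charset (glibc's does), and the sample byte strings are illustrative.

```shell
# A GBK-encoded name, as an application in a GBK locale would pass it.
gbk_name=$(printf '\326\320\316\304')

# What a translating layer would hand to the real filesystem.
utf8_name=$(printf '%s' "$gbk_name" | iconv -f GBK -t UTF-8)
printf '%s' "$utf8_name" | od -An -tx1

# A pre-existing name that is not valid GBK cannot be converted:
# iconv reports an error instead of guessing.
bad_name=$(printf '\377\377')
if printf '%s' "$bad_name" | iconv -f GBK -t UTF-8 >/dev/null 2>&1; then
    status=converted
else
    status=failed
fi
echo "invalid name: $status"
```

Any policy for the failed case (skip, escape, replace) has to be chosen by the translating layer, and each choice can collide with a legitimate name.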
e4fsck supports the -D flag, which seems to do what you want:
try to optimize all directories, either by reindexing them if the filesystem supports directory indexing, or by sorting and compressing directories for smaller directories, or for filesystems using traditional linear directories.
Of course, you'll need to unmount the filesystem to use fsck, meaning downtime for your server.
You'll want to use the -f option to make sure e4fsck processes the file system even if clean.
Testing:
# truncate -s1G a; mkfs.ext4 -q ./a; mount ./a /mnt/1
# mkdir /mnt/1/x; touch /mnt/1/x/{1..4000}
# ls -ld /mnt/1/x
drwxr-xr-x 2 root root 69632 Nov 22 12:54 /mnt/1/x/
# rm -f /mnt/1/x/*
# ls -ld /mnt/1/x
drwxr-xr-x 2 root root 69632 Nov 22 12:55 /mnt/1/x/
# umount /mnt/1
# e2fsck -f -D ./a
e2fsck 1.43.3 (04-Sep-2016)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 3A: Optimizing directories
Pass 4: Checking reference counts
Pass 5: Checking group summary information
./a: ***** FILE SYSTEM WAS MODIFIED *****
./a: 12/65536 files (0.0% non-contiguous), 12956/262144 blocks
# mount ./a /mnt/1
# ls -ld /mnt/1/x
drwxr-xr-x 2 root root 4096 Nov 22 12:55 /mnt/1/x/
The ls command, or even TAB-completion or wildcard expansion by the shell, will normally present their results in alphanumeric order. This requires reading the entire directory listing and sorting it. With ten million files in a single directory, this sorting operation will take a non-negligible amount of time.
If you can resist the urge of TAB-completion and e.g. write the names of files to be zipped in full, there should be no problems.
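If you do need a listing, one way to sidestep the sort is to ask for the entries in directory order. The sketch below uses the standard -f option of ls and, as an alternative, find, which prints names as it reads them; the three-file scratch directory is only a stand-in for a huge one.

```shell
dir=$(mktemp -d)
touch "$dir/c" "$dir/a" "$dir/b"

# -f: do not sort; entries are printed in directory order.
# It also lists the . and .. entries, so three files give five lines.
unsorted=$(ls -f "$dir" | wc -l | tr -d ' ')
echo "ls -f lines: $unsorted"

# find also emits names as it encounters them, without sorting.
found=$(find "$dir" -type f | wc -l | tr -d ' ')
echo "find matches: $found"

rm -rf "$dir"
```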
Another problem with wildcards might be wildcard expansion possibly producing more filenames than will fit on a maximum-length command line. The typical maximum command line length will be more than adequate for most situations, but when we're talking about millions of files in a single directory, this is no longer a safe assumption. When a maximum command line length is exceeded in wildcard expansion, most shells will simply fail the entire command line without executing it.
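As a rough check, you can compare the system limit with the size of an expansion before relying on it. The sketch below is only an estimate: the real limit also has to cover the environment and per-argument overhead, and the tiny scratch directory is illustrative.

```shell
# Upper bound, in bytes, on arguments plus environment for a new process.
limit=$(getconf ARG_MAX)
echo "ARG_MAX: $limit bytes"

dir=$(mktemp -d)
touch "$dir/a" "$dir/b" "$dir/c"

# printf is a shell builtin, so the expansion itself never hits the limit;
# this measures how many bytes the expanded names would take.
needed=$(cd "$dir" && printf '%s\n' ./* | wc -c | tr -d ' ')
echo "expansion size: $needed bytes"

rm -rf "$dir"
```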
This can be solved by doing your wildcard operations with the find command, in the form find ... -exec ... \+ or a similar syntax, whenever possible. This form automatically takes into account the maximum command line length, and executes the command as many times as required while fitting the maximal amount of filenames to each command line.
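A small sketch of that batching behaviour follows. The file names and the count of 2000 are arbitrary, and the inner sh -c command stands in for whatever you actually want to run; it simply reports how many names each invocation received.

```shell
dir=$(mktemp -d)

# Create some files to match.
i=1
while [ "$i" -le 2000 ]; do
    : > "$dir/file$i.log"
    i=$((i + 1))
done

# find passes as many names as fit to each invocation of the command.
# Each invocation prints the number of names it was given.
counts=$(find "$dir" -name '*.log' -exec sh -c 'echo "$#"' sh {} +)

total=$(printf '%s\n' "$counts" | awk '{ s += $1 } END { print s }')
batches=$(printf '%s\n' "$counts" | wc -l | tr -d ' ')
echo "files handled: $total in $batches invocation(s)"

rm -rf "$dir"
```

With millions of files the number of invocations grows, but each one stays within the command line limit, and the pattern is matched by find rather than expanded by the shell.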