When you run ls without arguments, it just opens the current directory, reads all the entries, sorts them and prints them out.
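If you want to confirm that, one rough way (a sketch only; the exact syscall names, such as openat and getdents64, depend on the architecture and the strace version) is to trace just the directory-reading calls:

strace -e trace=openat,getdents64 ls > /dev/null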
When you run ls *, the shell first expands *, which is effectively the same work the simple ls did: it builds an argument vector with all the files in the current directory and then calls ls. ls then has to process that argument vector and, for each argument, calls access(2)¹ on the file to check its existence. Then it prints the same output as the first (simple) ls. Both the shell's processing of the large argument vector and ls's are likely to involve a lot of memory allocation in small blocks, which can take some time. However, since there was little sys and user time but a lot of real time, most of the time was spent waiting for disk rather than using the CPU for memory allocation.
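To make the expansion step concrete, here is a minimal sketch showing the argument vector the shell would build for ls * (one expanded name per line, then the argument count):

printf '%s\n' *
set -- *; echo "$#"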
Each call to access(2) needs to read the file's inode to get the permission information. That means many more disk reads and seeks than simply reading a directory. I do not know how expensive these operations are on your GPFS, but given the comparison you've shown with ls -l, which has a similar run time to the wildcard case, the time needed to retrieve the inode information appears to dominate. If GPFS has a slightly higher latency than your local filesystem on each read operation, we would expect that to be more pronounced in these cases.
The 50% difference between the wildcard case and ls -l could be explained by the ordering of the inodes on disk. If the inodes were laid out successively in the same order as the filenames in the directory, and ls -l stat(2)ed the files in directory order before sorting, ls -l would possibly read most of the inodes in one sweep. With the wildcard, the shell sorts the filenames before passing them to ls, so ls will likely read the inodes in a different order, adding more disk head movement.
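A rough way to test that hypothesis, assuming you can drop the kernel's caches between runs (this needs root, and GPFS may keep caches of its own that it does not affect):

sync; echo 3 > /proc/sys/vm/drop_caches
time ls -l > /dev/null       # stats the files roughly in directory order
sync; echo 3 > /proc/sys/vm/drop_caches
time ls -l * > /dev/null     # stats the files in shell-sorted order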
It should be noted that your time output will not include the time taken by the shell to expand the wildcard.
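If you want the expansion included in the measurement, time a child shell that performs both the expansion and the ls run, for example:

time sh -c 'ls * > /dev/null'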
If you really want to see what's going on, use strace(1):
strace -o /tmp/ls-star.trace ls *
strace -o /tmp/ls-l-star.trace ls -l *
and have a look at which system calls are being performed in each case.
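For a quicker overview than reading the full traces, strace can also print a per-syscall summary (call counts and time spent) with -c:

strace -c -o /tmp/ls-star.summary ls * > /dev/null
strace -c -o /tmp/ls-l-star.summary ls -l * > /dev/null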
¹ I don't know whether access(2) is actually used, or something else such as stat(2). But both probably require an inode lookup. (I'm not sure whether access(file, 0) would bypass the inode lookup.)
I have reformulated your questions a bit, for reasons that should become evident when you read them in sequence.
1. Is it possible to configure a Linux filesystem to use a fixed character encoding for storing file names, regardless of the LANG/LC_ALL environment variables?
No, this is not possible: as you mention in your question, a UNIX file name is just a sequence of bytes; the kernel knows nothing about the encoding, which is entirely a user-space (i.e., application-level) concept.
In other words, the kernel knows nothing about LANG/LC_*, so it cannot translate.
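You can see this for yourself: the kernel stores whatever bytes it is given, whatever your locale says. For example (0xB0 0xA1 is the GB2312/GBK encoding of 啊; od is only used to show the raw bytes coming back unchanged):

touch "$(printf '\260\241')"
ls | od -c | head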
2. Is it possible to have different file names refer to the same file?
You can have multiple directory entries referring to the same file; you can create them with hard links or symbolic links.
Be aware, however, that file names that are not valid in the current encoding (e.g., your GBK character string when you're working in a UTF-8 locale) will display badly, if at all.
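As a small sketch of that idea (the file name 啊.txt is just an example; iconv converts only the name, not the file's contents):

touch '啊.txt'                                     # a file with a UTF-8-encoded name
gbkname=$(printf '啊.txt' | iconv -f UTF-8 -t GBK)
ln '啊.txt' "$gbkname"                             # hard link; or: ln -s '啊.txt' "$gbkname"

The second name will look garbled in a UTF-8 locale, exactly as described above, but both names refer to the same file.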
3. Is it possible to patch the kernel to translate the character encoding between the filesystem and the current environment?
You cannot patch the kernel to do this (see 1.), but you could, in theory, patch the C library (e.g., glibc) to perform this translation, always converting file names to UTF-8 when it calls the kernel, and converting them back to the current encoding when it reads a file name from the kernel.
A simpler approach could be to write an overlay filesystem with FUSE that just redirects any filesystem request to another location after converting the file name to or from UTF-8. Ideally you would mount this filesystem on ~/trans, and when an access is made to ~/trans/a/GBK/encoded/path the FUSE filesystem would really access /a/UTF-8/encoded/path.
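As a side note, FUSE filesystems along these lines already exist; convmvfs is one of them. The invocation below is only an illustration: the source directory is a placeholder and the option names are an assumption on my part, so check its documentation before relying on them.

mkdir -p ~/trans
convmvfs ~/trans -o srcdir=/data,icharset=UTF-8,ocharset=GBK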
However, the problem with these approaches is: what do you do with files that already exist on your filesystem and are not UTF-8 encoded? You cannot simply pass them through untranslated, because then you don't know how to convert them; and you cannot mangle them by translating invalid character sequences to ?, because that could create conflicts...
Best Answer
The answer, as is often the case, is “it depends”.
Looking at the NTFS implementation in particular, it reports a maximum file name length of 255 to statvfs callers, so callers which interpret that as a 255-byte limit might pre-emptively avoid file names which would be valid on NTFS. However, most programs don't check this (or even NAME_MAX) ahead of time, and rely on ENAMETOOLONG errors instead. In most cases the important limit is PATH_MAX, not NAME_MAX; that's what's typically used to allocate buffers when manipulating file names (at least in programs that don't allocate path buffers dynamically, as expected by OSes such as the Hurd, which doesn't impose arbitrary limits).
The NTFS implementation itself doesn't check file name lengths in bytes, but in 2-byte units; file names which can't be represented in an array of 255 2-byte elements will cause an ENAMETOOLONG error.
Note that NTFS is generally handled by a FUSE driver on Linux. The kernel driver currently only supports UCS-2 characters, but the FUSE driver supports UTF-16 surrogate pairs (with the corresponding reduction in character length).
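If you want to see what a given mount actually reports for these limits, getconf can query the corresponding pathconf values from the shell (the NTFS mount point below is an assumed path):

getconf NAME_MAX /mnt/ntfs
getconf PATH_MAX /mnt/ntfs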