While implementing the C string library myself, I noticed that glibc and the Linux kernel implement some functions differently. For instance, glibc's `memchr` and `strchr` use tricks to speed the functions up, but the kernel's `memchr` and `strchr` don't. Why aren't the Linux kernel functions optimized like glibc's?
Linux Kernel – Why No Optimized Functions Like GLIBC
glibc, libraries, linux-kernel
Related Solutions
Factors like PLT indirection or `syscall()`'s variadic-ness (registers have to be saved to memory) should play little role, given that `getdents` is one of the most expensive calls in Linux.
Fully reading an empty directory on my machine costs about 5µs, a 100-item directory 37µs, a 1000-item directory 340µs and a 10,000-item directory 3.79ms.
What `fdopendir` + `readdir` does on top of `getdents` is add a buffer allocation/deallocation (0.1µs) and a `stat` check that the supplied fd is of the directory variety (0.4µs). `readdir` then makes one cheap call per directory entry (it moves a position in the buffer and possibly refills it).
So there is something like a one-time overhead of 0.5µs, which is 10% of the directory scanning time for empty directories, but only 1% for 100-item directories and negligibly little for larger ones. This overhead is 5 times smaller (only the allocation/deallocation cost) if you don't need `fdopendir` (you only need `fdopendir` if you can't `opendir` directly and therefore must go through a separately obtained (e.g., `openat`'ed) file descriptor).
So if you use a custom, one-time-allocated buffer along with `getdents`, you can save 2-10% of the scanning cost on empty directories and negligibly little on the bigger ones.
As for the `readdir` calls: the cost of PLT indirection on modern hardware is typically less than 1ns, and function call overhead is about 1-2ns. Given that directory scanning times are on the order of microseconds, you'd need to make at least 1000 `readdir` calls for these factors to translate into a single µs, but then the scanning cost is 340µs and the accrued extra µs is about 0.3% of that -- a negligible effect. Inlining these `readdir` calls (thus eliminating the call overhead and the PLT overhead) would only serve to expand the code; it wouldn't much improve performance, as `getdents` is very much the bottleneck there.
(`readdir_r` is more expensive because of additional locking, but you shouldn't need `readdir_r`, because plain `readdir` calls are typically thread-safe unless you have multiple threads calling them on the same directory stream. POSIX might not be explicit about this yet, but I believe this guarantee should be getting standardized soon, given that glibc went as far as to deprecate `readdir_r`.)
Reason for thin archives
I've pinged Nicholas Piggin by email, who is one of the authors of the patches, and he explained that the thin archives are not just to reduce disk usage, but they can also prevent a link failure.
The problem is that the incrementally linked object files could get so large that the linker could not even insert trampolines, which must point to generated code that goes between the objects.
I haven't yet received a reply about the rationale for the incremental builds.
This is his awesome reply:
It's a pretty long answer depending on how much you know. There are a few reasons. Stephen's primary motivation for the patch was to allow very large kernels to link successfully.
Some other benefits are:
- It is a "nicer" way to store the intermediate build artifacts: you keep the output code in a single place and track it with references (thin archives) until it's all linked together. So there is less IO and less disk space required, particularly with big builds and debug info.
- For the average modern workstation, just building to a small number of output directories -- and Linux is not really a huge project -- this will all be in cache, and the time to incrementally link files is very fast. So the build-speed benefit is usually pretty small for Linux.
- It allows the linker to generate slightly better code, by rearranging files and locating linker stubs more optimally.
- It tends to work much better with LTO builds, although there's not much support for LTO builds upstream yet.
But we'll get back to the primary motivation.
When you build a relocatable object file that hasn't been finally linked, you have a blob of code with a bunch of references to symbols for functions and variables that are defined elsewhere.
```
--- a.S ---
bl myfunc
---
```
assembles into
```
a.o:     file format elf64-powerpcle

Disassembly of section .text:

0000000000000000 <.text>:
   0:	01 00 00 48 	bl      0x0
```
So the code has a branch to NIA+0 (i.e., itself), which is not what we asked for. Dumping relocations shows the missing bit:
```
Disassembly of section .text:

0000000000000000 <.text>:
   0:	01 00 00 48 	bl      0x0
			0: R_PPC64_REL24	myfunc
```
The relocation is not in the .text section, it's not code, but it is some ELF metadata which says the instruction at this location has a 24-bit relative offset to a symbol called myfunc.
When you do a "final link" of objects together, the files are basically concatenated together, and these relocations are resolved by adjusting code and data to point to the correct locations.
Linking a.S with b.S that contains myfunc symbol gives this:
```
c:     file format elf64-powerpcle

Disassembly of section .text:

00000000100000d8 <_start>:
    100000d8:	05 00 00 48 	bl      100000dc <myfunc>

00000000100000dc <myfunc>:
    100000dc:	01 00 63 38 	addi    r3,r3,1
    100000e0:	20 00 80 4e 	blr
```
The relocation metadata is stripped, and the branch points to the correct offset.
So the linker actually adjusts instructions as it links. It goes one step further than that: it generates instructions. If you have a big build and this branch cannot reach `myfunc` with a 24-bit offset, the linker will put a trampoline (aka stub, aka PLT, procedure linkage table) into the code, at a location reachable within 24 bits; the trampoline then uses a longer branch that can reach the target.
The linker can't just put these trampolines anywhere in the code, because if you add something in the middle of code, that breaks relative references that go across the middle. The linker does not know about all references in a .o file, only the unresolved ones. So the linker must only put trampolines between .o files when it links them together, before it resolves their references.
The old incremental build approach just combines .o files into bigger .o files as you get closer to the root of the build directory. So you run into a problem when your .o files become so large that a branch cannot reach outside its own .o file in order to reach a trampoline. There is no way to resolve that reference.
With thin archives, the final link is done on thousands of very small .o files. This gives the linker maximum flexibility to place these trampolines, which means you never encounter this limitation.
Best Answer
The kernel does have optimised versions of some of these functions, in the arch-specific directories; see for example the x86 implementation of `memchr` (see all the `memchr` definitions, and all the `strchr` definitions). The versions you found are the fallback generic versions; you can spot these by looking for the protective check, `#ifndef __HAVE_ARCH_MEMCHR` for `memchr` and `#ifndef __HAVE_ARCH_STRCHR` for `strchr`.

The C library's optimised versions do tend to use more sophisticated code, so the above doesn't explain why the kernel doesn't go to such great lengths to go fast. If you can find scenarios where the kernel would benefit from a more optimised version of one of these functions, I imagine a patch would be welcome (with appropriate supporting evidence, and as long as the optimised function is still understandable -- see this old discussion regarding `memcpy`). But I suspect the kernel's uses of these functions often won't make it worth it; for example, `memcpy` and related functions tend to be used on small buffers in the kernel. And never discount the speed gains from a short function which fits in cache or can be inlined... In addition, as mentioned by Iwillnotexist Idonotexist, MMX and SSE can't easily be used in the kernel, and many optimised versions of memory-searching or copying functions rely on those.
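For comparison, the generic fallback behind `#ifndef __HAVE_ARCH_MEMCHR` is essentially a byte-at-a-time loop; here is a sketch in that spirit (not the kernel's verbatim source):

```c
#include <stddef.h>

/* Byte-at-a-time memchr, in the spirit of the kernel's generic
 * fallback in lib/string.c.  Simple, short, and easy to inline --
 * which matters more than raw throughput for small kernel buffers. */
static void *generic_memchr(const void *s, int c, size_t n)
{
    const unsigned char *p = s;
    while (n-- != 0) {
        if (*p == (unsigned char)c)
            return (void *)p;
        p++;
    }
    return NULL;
}
```

Nothing clever happens here, and that is the point: for the short buffers the kernel typically scans, the loop is already competitive, and its small size helps it stay in cache or get inlined.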
In many cases, the version used ends up being the compiler's built-in version anyway, and these are heavily optimised, much more so than even the C library's can be (for example, `memcpy` will often be converted to a register load and store, or even a constant store).
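To give a flavour of the sort of trick the glibc-style optimised versions rely on, here is the classic word-at-a-time zero-byte test (a standalone sketch assuming 64-bit words, not glibc's actual source), which lets a scanning loop examine eight bytes per iteration instead of one:

```c
#include <stdint.h>

/* Returns nonzero iff any byte of x is zero.  This is the exact form
 * of the (x - 0x01...01) & ~x & 0x80...80 trick: subtracting 1 from
 * each byte borrows into the high bit only for bytes that were zero,
 * and the ~x term masks out false positives from bytes >= 0x80.
 * Real implementations also handle alignment and the tail bytes. */
static int has_zero_byte(uint64_t x)
{
    return ((x - 0x0101010101010101ULL)
            & ~x
            & 0x8080808080808080ULL) != 0;
}
```

An optimised `strchr`-style routine XORs each word with the repeated search byte first, so a zero byte in the result marks a match; that is the kind of "trick" the question alludes to.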