Linux – Performance of calling POSIX-specified functions versus direct Linux kernel calls

glibclinuxperformanceposixsyscalls

In an answer over on Stack Overflow, I provided a code sample to perform some small task referenced in the question. The original question had to do with the fastest-performing technique (so performance criteria are in play, here).

Another commenter/answerer suggested that making a POSIX-defined ~~system~~ API call (in this case, readdir) was not as fast as making a direct system call into the kernel (syscall(SYS_getdents,...)) and the claimed performance difference is in the 25% range. (I didn't implement and re-benchmark; I believe that the performance could in fact be better.)

My question is about the performance characteristics of the proposed syscall-based solution and why they might be faster. I can think of a few reasons why performance might be better:

POSIX readdir is inherently more complicated than syscall(SYS_getdents,...)/getdents()
readdir (which presumably calls syscall(SYS_getdents,...) simply adds the overhead of indirection
readdir only returns one record (per kernel-call) versus syscall(SYS_getdents,...)/getdents()` which returns (presumably) more than one record per kernel-call

I can't imagine that #1 above is true. readdir and getdents are so similar that the implementation of readdir in glibc simply can't have many more "true" system calls than a direct-invocation of syscall(SYS_getdents,...)/getdents() would invoke.

I can't imagine that #2 is true, either, since calling readdir likely wraps getdents and also syscall(SYS_getdents,...) likely calls getdents as well (the proposed answer specifically uses syscall(SYS_getdents,...) instead of calling getdents directly. It's possible that everything within glibc on Linux boils down to syscall(syscallid, args) in which case #2 probably is true.

The last possibility seems to me to be the best explanation: fewer calls into the kernel simply results in faster performance.

Is there any specific explanation for why a "direct kernel call" would be measurably faster than calling a POSIX-defined function?

Best Answer

Factors like PLT indirection or syscall()'s variadic-ness (registers have to be saved to memory) should play little role given that getdents is one of the most expensive calls in Linux.

Fully reading an empty directory on my machine costs about 5µs, a 100-item directory 37µs, a 1000-item directory 340µs and a 10,000-item directory 3.79ms.

What fdopendir+readdir does on top of getdents is it adds a buffer allocation/deallocation (0.1µs) and a stat check that the supplied fd is of the directory variety (0.4µs). readdir then makes one cheap call per directory entry (moves a position in the buffer and possibly refills).

So there is something like a one-time overhead of 0.5µs, which is 10% of the directory scanning time for empty directories, but only 1% for 100-item directories and negligibly little for larger ones. This overhead is 5 times less (only the allocation/deallocation cost) if you you don't need fdopen. (you only need fdopen if you can't diropen directly and therefore must go through a separately obtained (e.g., openat'ted) filedescriptor).

So if you use a custom one-time allocated buffer along with getdents, you can save 2-10% on the scanning cost of empty directories and negligibly little on the bigger ones.

As for the readdir calls, the cost of PLT indirection on modern hardware is typically less than a 1ns, function call overhead is about 1-2ns. Given that the directory scanning times are in the order of microseconds, you'd need to make at least 1000 readdir calls for these factors to translate into a single µs, but then the scanning cost is 340µs and the accrued 1 extra µs is like 0.3% of that -- negligible effect. Inlining these readdir (thus eliminating the call overhead and the PLT overhead) would only serve to expand the code, but it wouldn't much improve performance as getdents is very much the bottleneck there.

(readdir_r is more expensive because of additional locking, but you shouldn't need readdir_r because plain readdir calls are typically thread-safe unless you have multiple threads calling them on the same directory stream. POSIX might not be explicit about this yet, but I believe this guarantee should be getting standardized soon given that glibc went as far as to deprecate readdir_r.)

References

Linux Performance

Linux – Copy a local file to a remote server, non-root privileges over sudo

I have found an elegant way to work around those issues using a named pipe. The idea is to run your tar command on the remote side through a FIFO socket and make that socket readable by your regular user.

Here is an example. On the remote side, you first create the socket:

remote$ sudo -s 
remote# mkfifo -m 600 /home/anarcat/tmpfifo
remote# chown anarcat /home/anarcat/tmpfifo

On the local side, you can already start reading that socket:

local$ ssh example.net "cat tmpfifo" | pv -s 2G | tar xfz -

(The pv -s 2G | part is entirely optional, to get a nice progress bar with pipe viewer.)

Then on the remote side, you can start writing to it:

remote# tar cfz /home/anarcat/tmpfifo files

This will copy files over through that SSH connexion. Of course, you could have also created that tarball directly on the remote server and made it accessible to the user, but then it would require all the space of the archive, which may not be available.

Once this is done, you can remove the fifo as a simple file:

remote$ rm /home/anarcat/tmpfifo

Best Answer

Related Solutions

linux – Diagram of Linux Kernel vs. Performance Tools

References

Linux – Copy a local file to a remote server, non-root privileges over sudo

Related Question