Linux – Temporarily cache and write-buffer a directory (to speed up a build process on an NFS share)

Tags: cache, filesystems, linux, nfs, tmpfs

Overview

This question is structured as follows:
I first give some background on why I am interested in this topic and how it would solve a problem I am dealing with.
Then, I ask the actual standalone question regarding file system caching, so if you are not interested in the motivation (some C++ project build setup), just skip the first section.

The initial problem: Linking shared libraries

I am looking for a way to speed up our project's build times. The setup is as follows: A directory (let's call it workarea) is located on an NFS share.
It initially contains only source code and makefiles. The build process first creates static libraries in workarea/lib and then creates shared libraries in workarea/dll, using the static libraries in workarea/lib. While the shared libraries are created, they are not only written but also read again, e.g. with nm, to verify at link time that no symbols are missing. With many jobs in parallel (e.g. make -j 20 or make -j 40), build times are quickly dominated by linking time, and linking performance is limited by file system performance. For example, linking with 20 parallel jobs takes roughly 35 seconds on the NFS share, but only 5 seconds on a RAM drive. Using rsync to copy dll back to the NFS share takes another 6 seconds, so working on a RAM drive and syncing to NFS afterwards is much faster than working directly on the NFS share. I am looking for a way to achieve the fast performance without explicitly copying / linking files between the NFS share and the RAM drive.
Note that our NFS share already uses a cache, but this cache can only cache read accesses.
AFAIK, NFS does not allow a client to acknowledge a write before the NFS server has confirmed its completion, so the client cannot use a local write buffer, and write throughput (even in bursts) is limited by network speed. This effectively caps combined write throughput at roughly 80 MB/s in our setup.
Read performance, however, is much better, as a read cache is used. If I do the linking (i.e. create the contents of dll) with workarea/lib on NFS and workarea/dll being a symlink to the RAM drive, performance is still good – roughly 5 seconds. Note that it is required for the build process to finish with workarea/* residing on the NFS share: lib needs to be on the share (or any persistent mount) to allow fast incremental builds, and dll needs to be on NFS to be accessed by compute machines starting jobs using these dlls.
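For reference, the symlink experiment just described can be reproduced roughly as follows; the /dev/shm path and directory names are assumptions about my setup, not part of the actual build scripts:

    # point workarea/dll at a RAM-backed directory (/dev/shm is usually a tmpfs,
    # so no root is required); directory names are just examples
    mkdir -p /dev/shm/workarea-dll
    mv workarea/dll workarea/dll.nfs        # keep the original directory around
    ln -s /dev/shm/workarea-dll workarea/dll

    # ... run the link step, which now writes into the RAM drive ...

    # afterwards, copy the results back to the NFS share and restore the layout
    rsync -a /dev/shm/workarea-dll/ workarea/dll.nfs/
    rm workarea/dll && mv workarea/dll.nfs workarea/dll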
Hence, I would like to apply a solution to the problem below to workarea/dll and maybe also workarea/lib (the latter in order to improve compilation times). The requirement of fast setup times below is caused by the necessity to perform fast incremental builds, only copying data if required.

Update

I should probably have been a bit more specific about the build setup. Here are some more details: Compilation units are compiled into .o files in a temporary directory (in /tmp). These are then merged into static libraries in lib using ar (a command-level sketch is shown after the list below). The complete build process is incremental:

  • Compilation units are only recompiled if the compilation unit itself (the .C file) or an included header has changed (using compiler-generated dependency files which are included into make).
  • Static libraries are only updated if one of its compilation units has been recompiled.
  • Shared libraries are only relinked if one of their static libraries has changed. Symbols of a shared library are only re-checked if the symbols provided by the shared libraries it depends on have changed or if the shared library itself has been updated.
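For illustration, the individual steps look roughly like the following; the exact flags, file names and paths are assumptions, not the real makefile rules:

    # compile one unit into /tmp, emitting a dependency file for make to include
    # (-MMD/-MP are the usual gcc/clang flags for this; the real flags differ)
    g++ -std=c++11 -fPIC -MMD -MP -c foo.C -o /tmp/build/foo.o

    # merge the object files of a library into a static archive in workarea/lib
    ar rcs workarea/lib/libfoo.a /tmp/build/foo.o /tmp/build/bar.o

    # relink the shared library from the static libraries and re-check its symbols
    g++ -shared -o workarea/dll/libfoo.so \
        -Wl,--whole-archive workarea/lib/libfoo.a -Wl,--no-whole-archive
    nm -D workarea/dll/libfoo.so | grep ' U '   # undefined symbols show up as U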

Still, complete or near-complete rebuilds are necessary quite often, since multiple compilers (gcc, clang), compiler versions, compilation modes (release, debug), C++ standards (C++98, C++11) and additional modifications (e.g. libubsan) may be used. All combinations effectively use different lib and dll directories, so one can switch between setups and build incrementally based on the last build for that very setup. Also, incremental builds often recompile just a few files, which takes very little time, but this triggers relinking of (possibly large) shared libraries, which takes much longer.

Update 2

In the meantime I learned about the nocto NFS mount option, which apparently could solve my problem on basically all NFS implementations except for Linux's, since Linux always flushes write buffers on close(), even with nocto. We already tried several other things: For example, we could use another local NFS server with async enabled that serves as a write buffer and exports the main NFS mount, but unfortunately the NFS server itself does no write buffering in this case. It seems that async just means that the server does not force its underlying file system to flush to stable storage, and a write buffer is only used implicitly in case the underlying file system itself buffers writes (as apparently is the case for the file system on the physical drive).
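For completeness, the two attempts above correspond roughly to the following; server names, paths and export options are assumptions, and as described, neither gave us a client-side write buffer on Linux:

    # mount the share with nocto (requires root); on Linux this still flushes
    # dirty data on close(), so it did not help in our case
    mount -t nfs -o nocto nfsserver:/export/workarea /mnt/workarea

    # /etc/exports entry on the intermediate server re-exporting with async
    # (not a shell command); async only relaxes flushing to stable storage
    /mnt/workarea  *(rw,async,no_subtree_check,fsid=1)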
We even thought about the option to use a non-Linux virtual machine on the same box that mounts the main NFS share using nocto, providing a write buffer, and providing this buffered mount via another NFS server, but have not tested it and would like to avoid such a solution.
We also found several FUSE-based file system wrappers serving as caches, but none of these implemented write buffering.

Caching and buffering a directory

Consider some directory, let's call it orig, which resides in a slow file system, e.g. an NFS share. For a short timespan (e.g. seconds or minutes, but this should not matter anyway), I would like to create a fully cached and buffered view of orig using a directory cache, which resides in a fast file system, e.g. a local hard drive or even a RAM drive. The cache should be accessible via e.g. a mount point cached_view and not require root privileges. I assume that for the lifetime of the cache, there are no read or write accesses directly to orig (besides the cache itself, of course).
By fully cached and buffered, I mean the following:

  1. Read queries are answered by forwarding the query to the file system of orig once, caching the result, and using the cached copy from then on, and
  2. Write queries are written into cache and acknowledged immediately upon completion, i.e. the cache is also a write buffer. This should happen even when close() is called on the written file. In the background, the writes are then forwarded (maybe using a queue) to orig. Read queries on written data are answered using the data in cache, of course.

Furthermore, I need:

  1. The cache provides a function to shut the cache down, which flushes all writes to orig. The runtime of the flush should depend only on the size of the written files, not on the size of all files. Afterwards, one could safely access orig again.
  2. Setup is fast, i.e. initialization of the cache may depend only on the number of files in orig, but not on their size, so copying orig into the cache once is not an option.

Finally, I would also be fine with a solution that does not use another file system as cache, but just caches in main memory (the servers have plenty of RAM). Note that using the built-in caches of e.g. NFS is not an option, as AFAIK NFS does not allow write buffers (cf. the first section).

In my setup, I could emulate a slightly worse behavior by symlinking the contents of orig into cache, then working in cache (as all write operations actually replace files by new files, the symlinks are simply replaced by the updated versions), and rsyncing the modified files back to orig afterwards.
This does not exactly meet the requirements above, e.g. reads are not done only once and files appear as symlinks, which of course does make a difference for some applications.
I assume this is not the correct way to solve this (even in my simpler setting), and maybe someone is aware of a cleaner (and faster!) solution.
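A rough sketch of that emulation, assuming /dev/shm as the fast location and that all writes replace whole files (paths are placeholders):

    # build a symlink farm: cache mirrors the directory tree of orig,
    # with each file replaced by a symlink (cp -s needs absolute source paths)
    mkdir -p /dev/shm/cache
    cp -as /path/to/orig/. /dev/shm/cache/

    # ... work inside /dev/shm/cache; writes replace the symlinks with real files ...

    # copy back only regular files, i.e. everything that was (re)written;
    # untouched entries are still symlinks and are skipped by --no-links
    rsync -a --no-links /dev/shm/cache/ /path/to/orig/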

Best Answer

Wow, surprised nobody answered "overlayfs" yet.

Actually I have two suggestions. The first is to use overlayfs, which is basically exactly what you're describing, with one caveat. Overlayfs (standard since Linux 3.18 or so) lets you read from two virtually merged directory trees while writing to only one of them. What you would do is take the fast storage (like tmpfs) and overlay it onto the NFS volume, then perform your compilation in the overlaid merge of the two. When you are done, there have been zero writes to any file on NFS, and the other filesystem is holding all your changes. If you want to keep the changes, you can just rsync them back to the NFS volume. You could even exclude files that you don't care about, or just hand-pick a few files out of the result.
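A minimal sketch of that setup, assuming the NFS workarea is mounted at /nfs/workarea and a tmpfs under /dev/shm holds the upper layer (mounting an overlay needs root or equivalent namespace privileges):

    # upperdir and workdir must live on the same (fast) filesystem
    mkdir -p /dev/shm/ovl/upper /dev/shm/ovl/work /mnt/merged
    mount -t overlay overlay \
        -o lowerdir=/nfs/workarea,upperdir=/dev/shm/ovl/upper,workdir=/dev/shm/ovl/work \
        /mnt/merged

    # ... build inside /mnt/merged; all writes land in /dev/shm/ovl/upper ...

    # keep the changes by syncing the upper layer back to NFS
    # (files deleted in the merged view show up as whiteout entries in the
    #  upper layer and may need special handling)
    rsync -a /dev/shm/ovl/upper/ /nfs/workarea/
    umount /mnt/merged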

You can see a relatively simple example of overlayfs in a little project of mine: https://github.com/nrdvana/squash-portage/blob/master/squash-portage.sh That script also shows how to use UnionFS in case you're on an older kernel that doesn't have overlayfs.

In my case, the rsync command used by Gentoo to update its software library takes an insanely long time because it makes millions of tiny disk writes. I use overlayfs to write all the changes to tmpfs, and then I use mksquashfs to build a compressed image of the tree. Then I throw the tmpfs away and mount the compressed image in its place.
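That workflow looks roughly like the following; paths and options are assumptions, see the linked script for the real details:

    # compress the merged tree into a squashfs image ...
    mksquashfs /mnt/merged /var/cache/portage.sqfs -comp xz
    # ... then discard the tmpfs upper layer and mount the image in its place
    umount /mnt/merged
    mount -t squashfs -o loop,ro /var/cache/portage.sqfs /usr/portage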

My second suggestion is an "out of tree" build. The idea is that you keep the source code and makefiles in one tree, and you tell automake to generate all of its intermediate files in a separate tree that mirrors the first.

If you're lucky, your build tool (automake or whatnot) can already do this. If you're not lucky, you might have to suffer some headaches tinkering with your makefiles.
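As a sketch, if the project were autotools-based this is just the usual out-of-tree invocation (directory names assumed):

    # keep sources on NFS, generate all objects in a fast local build directory
    mkdir -p /dev/shm/objdir && cd /dev/shm/objdir
    /nfs/workarea/configure    # autoconf-generated configure supports building out of tree
    make -j20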
