Effect of file organization on efficiency of file access

directory, filesystems, performance

The organization of files in a directory structure can markedly impact the efficiency of file access (ref). Consider, for example, two directory trees, A and B, each containing 10^6 files: in A they all reside in a single directory, while in B they are spread across 10^3 subdirectories of 10^3 files each. Counting or listing all of the files is reproducibly slower in the former case. On my system:

Create files:

$ mkdir A; pushd A; seq -w 1E6 | xargs touch; popd
$ mkdir B; pushd B; seq -w 1E3 | xargs mkdir; for d in *; do pushd "$d"; seq -w 1E3 | xargs touch; popd; done; popd

List files:

$ for d in A B; do
    time for i in {1..100}; do
        {
            echo 3 | sudo tee /proc/sys/vm/drop_caches   # flush page cache, dentries, and inodes
            find "$d" -type f
        }
    done >/dev/null
done

# case A (0 subdirectories)
real    4m49.805s
user    1m43.696s
sys     1m13.540s

# case B (1000 subdirectories)
real    3m32.338s
user    0m40.824s
sys     1m13.176s

The difference is reproducible, independent of disk caching, and not specific to the find command (i.e., a difference of the same magnitude is found using ls -RU). The amount of time spent in kernel space is the same in both cases, exonerating the kernel (and likely the filesystem mechanics as well). Though I haven't done any profiling, the main calls being made are almost certainly readdir() in libc and the underlying getdents() syscall. Since the number of inodes is the same in both cases (to within 0.1%), as is the size of the files, the time required for the kernel to execute these calls would not be expected to differ. The difference in execution time is therefore attributable to user-space code.
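
Both of those claims are easy to spot-check. The commands below are only a sketch of how one might do so (they were not part of the timed runs above) and assume strace is available; the output file names are arbitrary:

$ find A -type f | wc -l                                   # both trees should report 1000000 regular files
$ find B -type f | wc -l
$ strace -c -f -o A.syscalls find A -type f >/dev/null     # per-syscall counts; getdents()/getdents64() should dominate
$ strace -c -f -o B.syscalls find B -type f >/dev/null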

Thread support has been added to some GNU utilities (e.g., coreutils sort). My system has four hardware threads, and since I am not sure whether GNU find (mine is findutils 4.7.0-git) has gained similar multithreaded capabilities, I verified that the discrepancy persists when the process is explicitly bound to a single hardware thread:

$ cat find.sh
#!/bin/bash
# Time 100 cold-cache traversals of the tree given as $1
for i in {1..100}; do
  {
    echo 3 | sudo tee /proc/sys/vm/drop_caches
    find "$1" -type f
  }
done >/dev/null

$ for d in A B; do time taskset -c 0 ./find.sh "$d"; done

# case A (0 subdirectories)
real    4m7.827s
user    1m31.080s
sys     0m55.660s

# case B (1000 subdirectories)
real    2m53.059s
user    0m33.256s
sys     0m52.772s

Thus, my original question can be refined as follows: what user-space inefficiencies underlie the disparity in access times that stems purely from differences in filesystem organization? Knowledge of these inefficiencies should permit better implementations of file-access routines.
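
A user-space profile seems like the natural next step. Something along the following lines (only a sketch; it assumes perf is installed and that the kernel's perf_event_paranoid setting permits profiling) should show where find spends the extra user time in case A:

$ echo 3 | sudo tee /proc/sys/vm/drop_caches
$ perf record -g -o perf.A.data -- find A -type f >/dev/null
$ perf report -i perf.A.data --sort=dso,symbol             # repeat for B and compare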

Edit:
I am using an ext4 filesystem on a machine running Linux kernel 4.9.0-4-amd64, though I would be interested to know to what extent the answer depends upon the filesystem chosen.
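
For anyone wanting to repeat the comparison on another filesystem without repartitioning, a loopback image should suffice. The following is only a rough sketch; the size, paths, and choice of mkfs are arbitrary, and the image must be large enough to supply roughly 10^6 inodes:

$ truncate -s 20G /tmp/test.img                  # sparse scratch image
$ mkfs.xfs /tmp/test.img                         # or mkfs.ext4, mkfs.btrfs, ...
$ sudo mkdir -p /mnt/fstest
$ sudo mount -o loop /tmp/test.img /mnt/fstest
$ sudo chown "$USER" /mnt/fstest
$ cd /mnt/fstest                                 # recreate A and B here and rerun the timings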

Best Answer

Without much evidence, I'm going to guess that find ends up invoking realloc and copying the directory data several times as it runs out of space in an initial allocation in the single-directory case, whereas in the multi-directory case the memory backing each subdirectory read does not require as many copies or reallocations. You might check the overall memory usage of find (somehow?) to verify.
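
One way to chase the "(somehow?)" above, sketched here with standard tools (GNU time and ltrace; the output file names are arbitrary), is to compare peak resident set size and allocator call counts for the two traversals. If the reallocation guess is right, case A should show a larger maximum RSS and noticeably more realloc() activity:

$ /usr/bin/time -v find A -type f >/dev/null 2>A.rss       # report includes "Maximum resident set size"
$ /usr/bin/time -v find B -type f >/dev/null 2>B.rss
$ ltrace -c -o A.calls find A -type f >/dev/null           # tallies malloc()/realloc()/free() calls (slow)
$ ltrace -c -o B.calls find B -type f >/dev/null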