Linux – `umount -R` on bind mounts takes a non-negligible amount of time, why?

bind-mount, linux, mount, namespace, performance

Why does umount -R take 0.2s to unmount this subtree of bind mounts? It only took 0.02s to mount the subtree, and 0.00s to change its propagation flags.

(I was looking at cloning the current set of mounts into a subdirectory, changing them, and then switching to them using pivot_root. But the delay I'm observing would not really be acceptable for that purpose.)
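
For context, a rough sketch of the kind of workflow I have in mind (not a polished script; error handling and the finer pivot_root restrictions are glossed over, and /mnt/a is just the scratch directory used below):

unshare -m sh -c '
  mount --rbind / /mnt/a          # clone the current mount tree under /mnt/a
  mount --make-rprivate /mnt/a    # make sure nothing propagates back (unshare -m may already default to private)
  # ... modify the mounts under /mnt/a here ...
  cd /mnt/a
  pivot_root . mnt/a              # switch roots; the old root ends up at /mnt/a in the new tree
  cd /
  umount -R /mnt/a                # drop the old tree - this is the step whose delay bothers me
'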

This exercise assumes / and sub-mounts are shared mounts. Linux doesn't make them so by default, but systemd does.
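
(To check, with util-linux findmnt; on this machine it reports "shared" for the root mount:)

# findmnt -o TARGET,PROPAGATION /
TARGET PROPAGATION
/      shared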

# mkdir /mnt/a
# mount --bind /mnt/a /mnt/a --make-private
# time mount --rbind / /mnt/a
0.00user 0.00system 0:00.02elapsed 9%CPU (0avgtext+0avgdata 3020maxresident)k
0inputs+0outputs (0major+135minor)pagefaults 0swaps
# time mount --make-rprivate /mnt/a
0.00user 0.00system 0:00.00elapsed 100%CPU (0avgtext+0avgdata 3184maxresident)k
0inputs+0outputs (0major+136minor)pagefaults 0swaps
# time umount -R /mnt/a
0.00user 0.00system 0:00.19elapsed 9%CPU (0avgtext+0avgdata 3392maxresident)k
0inputs+0outputs (0major+194minor)pagefaults 0swaps

Further tests

Running under strace -cw shows

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 90.44    0.180113        5297        34           umount2
 ...

So it's not super revealing, other than showing that this last operation requires 34 individual umount2() calls, whereas each of the other two operations is a single mount() call using the MS_REC (recursive) flag. As with the headline figures from time, these are wall-clock times. Plain strace -c shows system time instead (i.e. cpu time spent in the kernel), and that only totals 0.009s.
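
For comparison, the two variants side by side (each run needs the rbind re-created first):

strace -cw umount -R /mnt/a    # per-syscall wall-clock totals (the table above)
strace -c umount -R /mnt/a     # per-syscall system-cpu totals only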

It does point to something interesting, though: using umount -l /mnt/a instead reduces the total time to 0.02s. That is a single umount2() call, which detaches the subtree /mnt/a (and does any cleanup in the background).

Looking at the individual calls with strace -ttt -T -e trace=umount2 umount -R /mnt/a, the per-call times are spread fairly evenly: they range from 0.002s to 0.012s, with no clear pattern, and the distribution does not look consistent across repeated runs.
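
For reference, a rough way to summarize those per-call times, assuming the trace is saved to a file first (the file name here is just an example):

strace -ttt -T -e trace=umount2 -o /tmp/umount.trace umount -R /mnt/a
awk -F'[<>]' '/umount2\(/ { n++; s += $2; if ($2 > max) max = $2 }
              END { printf "calls=%d total=%.3fs max=%.4fs\n", n, s, max }' /tmp/umount.trace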


After running umount -R under perf record -a, perf report shows several hotspots in gsd-housekeeping, gvfs-udisks2-volume-monitor, and systemd. The umount process doesn't show up at all. This would be consistent with time showing umount spending negligible cpu time either in the kernel or in userspace.

(I'd be very interested if anyone has a more comprehensive way to aggregate cpu usage per process during a test :).
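
The rough approaches I can think of so far: aggregate perf samples per process name, or watch per-process cpu with pidstat from the sysstat package. Both only catch what their sampling happens to see, so treat these as sketches:

perf record -a -g -- umount -R /mnt/a
perf report --sort comm          # one row of samples per process name
# or, in a second terminal while repeating the test:
pidstat -u 1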

The other processes are probably doing some processing in response to each mount event.

E.g. systemd seems to be responsible for using 0.13s of one of my four cpus, during a run that took 0.4s:

# systemctl set-property init.scope CPUAccounting=yes
# systemctl show --property CPUUsageNSec init.scope; time umount -R /mnt/a ; systemctl show --property CPUUsageNSec init.scope
CPUUsageNSec=2403124481

real    0m0.408s
user    0m0.015s
sys 0m0.020s
CPUUsageNSec=2534058385

# echo $(( 2534058385 - 2403124481 ))
130933904
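
Converting that to seconds, just to spell out the arithmetic:

# echo "scale=2; 130933904 / 10^9" | bc
.13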

But this does not seem to be the right explanation, because the same delay happens when running in a private mount namespace. In this case, perf record -a does not show the other processes at all, only the umount process (and perf itself).

# unshare -m
# time mount --rbind / /mnt/a

real    0m0.005s
user    0m0.003s
sys 0m0.002s
# time mount --make-rprivate /mnt/a

real    0m0.005s
user    0m0.003s
sys 0m0.002s
# systemctl show --property CPUUsageNSec init.scope; time umount -R /mnt/a ; systemctl show --property CPUUsageNSec init.scope
CPUUsageNSec=3637792026

real    0m0.381s
user    0m0.026s
sys 0m0.018s
CPUUsageNSec=3645973005
# echo $((3645973005-3637792026))
8180979

The cpu doesn't seem to be the bottleneck in this case. I have 4 cpu cores capable of running at 2.3GHz, but perf stat -a shows less than 5% cpu usage overall. (Ignore "CPUs utilized"; I think that always shows the full value when -a is used.)

# time perf stat -a umount -R /mnt/a

 Performance counter stats for 'system wide':

       2079.333650      cpu-clock (msec)          #    3.998 CPUs utilized          
               635      context-switches          #    0.305 K/sec                  
                23      cpu-migrations            #    0.011 K/sec                  
               333      page-faults               #    0.160 K/sec                  
       198,278,822      cycles                    #    0.095 GHz                    
       138,734,277      instructions              #    0.70  insn per cycle         
        31,401,067      branches                  #   15.102 M/sec                  
           934,327      branch-misses             #    2.98% of all branches        

       0.520083596 seconds time elapsed


real    0m0.543s
user    0m0.038s
sys 0m0.043s
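
(Roughly where a sub-5% figure comes from; my own back-of-envelope, dividing the cycle count by four cores at a nominal 2.3GHz over the 0.52s elapsed:)

# echo "scale=3; 198278822 / (4 * 2.3 * 10^9 * 0.52)" | bc
.041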

However, there is still some process responding to this event: the umount still triggers 78 lines of messages in the system log.

Feb 07 10:34:26 alan-laptop systemd[1]: proc-sys-fs-binfmt_misc.automount: Got automount request for /proc/sys/fs/binfmt_misc, triggered by 6040 (umount)
Feb 07 10:34:26 alan-laptop systemd[1]: proc-sys-fs-binfmt_misc.automount: Automount point already active?
Feb 07 10:34:26 alan-laptop systemd[1]: proc-sys-fs-binfmt_misc.automount: Got automount request for /proc/sys/fs/binfmt_misc, triggered by 6040 (umount)
Feb 07 10:34:26 alan-laptop systemd[1]: proc-sys-fs-binfmt_misc.automount: Automount point already active?
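
(I counted those with something along these lines; the relative --since syntax is per systemd.time:)

# journalctl --no-pager --since "1 minute ago" | wc -l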

Note that findmnt shows I avoided creating any horrific recursive propagation; for example, here it is run after the --make-rprivate step:

findmnt -o TARGET,PROPAGATION
TARGET                                          PROPAGATION
/                                               shared
├─/sys                                          shared
│ ├─/sys/kernel/security                        shared
│ ├─/sys/fs/cgroup                              shared
│ │ ├─/sys/fs/cgroup/unified                    shared
│ │ ├─/sys/fs/cgroup/systemd                    shared
│ │ ├─/sys/fs/cgroup/net_cls,net_prio           shared
│ │ ├─/sys/fs/cgroup/cpu,cpuacct                shared
│ │ ├─/sys/fs/cgroup/devices                    shared
│ │ ├─/sys/fs/cgroup/freezer                    shared
│ │ ├─/sys/fs/cgroup/perf_event                 shared
│ │ ├─/sys/fs/cgroup/hugetlb                    shared
│ │ ├─/sys/fs/cgroup/memory                     shared
│ │ ├─/sys/fs/cgroup/blkio                      shared
│ │ ├─/sys/fs/cgroup/cpuset                     shared
│ │ └─/sys/fs/cgroup/pids                       shared
│ ├─/sys/fs/pstore                              shared
│ ├─/sys/fs/selinux                             shared
│ ├─/sys/kernel/debug                           shared
│ └─/sys/kernel/config                          shared
├─/proc                                         shared
│ └─/proc/sys/fs/binfmt_misc                    shared
├─/dev                                          shared
│ ├─/dev/shm                                    shared
│ ├─/dev/pts                                    shared
│ ├─/dev/mqueue                                 shared
│ └─/dev/hugepages                              shared
├─/run                                          shared
│ ├─/run/user/1000                              shared
│ └─/run/user/42                                shared
├─/usr                                          shared
├─/tmp                                          shared
├─/boot                                         shared
└─/mnt/a                                        private
  └─/mnt/a                                      private
    ├─/mnt/a/usr                                private
    ├─/mnt/a/sys                                private
    │ ├─/mnt/a/sys/kernel/security              private
    │ ├─/mnt/a/sys/fs/cgroup                    private
    │ │ ├─/mnt/a/sys/fs/cgroup/unified          private
    │ │ ├─/mnt/a/sys/fs/cgroup/systemd          private
    │ │ ├─/mnt/a/sys/fs/cgroup/net_cls,net_prio private
    │ │ ├─/mnt/a/sys/fs/cgroup/cpu,cpuacct      private
    │ │ ├─/mnt/a/sys/fs/cgroup/devices          private
    │ │ ├─/mnt/a/sys/fs/cgroup/freezer          private
    │ │ ├─/mnt/a/sys/fs/cgroup/perf_event       private
    │ │ ├─/mnt/a/sys/fs/cgroup/hugetlb          private
    │ │ ├─/mnt/a/sys/fs/cgroup/memory           private
    │ │ ├─/mnt/a/sys/fs/cgroup/blkio            private
    │ │ ├─/mnt/a/sys/fs/cgroup/cpuset           private
    │ │ └─/mnt/a/sys/fs/cgroup/pids             private
    │ ├─/mnt/a/sys/fs/pstore                    private
    │ ├─/mnt/a/sys/kernel/config                private
    │ ├─/mnt/a/sys/fs/selinux                   private
    │ └─/mnt/a/sys/kernel/debug                 private
    ├─/mnt/a/dev                                private
    │ ├─/mnt/a/dev/shm                          private
    │ ├─/mnt/a/dev/pts                          private
    │ ├─/mnt/a/dev/mqueue                       private
    │ └─/mnt/a/dev/hugepages                    private
    ├─/mnt/a/run                                private
    │ ├─/mnt/a/run/user/1000                    private
    │ └─/mnt/a/run/user/42                      private
    ├─/mnt/a/proc                               private
    │ └─/mnt/a/proc/sys/fs/binfmt_misc          private
    ├─/mnt/a/tmp                                private
    ├─/mnt/a/boot                               private
    └─/mnt/a/mnt/a                              private

Best Answer

So you think umount spends time waiting for something (as it spends very little cpu time in either user or sys). Let's find out why it waits...

# perf record -g -e 'sched:*' umount -R /mnt/a

perf record shows us hitting several scheduler tracepoints; it turned out the revealing one is sched:sched_switch.

Samples: 21  of event 'sched:sched_switch', Event count (approx.): 21
  Children      Self  Trace output
-  100.00%   100.00%  umount:1888 [120] D ==> swapper/3:0 [120]
     0
     __umount2
     entry_SYSCALL_64_fastpath
     sys_umount
     do_umount
     namespace_unlock
     synchronize_sched
     __wait_rcu_gp
     wait_for_completion
     schedule_timeout
     schedule
     __schedule

__wait_rcu_gp() refers to an RCU grace period. namespace_unlock() in fs/namespace.c performs some form of global synchronization, which includes a call to synchronize_rcu(): it waits until all "currently executing RCU read-side critical sections have completed". "RCU grace periods extend for multiple milliseconds... this situation is a major reason for the rule of thumb that RCU be used in read-mostly situations". I suppose mount namespaces are considered to be "read-mostly".

It looks like these "multiple milliseconds" account for the average wait of about 5 milliseconds in each of the 34 calls to umount2().
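
Spelling that average out from the strace -cw figures above (180113 microseconds across 34 calls):

# echo "180113 / 34" | bc
5297

That is about 5.3ms per umount2() call, matching the usecs/call column in the question.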
