Linux – `umount -R` on bind mounts takes a non-negligible amount of time, why?

bind-mount, linux, mount, namespace, performance

Why does umount -R take 0.2s to unmount this subtree of bind mounts? It only took 0.02s to mount the subtree, and 0.00s to change its propagation flags.

(I was looking at cloning the current set of mounts into a subdirectory, changing them, and then switching to them using pivot_root. But the delay I'm observing would not really be acceptable for that purpose.)
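
For context, a rough sketch of the kind of workflow I have in mind (not a polished script; error handling and the finer pivot_root restrictions are glossed over, and /mnt/a is just the scratch directory used below):

unshare -m sh -c '
  mount --rbind / /mnt/a          # clone the current mount tree under /mnt/a
  mount --make-rprivate /mnt/a    # make sure nothing propagates back (unshare -m may already default to private)
  # ... modify the mounts under /mnt/a here ...
  cd /mnt/a
  pivot_root . mnt/a              # switch roots; the old root ends up at /mnt/a in the new tree
  cd /
  umount -R /mnt/a                # drop the old tree - this is the step whose delay bothers me
'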

This exercise assumes / and sub-mounts are shared mounts. Linux doesn't make them so by default, but systemd does.
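
(To check, with util-linux findmnt; on this machine it reports "shared" for the root mount:)

# findmnt -o TARGET,PROPAGATION /
TARGET PROPAGATION
/      shared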

# mkdir /mnt/a
# mount --bind /mnt/a /mnt/a --make-private
# time mount --rbind / /mnt/a
0.00user 0.00system 0:00.02elapsed 9%CPU (0avgtext+0avgdata 3020maxresident)k
0inputs+0outputs (0major+135minor)pagefaults 0swaps
# time mount --make-rprivate /mnt/a
0.00user 0.00system 0:00.00elapsed 100%CPU (0avgtext+0avgdata 3184maxresident)k
0inputs+0outputs (0major+136minor)pagefaults 0swaps
# time umount -R /mnt/a
0.00user 0.00system 0:00.19elapsed 9%CPU (0avgtext+0avgdata 3392maxresident)k
0inputs+0outputs (0major+194minor)pagefaults 0swaps

Further tests

Running under strace -cw shows

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 90.44    0.180113        5297        34           umount2
 ...

So it's not super revealing, other than showing that this last operation requires 34 individual umount2() calls, whereas each of the other two operations is a single mount() call using the MS_REC (recursive) flag. As with the headline figures from time, these are wall-clock times. Plain strace -c shows system time instead (i.e. cpu time spent in the kernel), and that only totals 0.009s.
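
For comparison, the two variants side by side (each run needs the rbind re-created first):

strace -cw umount -R /mnt/a    # per-syscall wall-clock totals (the table above)
strace -c umount -R /mnt/a     # per-syscall system-cpu totals only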

It does point to something interesting, though: using umount -l /mnt/a instead reduces the total time to 0.02s. That is a single umount2() call, which detaches the subtree /mnt/a (and does any cleanup in the background).

Looking at the individual calls with strace -ttt -T -e trace=umount2 umount -R /mnt/a, the per-call times are spread fairly evenly: they range from 0.002s to 0.012s, with no clear pattern, and the distribution does not look consistent across repeated runs.
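
For reference, a rough way to summarize those per-call times, assuming the trace is saved to a file first (the file name here is just an example):

strace -ttt -T -e trace=umount2 -o /tmp/umount.trace umount -R /mnt/a
awk -F'[<>]' '/umount2\(/ { n++; s += $2; if ($2 > max) max = $2 }
              END { printf "calls=%d total=%.3fs max=%.4fs\n", n, s, max }' /tmp/umount.trace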


After running umount -R under perf record -a, perf report shows several hotspots in gsd-housekeeping, gvfs-udisks2-volume-monitor, and systemd. The umount process doesn't show up at all. This would be consistent with time showing umount spending negligible cpu time either in the kernel or in userspace.

(I'd be very interested if anyone has a more comprehensive way to aggregate cpu usage per process during a test :).
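
The rough approaches I can think of so far: aggregate perf samples per process name, or watch per-process cpu with pidstat from the sysstat package. Both only catch what their sampling happens to see, so treat these as sketches:

perf record -a -g -- umount -R /mnt/a
perf report --sort comm          # one row of samples per process name
# or, in a second terminal while repeating the test:
pidstat -u 1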

The other processes are probably doing some processing in response to each mount event.

E.g. systemd seems to be responsible for using 0.13s of one of my four cpus, during a run that took 0.4s:

# systemctl set-property init.scope CPUAccounting=yes
# systemctl show --property CPUUsageNSec init.scope; time umount -R /mnt/a ; systemctl show --property CPUUsageNSec init.scope
CPUUsageNSec=2403124481

real    0m0.408s
user    0m0.015s
sys 0m0.020s
CPUUsageNSec=2534058385

# echo $(( 2534058385 - 2403124481 ))
130933904
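
Converting that to seconds, just to spell out the arithmetic:

# echo "scale=2; 130933904 / 10^9" | bc
.13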

But this does not seem to be the right explanation, because the same delay happens when running in a private mount namespace. In this case, perf record -a does not show the other processes at all, only the umount process (and perf itself).

# unshare -m
# time mount --rbind / /mnt/a

real    0m0.005s
user    0m0.003s
sys 0m0.002s
# time mount --make-rprivate /mnt/a

real    0m0.005s
user    0m0.003s
sys 0m0.002s
# systemctl show --property CPUUsageNSec init.scope; time umount -R /mnt/a ; systemctl show --property CPUUsageNSec init.scope
CPUUsageNSec=3637792026

real    0m0.381s
user    0m0.026s
sys 0m0.018s
CPUUsageNSec=3645973005
# echo $((3645973005-3637792026))
8180979

The cpu doesn't seem to be the bottleneck in this case. I have 4 cpu cores capable of running at 2.3GHz, but perf stat -a shows less than 5% cpu usage overall. (Ignore "CPUs utilized"; I think that always shows the full value when -a is used.)

# time perf stat -a umount -R /mnt/a

 Performance counter stats for 'system wide':

       2079.333650      cpu-clock (msec)          #    3.998 CPUs utilized          
               635      context-switches          #    0.305 K/sec                  
                23      cpu-migrations            #    0.011 K/sec                  
               333      page-faults               #    0.160 K/sec                  
       198,278,822      cycles                    #    0.095 GHz                    
       138,734,277      instructions              #    0.70  insn per cycle         
        31,401,067      branches                  #   15.102 M/sec                  
           934,327      branch-misses             #    2.98% of all branches        

       0.520083596 seconds time elapsed


real    0m0.543s
user    0m0.038s
sys 0m0.043s
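
(Roughly where a sub-5% figure comes from; my own back-of-envelope, dividing the cycle count by four cores at a nominal 2.3GHz over the 0.52s elapsed:)

# echo "scale=3; 198278822 / (4 * 2.3 * 10^9 * 0.52)" | bc
.041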

However, there is still some process responding to this event: the umount still triggers 78 lines of messages in the system log.

Feb 07 10:34:26 alan-laptop systemd[1]: proc-sys-fs-binfmt_misc.automount: Got automount request for /proc/sys/fs/binfmt_misc, triggered by 6040 (umount)
Feb 07 10:34:26 alan-laptop systemd[1]: proc-sys-fs-binfmt_misc.automount: Automount point already active?
Feb 07 10:34:26 alan-laptop systemd[1]: proc-sys-fs-binfmt_misc.automount: Got automount request for /proc/sys/fs/binfmt_misc, triggered by 6040 (umount)
Feb 07 10:34:26 alan-laptop systemd[1]: proc-sys-fs-binfmt_misc.automount: Automount point already active?
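
(I counted those with something along these lines; the relative --since syntax is per systemd.time:)

# journalctl --no-pager --since "1 minute ago" | wc -l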

Note that findmnt shows I avoided creating any horrific recursive propagation; for example, here it is run after the --make-rprivate step:

findmnt -o TARGET,PROPAGATION
TARGET                                          PROPAGATION
/                                               shared
├─/sys                                          shared
│ ├─/sys/kernel/security                        shared
│ ├─/sys/fs/cgroup                              shared
│ │ ├─/sys/fs/cgroup/unified                    shared
│ │ ├─/sys/fs/cgroup/systemd                    shared
│ │ ├─/sys/fs/cgroup/net_cls,net_prio           shared
│ │ ├─/sys/fs/cgroup/cpu,cpuacct                shared
│ │ ├─/sys/fs/cgroup/devices                    shared
│ │ ├─/sys/fs/cgroup/freezer                    shared
│ │ ├─/sys/fs/cgroup/perf_event                 shared
│ │ ├─/sys/fs/cgroup/hugetlb                    shared
│ │ ├─/sys/fs/cgroup/memory                     shared
│ │ ├─/sys/fs/cgroup/blkio                      shared
│ │ ├─/sys/fs/cgroup/cpuset                     shared
│ │ └─/sys/fs/cgroup/pids                       shared
│ ├─/sys/fs/pstore                              shared
│ ├─/sys/fs/selinux                             shared
│ ├─/sys/kernel/debug                           shared
│ └─/sys/kernel/config                          shared
├─/proc                                         shared
│ └─/proc/sys/fs/binfmt_misc                    shared
├─/dev                                          shared
│ ├─/dev/shm                                    shared
│ ├─/dev/pts                                    shared
│ ├─/dev/mqueue                                 shared
│ └─/dev/hugepages                              shared
├─/run                                          shared
│ ├─/run/user/1000                              shared
│ └─/run/user/42                                shared
├─/usr                                          shared
├─/tmp                                          shared
├─/boot                                         shared
└─/mnt/a                                        private
  └─/mnt/a                                      private
    ├─/mnt/a/usr                                private
    ├─/mnt/a/sys                                private
    │ ├─/mnt/a/sys/kernel/security              private
    │ ├─/mnt/a/sys/fs/cgroup                    private
    │ │ ├─/mnt/a/sys/fs/cgroup/unified          private
    │ │ ├─/mnt/a/sys/fs/cgroup/systemd          private
    │ │ ├─/mnt/a/sys/fs/cgroup/net_cls,net_prio private
    │ │ ├─/mnt/a/sys/fs/cgroup/cpu,cpuacct      private
    │ │ ├─/mnt/a/sys/fs/cgroup/devices          private
    │ │ ├─/mnt/a/sys/fs/cgroup/freezer          private
    │ │ ├─/mnt/a/sys/fs/cgroup/perf_event       private
    │ │ ├─/mnt/a/sys/fs/cgroup/hugetlb          private
    │ │ ├─/mnt/a/sys/fs/cgroup/memory           private
    │ │ ├─/mnt/a/sys/fs/cgroup/blkio            private
    │ │ ├─/mnt/a/sys/fs/cgroup/cpuset           private
    │ │ └─/mnt/a/sys/fs/cgroup/pids             private
    │ ├─/mnt/a/sys/fs/pstore                    private
    │ ├─/mnt/a/sys/kernel/config                private
    │ ├─/mnt/a/sys/fs/selinux                   private
    │ └─/mnt/a/sys/kernel/debug                 private
    ├─/mnt/a/dev                                private
    │ ├─/mnt/a/dev/shm                          private
    │ ├─/mnt/a/dev/pts                          private
    │ ├─/mnt/a/dev/mqueue                       private
    │ └─/mnt/a/dev/hugepages                    private
    ├─/mnt/a/run                                private
    │ ├─/mnt/a/run/user/1000                    private
    │ └─/mnt/a/run/user/42                      private
    ├─/mnt/a/proc                               private
    │ └─/mnt/a/proc/sys/fs/binfmt_misc          private
    ├─/mnt/a/tmp                                private
    ├─/mnt/a/boot                               private
    └─/mnt/a/mnt/a                              private

Best Answer

So you think umount spends time waiting for something (as it spends very little cpu time in either user or sys). Let's find out why it waits...

# perf record -g -e 'sched:*' umount -R /mnt/a

perf record shows us hitting several scheduler tracepoints; it turned out the revealing one is sched:sched_switch.

Samples: 21  of event 'sched:sched_switch', Event count (approx.): 21
  Children      Self  Trace output
-  100.00%   100.00%  umount:1888 [120] D ==> swapper/3:0 [120]
     0
     __umount2
     entry_SYSCALL_64_fastpath
     sys_umount
     do_umount
     namespace_unlock
     synchronize_sched
     __wait_rcu_gp
     wait_for_completion
     schedule_timeout
     schedule
     __schedule

__wait_rcu_gp() refers to an RCU grace period. namespace_unlock() in fs/namespace.c performs some form of global synchronization, which includes a call to synchronize_rcu(): it waits until all "currently executing RCU read-side critical sections have completed". "RCU grace periods extend for multiple milliseconds... this situation is a major reason for the rule of thumb that RCU be used in read-mostly situations". I suppose mount namespaces are considered to be "read-mostly".

It looks like these "multiple milliseconds" account for the average wait of about 5 milliseconds in each of the 34 calls to umount2().
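
Spelling that average out from the strace -cw figures above (180113 microseconds across 34 calls):

# echo "180113 / 34" | bc
5297

That is about 5.3ms per umount2() call, matching the usecs/call column in the question.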
