Linux Namespace – Why Bind Mount is Visible Outside Its Mount Namespace

bind-mount linux namespace

I'm trying to understand how Linux mount namespaces work. As an experiment, I opened two terminals and ran the following:

Terminal 1

root@goliath:~# mkdir a b
root@goliath:~# touch a/foo.txt
root@goliath:~# unshare --mount -- /bin/bash
root@goliath:~# mount --bind a b
root@goliath:~# ls b
foo.txt

Terminal 2

root@goliath:~# ls b
foo.txt

How come the mount is visible in Terminal 2? Since Terminal 2 is not part of the new mount namespace, I expected the directory to appear empty there. I also tried passing -o shared=no and using the --make-private option with mount, but I got the same result.

What am I missing and how can I make it actually private?

Best Answer

If you are on a systemd-based distribution with a util-linux version older than 2.27, you will see this unintuitive behavior. This is because a mount namespace created with CLONE_NEWNS inherits the propagation flags (such as shared) of the mounts it copies. The kernel default for these is private, but systemd changes it to shared. As of util-linux 2.27, the unshare command sets the propagation to private by default, which is more intuitive.
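To see which propagation state your mounts are in, you can read the optional fields of /proc/self/mountinfo (tags such as shared:N or master:N that appear before the "-" separator). The following is a minimal sketch; the function name mount_propagation is made up for illustration, and it reports a mount with no tags as private:

```shell
# Print "<mount point> <propagation tags>" for each line of a mountinfo-format file.
# Fields 7 and onward, up to the "-" separator, are the optional propagation tags.
# Defaults to the current process's mount table when no file is given.
mount_propagation() {
    awk '{
        tags = ""
        for (i = 7; i <= NF && $i != "-"; i++)
            tags = tags (tags ? "," : "") $i
        print $5, (tags ? tags : "private")
    }' "${1:-/proc/self/mountinfo}"
}
```

On a systemd system you would expect `mount_propagation | head -n 5` to show shared:N tags on most mounts. `findmnt -o TARGET,PROPAGATION` reports the same information if your util-linux provides that column.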

Solution

If you are on a systemd system with util-linux prior to version 2.27, you must remount the root filesystem after running the unshare command:

# unshare --mount -- /bin/bash
# mount --make-private -o remount /

If you are on a systemd system with util-linux version 2.27 or later, it should work as expected in the example you gave in your question, verbatim, without the need to remount. If not, pass --propagation private to the unshare command to force the propagation of the mount namespace to be private.
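The experiment from the question can be repeated in a scratch directory to check the result. This is only a sketch under stated assumptions: it must run as root, it needs util-linux 2.27 or later for --propagation, and the function name private_bind_demo and the temporary directory layout are made up for illustration.

```shell
# Bind-mount a onto b inside a private mount namespace, then check that
# b still looks empty from the original namespace.
private_bind_demo() {
    work=$(mktemp -d) || return 1
    mkdir "$work/a" "$work/b"
    touch "$work/a/foo.txt"

    # Inside the new namespace: perform the bind mount and list b.
    inside=$(unshare --mount --propagation private -- \
        sh -c 'mount --bind "$1/a" "$1/b" && ls "$1/b"' sh "$work") || {
        rm -rf "$work"
        return 1
    }

    # Back in the original namespace: b should have no entries.
    outside=$(ls "$work/b")

    echo "inside: ${inside:-<empty>}"
    echo "outside: ${outside:-<empty>}"
    rm -rf "$work"
}
```

If the namespace is private, the first line should list foo.txt and the second should report an empty directory. The mount disappears by itself when the shell inside the namespace exits.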

Related Solutions

Linux – `umount -R` on bind mounts takes a non-negligible amount of time, why

So you think umount spends time waiting for something (as it spends very little cpu time in either user or sys). Let's find out why it waits...

# perf trace -g -e sched:* umount2 -R /mnt/a

The perf output shows several scheduler tracepoints being hit; the revealing one turns out to be sched:sched_switch.

Samples: 21  of event 'sched:sched_switch', Event count (approx.): 21
  Children      Self  Trace output
-  100.00%   100.00%  umount:1888 [120] D ==> swapper/3:0 [120]
     0
     __umount2
     entry_SYSCALL_64_fastpath
     sys_umount
     do_umount
     namespace_unlock
     synchronize_sched
     __wait_rcu_gp
     wait_for_completion
     schedule_timeout
     schedule
     __schedule
     __schedule

__wait_rcu_gp() refers to an RCU grace period. namespace_unlock() in fs/namespace.c is some form of global synchronization, which includes synchronize_rcu(). It waits until all "currently executing RCU read-side critical sections have completed". "RCU grace periods extend for multiple milliseconds... this situation is a major reason for the rule of thumb that RCU be used in read-mostly situations". I suppose that mount namespaces are considered to be "read-mostly".

It looks like these "few milliseconds" account for the average wait of 5 milliseconds in each of the 34 calls to umount2().

Linux – why do Linux bind mounts disappear if the mount point’s inode changes

My first guess was that this is mount propagation: Linux does not enable it by default, but systemd does, and if you don't want mounts and unmounts to propagate to the new namespace, you can e.g. run mount --make-rprivate / inside it. Narrator: this is not mount propagation.

Why is the mount removed on any inode change? Is it just an implementation detail of the mount system, where the mount point is identified by a dentry rather than a simple path?

I would say that the only difference you can expect between rm b; mv c b and mv c b is that, with the latter, it is not possible to observe b as non-existent at any point. I would describe this as a feature which has been deliberately engineered or maintained... I'm not sure to what extent this is true of the historical multi-user Unix system, but it certainly came to be relied upon e.g. to support software updates on a running system.

I... can think of exactly one other specific feature which has been implemented for what you call "inode change" - this was done begrudgingly and is filesystem-specific.
