Linux – why do Linux bind mounts disappear if the mount point’s inode changes

Tags: bind-mount, inode, linux, mount

In summary: if you bind mount a file /tmp/a on top of /tmp/b in a new mount namespace, but then the inode of /tmp/b changes in the parent namespace, the bind mount disappears in the child namespace. I'm trying to understand why.

mount(8) doesn't expose the ability to bind mount individual files (just directories), so reproducing this requires an additional executable that can issue the necessary mount(2) syscall. Here's a simple example (referred to as bmount below):

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>

/* Usage: bmount SOURCE TARGET (bind mounts SOURCE on top of TARGET). */
int main(int argc, char *argv[]) {
    if (argc != 3) {
        fprintf(stderr, "requires exactly 2 args\n");
        return 1;
    }

    /* With MS_BIND the filesystem type argument is ignored. */
    if (mount(argv[1], argv[2], "none", MS_BIND, NULL) != 0) {
        fprintf(stderr, "mount error (%d): %s\n", errno, strerror(errno));
        return 1;
    }
    return 0;
}

Set up the test case:

# echo a > /tmp/a; echo b > /tmp/b; echo c > /tmp/c;
# ls -ldi /tmp/a /tmp/b /tmp/c
11403315 -rw-r--r-- 1 root root 2 Jan 19 13:34 /tmp/a
11403422 -rw-r--r-- 1 root root 2 Jan 19 13:34 /tmp/b
11403452 -rw-r--r-- 1 root root 2 Jan 19 13:34 /tmp/c

Now, in a separate shell:

# unshare -m /bin/bash
# bmount /tmp/a /tmp/b
# ls -ldi /tmp/a /tmp/b /tmp/c
11403315 -rw-r--r-- 1 root root 2 Jan 19 13:34 /tmp/a
11403315 -rw-r--r-- 1 root root 2 Jan 19 13:34 /tmp/b
11403452 -rw-r--r-- 1 root root 2 Jan 19 13:34 /tmp/c
# cat /tmp/b
a
# grep "\/tmp\/" /proc/self/mounts
[redacted] /tmp/b ext4 rw,relatime,errors=remount-ro,data=ordered 0 0

In the original shell:

# mv /tmp/c /tmp/b
# ls -ldi /tmp/a /tmp/b /tmp/c
ls: cannot access '/tmp/c': No such file or directory
11403315 -rw-r--r-- 1 root root 2 Jan 19 13:34 /tmp/a
11403452 -rw-r--r-- 1 root root 2 Jan 19 13:34 /tmp/b

In the unshare shell:

# ls -ldi /tmp/a /tmp/b /tmp/c
ls: cannot access '/tmp/c': No such file or directory
11403315 -rw-r--r-- 1 root root 2 Jan 19 13:34 /tmp/a
11403452 -rw-r--r-- 1 root root 2 Jan 19 13:34 /tmp/b
# cat /tmp/b
c
# grep "\/tmp\/" /proc/self/mounts
#

The bind mount has silently disappeared, and the underlying filesystem's /tmp/b file is now visible inside the namespace.
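To check for this from inside the namespace, an exact match on the mount point field may be more reliable than the substring grep above. A minimal sketch, assuming the path contains no whitespace or backslashes (the kernel escapes those in /proc/self/mounts); the is_mountpoint helper name is made up for illustration:

```shell
# is_mountpoint PATH [MOUNTS_FILE]
# Succeeds if PATH appears as a mount point (field 2) in the mounts table.
# MOUNTS_FILE defaults to /proc/self/mounts, i.e. the caller's namespace.
is_mountpoint() {
    P="$1" awk '$2 == ENVIRON["P"] { found = 1 } END { exit !found }' \
        "${2:-/proc/self/mounts}"
}

if is_mountpoint /tmp/b; then
    echo "/tmp/b is still a mount point"
else
    echo "/tmp/b is not a mount point (never mounted, or mount removed)"
fi
```

Run in the unshare shell before and after the mv, this should flip from the first message to the second.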

I found an LWN.net article that describes a change in semantics here: before 2013, the rename(2) issued by mv on the mount point would fail with EBUSY; the behavior was changed so that the rename succeeds and the mount is then removed. The relevant kernel commit appears to be 8ed936b5671.

The questions I have are:

Why is the mount removed on any inode change? Is it just an implementation detail of the mount system, where the mount point is identified by a dentry rather than a simple path?
Is there a way to make bind mounts that are less "brittle" in the sense that they can't be overridden or removed by filesystem operations outside their namespace?

One case where this is relevant in practice is ip-netns(8); ip netns exec works by bind mounting /etc/netns/${NAMESPACE}/resolv.conf on top of /etc/resolv.conf. If the inode of /etc/resolv.conf is altered by resolvconf(8) or systemd-resolved, the updated /etc/resolv.conf will be visible to the process running inside the namespace.

Best Answer

(Retracted) This is mount propagation. Linux does not enable it by default, but systemd does. If you don't want mounts and unmounts to propagate to the new namespace, you can e.g. run mount --make-rprivate / inside it.

Narrator: this is not mount propagation.

Why is the mount removed on any inode change? Is it just an implementation detail of the mount system, where the mount point is identified by a dentry rather than a simple path?

I would say that the only difference you can expect between rm b; mv c b and mv c b is that with the latter it is not possible to observe b as non-existent at any point. I would describe this as a feature which has been deliberately engineered or maintained... I'm not sure to what extent this is true of the historical multi-user Unix system, but it certainly came to be relied upon, e.g. to support software updates on a running system.
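The effect of mv c b on the inode can be checked without any mounts involved. A minimal sketch, assuming a scratch directory from mktemp and the same ls -di inode column used in the question; the in-place overwrite is included only for contrast:

```shell
# Print the inode number of a path (first column of ls -di).
inode() { ls -di -- "$1" | awk '{ print $1 }'; }

dir=$(mktemp -d)
echo b > "$dir/b"
echo c > "$dir/c"

old_b=$(inode "$dir/b")
c_ino=$(inode "$dir/c")

# In-place overwrite: same directory entry, same inode, new contents.
echo b2 > "$dir/b"
inplace_b=$(inode "$dir/b")

# Rename over the target: the name "b" now refers to the inode that was "c".
mv "$dir/c" "$dir/b"
new_b=$(inode "$dir/b")

echo "b before: $old_b, after overwrite: $inplace_b, after mv: $new_b (c was: $c_ino)"
rm -rf "$dir"
```

The overwrite keeps b's inode, while the rename gives the name b the inode that used to belong to c. That replacement of the directory entry is the "inode change" from the question.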

I... can think of exactly one other specific feature which has been implemented for what you call "inode change" - this was done begrudgingly and is filesystem-specific.

Related Solutions

Linux – How to List Only Bind Mounts

Bind mounts are not a filesystem type, nor a parameter of a mounted filesystem; they're parameters of a mount operation. As far as I know, the following sequences of commands lead to essentially identical system states as far as the kernel is concerned:

mount /dev/foo /mnt/one; mount --bind /mnt/one /mnt/two
mount /dev/foo /mnt/two; mount --bind /mnt/two /mnt/one

So the only way to remember which mounts were bind mounts is the log of mount commands left in /etc/mtab. A bind mount operation is indicated by the bind mount option (which causes the filesystem type to be ignored). But mount has no option to list only filesystems mounted with a particular set of options. Therefore you need to do your own filtering.

mount | grep -E '[,(]bind[,)]'
</etc/mtab awk '$4 ~ /(^|,)bind(,|$)/'

Note that /etc/mtab is only useful here if it's a text file maintained by mount. Some distributions set up /etc/mtab as a symbolic link to /proc/mounts instead; /proc/mounts is mostly equivalent to /etc/mtab but does have a few differences, one of which is not tracking bind mounts.

One piece of information that is retained by the kernel, but not shown in /proc/mounts, is when a mount point only shows a part of the directory tree on the mounted filesystem. In practice this mostly happens with bind mounts:

mount --bind /mnt/one/sub /mnt/partial

In /proc/mounts, the entries for /mnt/one and /mnt/partial have the same device, the same filesystem type and the same options. The information that /mnt/partial only shows the part of the filesystem that's rooted at /sub is visible in the per-process mount point information in /proc/$pid/mountinfo (column 4). Entries there look like this:

12 34 56:78 / /mnt/one rw,relatime - ext3 /dev/foo rw,errors=remount-ro,data=ordered
12 34 56:78 /sub /mnt/partial rw,relatime - ext3 /dev/foo rw,errors=remount-ro,data=ordered
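As a rough filter, the root column can be used to pick out mounts that expose only a subtree. This is a sketch under the column layout described above; the partial_mounts name is made up here, it will miss bind mounts of a filesystem's root, and it may also report non-bind cases (btrfs subvolumes, for instance).

```shell
# partial_mounts [MOUNTINFO_FILE]
# Prints "MOUNTPOINT (root: ROOT)" for every entry whose root (column 4)
# is not "/". Defaults to the caller's own /proc/self/mountinfo;
# pass "-" to read from standard input.
partial_mounts() {
    awk '$4 != "/" { print $5 " (root: " $4 ")" }' \
        "${1:-/proc/self/mountinfo}"
}

# Fed with the two sample entries from above:
partial_mounts - <<'EOF'
12 34 56:78 / /mnt/one rw,relatime - ext3 /dev/foo rw,errors=remount-ro,data=ordered
12 34 56:78 /sub /mnt/partial rw,relatime - ext3 /dev/foo rw,errors=remount-ro,data=ordered
EOF
# prints: /mnt/partial (root: /sub)
```

Calling partial_mounts with no argument applies the same filter to the current process's mount namespace.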

Linux – Why doesn’t mount respect the read only option for bind mounts

Bind mount is just... well... a bind mount. I.e. it's not a new mount. It just "links"/"exposes"/"considers" a subdirectory as a new mount point. As such it cannot alter the mount parameters. That's why you're getting complaints:

# mount /mnt/1/lala /mnt/2 -o bind,ro
mount: warning: /mnt/2 seems to be mounted read-write.

But as you said a normal bind mount works:

# mount /mnt/1/lala /mnt/2 -o bind

And then a ro remount also works:

# mount /mnt/1/lala /mnt/2 -o bind,remount,ro

However, what happens is that you're changing the whole mount and not just this bind mount. If you take a look at /proc/mounts you'll see that both the bind mount and the original mount change to read-only:

/dev/loop0 /mnt/1 ext2 ro,relatime,errors=continue,user_xattr,acl 0 0
/dev/loop0 /mnt/2 ext2 ro,relatime,errors=continue,user_xattr,acl 0 0

So what you're doing is like changing the initial mount to a read-only mount and then doing a bind mount which will of course be read-only.

UPDATE 2016-07-20:

The following is true for 4.5 kernels, but not for 4.3 kernels (this is wrong; see update #2 below):

The kernel has two flags that control read-only:

MS_RDONLY: indicates whether the mounted filesystem is read-only
MNT_READONLY: indicates whether the "user" wants this mount to be read-only

On a 4.5 kernel, doing a mount -o bind,ro will actually do the trick. For example, this:

# mkdir /tmp/test
# mkdir /tmp/test/a /tmp/test/b
# mount -t tmpfs none /tmp/test/a
# mkdir /tmp/test/a/d
# mount -o bind,ro /tmp/test/a/d /tmp/test/b

will create a read-only bind mount of /tmp/test/a/d to /tmp/test/b, which will be visible in /proc/mounts as:

none /tmp/test/a tmpfs rw,relatime 0 0
none /tmp/test/b tmpfs ro,relatime 0 0

A more detailed view is visible in /proc/self/mountinfo, which takes into consideration the user view (namespace). The relevant lines will be these:

363 74 0:49 / /tmp/test/a rw,relatime shared:273 - tmpfs none rw
368 74 0:49 /d /tmp/test/b ro,relatime shared:273 - tmpfs none rw

On the second line, you can see that it says both ro (MNT_READONLY, in the per-mount options) and rw (!MS_RDONLY, in the superblock options).
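The two sets of options can be read apart mechanically. A sketch, assuming the mountinfo layout shown above (per-mount options in column 6, superblock options as the third field after the lone "-" separator); the mount_ro_flags name is made up for illustration:

```shell
# mount_ro_flags [MOUNTINFO_FILE]
# For each entry, prints the mount point followed by the ro/rw state of the
# per-mount options and of the superblock options.
# Defaults to /proc/self/mountinfo; pass "-" to read from standard input.
mount_ro_flags() {
    awk '
    # Return "ro" or "rw" from a comma-separated option list, "?" if neither.
    function rw_flag(opts,    n, parts, i) {
        n = split(opts, parts, ",")
        for (i = 1; i <= n; i++)
            if (parts[i] == "ro" || parts[i] == "rw")
                return parts[i]
        return "?"
    }
    {
        # Optional fields (e.g. shared:273) end at a lone "-" separator.
        sep = 0
        for (i = 7; i <= NF; i++)
            if ($i == "-") { sep = i; break }
        if (sep == 0) next
        print $5, "mount=" rw_flag($6), "super=" rw_flag($(sep + 3))
    }' "${1:-/proc/self/mountinfo}"
}

# Fed with the two entries shown above:
mount_ro_flags - <<'EOF'
363 74 0:49 / /tmp/test/a rw,relatime shared:273 - tmpfs none rw
368 74 0:49 /d /tmp/test/b ro,relatime shared:273 - tmpfs none rw
EOF
# prints:
# /tmp/test/a mount=rw super=rw
# /tmp/test/b mount=ro super=rw
```

An entry with mount=ro and super=rw is a read-only view of a filesystem that is still writable through its other mounts.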

The end result is this:

# echo a > /tmp/test/a/d/f
# echo a > /tmp/test/b/f
-su: /tmp/test/b/f: Read-only file system

UPDATE 2016-07-20 #2:

A bit more digging into this shows that the behavior in fact depends on the version of libmount which is part of util-linux. Support for this was added with this commit and was released with version 2.27:

commit 9ac77b8a78452eab0612523d27fee52159f5016a
Author: Karel Zak 
Date:   Mon Aug 17 11:54:26 2015 +0200

    libmount: add support for "bind,ro"

    Now it's necessary t use two mount(8) calls to create a read-only
    mount:

      mount /foo /bar -o bind
      mount /bar -o remount,ro,bind

    This patch allows to specify "bind,ro" and the remount is done
    automatically by libmount by additional mount(2) syscall. It's not
    atomic of course.

    Signed-off-by: Karel Zak

which also provides the workaround. The behavior can be seen using strace on an older and a newer mount:

Old:

mount("/tmp/test/a/d", "/tmp/test/b", 0x222e240, MS_MGC_VAL|MS_RDONLY|MS_BIND, NULL) = 0 <0.000681>

New:

mount("/tmp/test/a/d", "/tmp/test/b", 0x1a8ee90, MS_MGC_VAL|MS_RDONLY|MS_BIND, NULL) = 0 <0.011492>
mount("none", "/tmp/test/b", NULL, MS_RDONLY|MS_REMOUNT|MS_BIND, NULL) = 0 <0.006281>

Conclusion:

To achieve the desired result one needs to run two commands (as @Thomas already said):

mount SRC DST -o bind
mount DST -o remount,ro,bind

Newer versions of mount (util-linux >=2.27) do this automatically when one runs

mount SRC DST -o bind,ro