Linux – why do Linux bind mounts disappear if the mount point’s inode changes

Tags: bind-mount, inode, linux, mount

In summary: if you bind mount a file /tmp/a on top of /tmp/b in a new mount namespace, but then the inode of /tmp/b changes in the parent namespace, the bind mount disappears in the child namespace. I'm trying to understand why.

mount(8) doesn't expose the ability to bind mount individual files (just directories), so reproducing this requires an additional executable that can issue the necessary mount(2) syscall. Here's a simple example (referred to as bmount below):

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>

/* bmount: bind mount SOURCE (argv[1]) on top of TARGET (argv[2]). */
int main(int argc, char *argv[]) {
    if (argc != 3) {
        fprintf(stderr, "requires exactly 2 args\n");
        return 1;
    }

    /* For MS_BIND the filesystem type ("none") and data arguments are ignored. */
    if (mount(argv[1], argv[2], "none", MS_BIND, NULL) != 0) {
        fprintf(stderr, "mount error (%d): %s\n", errno, strerror(errno));
        return 1;
    }
    return 0;
}

Set up the test case:

# echo a > /tmp/a; echo b > /tmp/b; echo c > /tmp/c;
# ls -ldi /tmp/a /tmp/b /tmp/c
11403315 -rw-r--r-- 1 root root 2 Jan 19 13:34 /tmp/a
11403422 -rw-r--r-- 1 root root 2 Jan 19 13:34 /tmp/b
11403452 -rw-r--r-- 1 root root 2 Jan 19 13:34 /tmp/c

Now, in a separate shell:

# unshare -m /bin/bash
# bmount /tmp/a /tmp/b
# ls -ldi /tmp/a /tmp/b /tmp/c
11403315 -rw-r--r-- 1 root root 2 Jan 19 13:34 /tmp/a
11403315 -rw-r--r-- 1 root root 2 Jan 19 13:34 /tmp/b
11403452 -rw-r--r-- 1 root root 2 Jan 19 13:34 /tmp/c
# cat /tmp/b
a
# grep "\/tmp\/" /proc/self/mounts
[redacted] /tmp/b ext4 rw,relatime,errors=remount-ro,data=ordered 0 0

In the original shell:

# mv /tmp/c /tmp/b
# ls -ldi /tmp/a /tmp/b /tmp/c
ls: cannot access '/tmp/c': No such file or directory
11403315 -rw-r--r-- 1 root root 2 Jan 19 13:34 /tmp/a
11403452 -rw-r--r-- 1 root root 2 Jan 19 13:34 /tmp/b

In the unshare shell:

# ls -ldi /tmp/a /tmp/b /tmp/c
ls: cannot access '/tmp/c': No such file or directory
11403315 -rw-r--r-- 1 root root 2 Jan 19 13:34 /tmp/a
11403452 -rw-r--r-- 1 root root 2 Jan 19 13:34 /tmp/b
# cat /tmp/b
c
# grep "\/tmp\/" /proc/self/mounts
#

The bind mount has silently disappeared, and the underlying filesystem's /tmp/b file is now visible inside the namespace.

I found an lwn.net article which describes a change in semantics here: before 2013, the mv command's rename(2) on the mount point would fail with EBUSY, but the behavior was changed so that the rename succeeds and the mount is then removed. The relevant kernel commit appears to be 8ed936b5671.

The questions I have are:

  1. Why is the mount removed on any inode change? Is it just an implementation detail of the mount system, where the mount point is identified by a dentry rather than a simple path?
  2. Is there a way to make bind mounts that are less "brittle" in the sense that they can't be overridden or removed by filesystem operations outside their namespace?

One case where this is relevant in practice is ip-netns(8); ip netns exec works by bind mounting /etc/netns/${NAMESPACE}/resolv.conf on top of /etc/resolv.conf. If the inode of /etc/resolv.conf is altered by resolvconf(8) or systemd-resolved, the updated /etc/resolv.conf will be visible to the process running inside the namespace.
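A process that depends on such a bind mount could check whether it is still in place, in the same spirit as the grep on /proc/self/mounts above. The following is only a minimal sketch: is_mount_point is a hypothetical helper name (not part of ip-netns(8) or any library), and it assumes a Linux /proc, an absolute path, and a path without characters that the kernel escapes in mountinfo (space, tab, newline, backslash).

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: returns 1 if `path` is listed as a mount point in
 * /proc/self/mountinfo (i.e. in the caller's mount namespace), 0 if it is
 * not listed, and -1 if the file cannot be opened.
 * Limitation: lines longer than the buffer are not handled specially.
 */
int is_mount_point(const char *path) {
    FILE *f = fopen("/proc/self/mountinfo", "r");
    char line[4096];
    int found = 0;

    if (f == NULL)
        return -1;

    while (fgets(line, sizeof line, f) != NULL) {
        char mount_point[4096];

        /* Fields: mount-id parent-id major:minor root mount-point ... */
        if (sscanf(line, "%*s %*s %*s %*s %4095s", mount_point) == 1 &&
            strcmp(mount_point, path) == 0) {
            found = 1;
            break;
        }
    }
    fclose(f);
    return found;
}

#ifdef MOUNTCHECK_DEMO_MAIN
/* Build with -DMOUNTCHECK_DEMO_MAIN to get a small standalone checker. */
int main(int argc, char *argv[]) {
    const char *path = argc > 1 ? argv[1] : "/etc/resolv.conf";
    int r = is_mount_point(path);

    if (r < 0) {
        perror("/proc/self/mountinfo");
        return 2;
    }
    printf("%s is %sa mount point\n", path, r ? "" : "not ");
    return r ? 0 : 1;
}
#endif
```

In the scenario from the question, such a check on /tmp/b inside the unshare shell would be expected to succeed before the mv and fail after it. This only detects the loss of the mount; it does not prevent it.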

Best Answer

(Retracted) This is mount propagation. Linux does not enable it by default, but systemd does. If you don't want mounts and unmounts to propagate to the new namespace, you can e.g. run mount --make-rprivate / inside it.

Narrator: this is not mount propagation.

Why is the mount removed on any inode change? Is it just an implementation detail of the mount system, where the mount point is identified by a dentry rather than a simple path?

I would say that the only difference you can expect between rm b; mv c b and mv c b is that, with the latter, it is not possible to observe b as non-existent at any point. I would describe this as a feature which has been deliberately engineered or maintained... I'm not sure to what extent this was true of historical multi-user Unix systems, but it certainly came to be relied upon, e.g. to support software updates on a running system.
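What mv c b does to the inode behind b can be checked without root privileges. The sketch below uses a hypothetical helper, rename_over_demo (not from the question or the kernel), that creates b and c in a fresh temporary directory, calls rename(2) as mv would on a single filesystem, and records the inode numbers. It assumes /tmp is writable.

```c
#define _POSIX_C_SOURCE 200809L  /* for mkdtemp() under strict -std= modes */
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Create or truncate `path` and write `text` to it. Returns 0 on success. */
static int write_file(const char *path, const char *text) {
    FILE *f = fopen(path, "w");
    int ok;

    if (f == NULL)
        return -1;
    ok = fputs(text, f) >= 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/*
 * Hypothetical demo helper: renames c over b inside a temporary directory
 * and reports the inode numbers seen before and after.
 * Returns 0 on success, -1 on any error.
 */
int rename_over_demo(ino_t *b_before, ino_t *c_before, ino_t *b_after) {
    char dir[] = "/tmp/rename-demo-XXXXXX";
    char b[64], c[64];
    struct stat st;
    int rc = -1;

    if (mkdtemp(dir) == NULL)
        return -1;
    snprintf(b, sizeof b, "%s/b", dir);
    snprintf(c, sizeof c, "%s/c", dir);

    if (write_file(b, "b\n") != 0 || write_file(c, "c\n") != 0)
        goto out;
    if (stat(b, &st) != 0)
        goto out;
    *b_before = st.st_ino;
    if (stat(c, &st) != 0)
        goto out;
    *c_before = st.st_ino;

    /* One atomic step: the name b never stops existing. */
    if (rename(c, b) != 0)
        goto out;

    if (stat(b, &st) != 0)
        goto out;
    *b_after = st.st_ino;
    rc = 0;
out:
    /* Best-effort cleanup; c is already gone if the rename succeeded. */
    unlink(b);
    unlink(c);
    rmdir(dir);
    return rc;
}

#ifdef RENAME_DEMO_MAIN
/* Build with -DRENAME_DEMO_MAIN to get a small standalone program. */
int main(void) {
    ino_t b_before, c_before, b_after;

    if (rename_over_demo(&b_before, &c_before, &b_after) != 0) {
        perror("rename_over_demo");
        return 1;
    }
    printf("b before: %llu\nc before: %llu\nb after:  %llu\n",
           (unsigned long long)b_before, (unsigned long long)c_before,
           (unsigned long long)b_after);
    return 0;
}
#endif
```

After the rename, the name b refers to the inode that used to be c, which matches the ls -ldi output in the question. The old b is unlinked in the same step, and that is the object the bind mount was attached to.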

I... can think of exactly one other specific feature which has been implemented for what you call "inode change" - this was done begrudgingly and is filesystem-specific.
