Debian – Why does child with mount namespace affect parent mounts

debian mount namespace

I am trying to understand Linux namespaces using a Debian jessie server where I have root access.

Consider this C code:

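/* /tmp/test.c */
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static char child_stack[1048576];

/* Runs in the new PID and mount namespaces; clone() expects int (*)(void *). */
static int my_child(void *arg) {
  (void)arg;
  system("/bin/bash");
  return 0;
}

int main(void) {
  /* The stack grows downward, so pass a pointer to its top. */
  pid_t child_pid = clone(my_child, child_stack + sizeof child_stack,
                          CLONE_NEWPID | CLONE_NEWNS | SIGCHLD, NULL);
  if (child_pid == -1) {
    perror("clone");
    return 1;
  }
  waitpid(child_pid, NULL, 0);
  return 0;
}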

Next, I run these commands in a single session in a single terminal:

/tmp# id
uid=0(root) gid=0(root) groups=0(root),1093867019
/tmp# echo $$
1804
/tmp# ps -eaf | head
 UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 May08 ?        00:00:01 /sbin/init
root         2     0  0 May08 ?        00:00:00 [kthreadd]
root         3     2  0 May08 ?        00:00:00 [ksoftirqd/0]
root         5     2  0 May08 ?        00:00:00 [kworker/0:0H]
root         7     2  0 May08 ?        00:00:11 [rcu_sched]
root         8     2  0 May08 ?        00:00:00 [rcu_bh]
root         9     2  0 May08 ?        00:00:00 [migration/0]
root        10     2  0 May08 ?        00:00:00 [watchdog/0]
root        11     2  0 May08 ?        00:00:00 [khelper]
/tmp# grep /proc /proc/$$/mountinfo
15 19 0:3 / /proc rw,nosuid,nodev,noexec,relatime shared:12 - proc proc rw
33 15 0:29 / /proc/sys/fs/binfmt_misc rw,relatime shared:20 - autofs systemd-1 rw,fd=22,pgrp=1,timeout=300,minproto=5,maxproto=5,direct
/tmp# gcc test.c
/tmp# ./a.out
/tmp# echo $$
2
/tmp# echo $PPID
1
/tmp# ps -eaf | head
 UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 May08 ?        00:00:01 /sbin/init
root         2     0  0 May08 ?        00:00:00 [kthreadd]
root         3     2  0 May08 ?        00:00:00 [ksoftirqd/0]
root         5     2  0 May08 ?        00:00:00 [kworker/0:0H]
root         7     2  0 May08 ?        00:00:11 [rcu_sched]
root         8     2  0 May08 ?        00:00:00 [rcu_bh]
root         9     2  0 May08 ?        00:00:00 [migration/0]
root        10     2  0 May08 ?        00:00:00 [watchdog/0]
root        11     2  0 May08 ?        00:00:00 [khelper]
/tmp# grep /proc /proc/$$/mountinfo
15 19 0:3 / /proc rw,nosuid,nodev,noexec,relatime shared:12 - proc proc rw
33 15 0:29 / /proc/sys/fs/binfmt_misc rw,relatime shared:20 - autofs systemd-1 rw,fd=22,pgrp=0,timeout=300,minproto=5,maxproto=5,direct
/tmp# mount -t proc proc /proc
/tmp# grep /proc /proc/$$/mountinfo
92 70 0:3 / /proc rw,nosuid,nodev,noexec,relatime shared:12 - proc proc rw
93 92 0:29 / /proc/sys/fs/binfmt_misc rw,relatime shared:20 - autofs systemd-1 rw,fd=22,pgrp=0,timeout=300,minproto=5,maxproto=5,direct
97 92 0:34 / /proc rw,relatime shared:27 - proc proc rw
/tmp# ps -eaf
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 07:31 pts/0    00:00:00 ./a.out
root         2     1  0 07:31 pts/0    00:00:00 /bin/bash
root        14     2  0 07:31 pts/0    00:00:00 ps -eaf
/tmp# exit
exit
/tmp# echo $$
1804
/tmp# grep /proc /proc/$$/mountinfo
grep: /proc/1804/mountinfo: No such file or directory
/tmp# ps -eaf
Error, do this: mount -t proc proc /proc
/tmp# mount -t proc proc /proc
/tmp# grep /proc /proc/$$/mountinfo
15 19 0:3 / /proc rw,nosuid,nodev,noexec,relatime shared:12 - proc proc rw
33 15 0:29 / /proc/sys/fs/binfmt_misc rw,relatime shared:20 - autofs systemd-1 rw,fd=22,pgrp=1,timeout=300,minproto=5,maxproto=5,direct
98 15 0:34 / /proc rw,relatime shared:27 - proc proc rw
69 98 0:3 / /proc rw,relatime shared:28 - proc proc rw

Why, after exiting the child process, is /proc not mounted? Shouldn't any changes to mount points made by the child process not affect the parent's mount points? This seems to contradict the answer given by Emmet to the question https://stackoverflow.com/questions/22889241/linux-understanding-the-mount-namespace-clone-clone-newns-flag.

Best Answer

When a child process is created with clone with the CLONE_NEWNS flag, the child process gets its own mount namespace. Mounting operations (mount, umount, mount --bind, etc.) in the child namespace only have an effect inside that namespace, and mount operations in the parent namespace only have an effect outside the new namespace.

Except, that is, for shared mounts. A mount can be shared, in which case operations affect all the namespaces that the mount is shared in. A typical use case for shared mounts is to make removable drives available in child namespaces such as chroots. There are more types of relationships (private mounts, unbindable mounts); for more details, see the kernel documentation.

You can check whether a mount is shared by looking at /proc/PID/mountinfo: if the line contains shared:NUMBER then the mount is shared, and the number identifies its peer group, i.e. the set of mounts that propagate mount and unmount events to one another. If the line contains no propagation tag at all (shared:, master: or unbindable), the mount is private.

On your system, /proc is shared. When you mount a new instance of proc in the child namespace, since you're mounting over the parent's /proc, that new instance is also shared, so it's visible in both the child namespace and the parent namespace. When you exit the child namespace, the second instance of /proc remains mounted, since it's shared with the still-active parent namespace.

Two things complicate your scenario: you're also creating a PID namespace, and you're using /proc both as the subject of the experiment and as a means of observation. When ps complains about /proc not being mounted, it's actually displaying a misleading error message: /proc is mounted, but it's the wrong proc (a proc for the wrong PID namespace). You can observe this with ls /proc and cat /proc/1/mountinfo. I recommend doing the experiments with a scratch filesystem instead; that would make it easier to understand what is going on.

parent# ./a.out
child# echo $$
2
child# ls /proc
This is the parent's proc, with /proc/PID in the parent PID namespace
child# ps 1
… init
child# mount -t proc proc /proc
Now /proc in the child mount namespace is for the child PID namespace
child# ps 1
… a.out
child# exit
parent#

So far it didn't matter whether /proc was private or shared, but now it does. If /proc is private, then at this point we're observing the parent's /proc, which was never affected and still shows the parent PID namespace. But if /proc is shared, then the mount command we issued earlier affected both namespaces, thus:

parent# ls /proc
acpi asound buddyinfo …
parent# ps 1
Error, do this: mount -t proc proc /proc
Actually, /proc is mounted, but it's the proc for the PID namespace that we created earlier, which now has zero running processes.
parent# grep -c ' /proc ' /proc/mounts
2
parent# umount /proc
We've unmounted the child PID namespace's /proc that was shadowing the parent namespace's /proc, so the “normal” /proc is visible again.
parent# ps 1
… init