I am trying to understand Linux namespaces using a Debian jessie server where I have root access.
Consider this C code:
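/* /tmp/test.c */
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static char child_stack[1048576];

/* Runs in the new PID and mount namespaces; clone() expects int (*)(void *). */
static int my_child(void *arg) {
    (void)arg;
    return system("/bin/bash");
}

int main(void) {
    /* The stack grows downward, so pass the top of the buffer. */
    pid_t child_pid = clone(my_child, child_stack + sizeof child_stack,
                            CLONE_NEWPID | CLONE_NEWNS | SIGCHLD, NULL);
    if (child_pid == -1) {
        perror("clone");
        return 1;
    }
    waitpid(child_pid, NULL, 0);
    return 0;
}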
Next, I run these commands in a single session in a single terminal:
/tmp# id
uid=0(root) gid=0(root) groups=0(root),1093867019
/tmp# echo $$
1804
/tmp# ps -eaf | head
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 May08 ? 00:00:01 /sbin/init
root 2 0 0 May08 ? 00:00:00 [kthreadd]
root 3 2 0 May08 ? 00:00:00 [ksoftirqd/0]
root 5 2 0 May08 ? 00:00:00 [kworker/0:0H]
root 7 2 0 May08 ? 00:00:11 [rcu_sched]
root 8 2 0 May08 ? 00:00:00 [rcu_bh]
root 9 2 0 May08 ? 00:00:00 [migration/0]
root 10 2 0 May08 ? 00:00:00 [watchdog/0]
root 11 2 0 May08 ? 00:00:00 [khelper]
/tmp# grep /proc /proc/$$/mountinfo
15 19 0:3 / /proc rw,nosuid,nodev,noexec,relatime shared:12 - proc proc rw
33 15 0:29 / /proc/sys/fs/binfmt_misc rw,relatime shared:20 - autofs systemd-1 rw,fd=22,pgrp=1,timeout=300,minproto=5,maxproto=5,direct
/tmp# gcc test.c
/tmp# ./a.out
/tmp# echo $$
2
/tmp# echo $PPID
1
/tmp# ps -eaf | head
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 May08 ? 00:00:01 /sbin/init
root 2 0 0 May08 ? 00:00:00 [kthreadd]
root 3 2 0 May08 ? 00:00:00 [ksoftirqd/0]
root 5 2 0 May08 ? 00:00:00 [kworker/0:0H]
root 7 2 0 May08 ? 00:00:11 [rcu_sched]
root 8 2 0 May08 ? 00:00:00 [rcu_bh]
root 9 2 0 May08 ? 00:00:00 [migration/0]
root 10 2 0 May08 ? 00:00:00 [watchdog/0]
root 11 2 0 May08 ? 00:00:00 [khelper]
/tmp# grep /proc /proc/$$/mountinfo
15 19 0:3 / /proc rw,nosuid,nodev,noexec,relatime shared:12 - proc proc rw
33 15 0:29 / /proc/sys/fs/binfmt_misc rw,relatime shared:20 - autofs systemd-1 rw,fd=22,pgrp=0,timeout=300,minproto=5,maxproto=5,direct
/tmp# mount -t proc proc /proc
/tmp# grep /proc /proc/$$/mountinfo
92 70 0:3 / /proc rw,nosuid,nodev,noexec,relatime shared:12 - proc proc rw
93 92 0:29 / /proc/sys/fs/binfmt_misc rw,relatime shared:20 - autofs systemd-1 rw,fd=22,pgrp=0,timeout=300,minproto=5,maxproto=5,direct
97 92 0:34 / /proc rw,relatime shared:27 - proc proc rw
/tmp# ps -eaf
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 07:31 pts/0 00:00:00 ./a.out
root 2 1 0 07:31 pts/0 00:00:00 /bin/bash
root 14 2 0 07:31 pts/0 00:00:00 ps -eaf
/tmp# exit
exit
/tmp# echo $$
1804
/tmp# grep /proc /proc/$$/mountinfo
grep: /proc/1804/mountinfo: No such file or directory
/tmp# ps -eaf
Error, do this: mount -t proc proc /proc
/tmp# mount -t proc proc /proc
/tmp# grep /proc /proc/$$/mountinfo
15 19 0:3 / /proc rw,nosuid,nodev,noexec,relatime shared:12 - proc proc rw
33 15 0:29 / /proc/sys/fs/binfmt_misc rw,relatime shared:20 - autofs systemd-1 rw,fd=22,pgrp=1,timeout=300,minproto=5,maxproto=5,direct
98 15 0:34 / /proc rw,relatime shared:27 - proc proc rw
69 98 0:3 / /proc rw,relatime shared:28 - proc proc rw
Why, after exiting the child process, is /proc not mounted? Shouldn't any changes to mount points made by the child process not affect the parent's mount points? This seems to contradict the answer given by Emmet to the question https://stackoverflow.com/questions/22889241/linux-understanding-the-mount-namespace-clone-clone-newns-flag.
Best Answer
When a child process is created with clone with the CLONE_NEWNS flag, the child process gets its own mount namespace. Mount operations (mount, umount, mount --bind, etc.) in the child namespace only have an effect inside that namespace, and mount operations in the parent namespace only have an effect outside the new namespace.

Except, that is, for shared mounts. A mount can be shared, in which case operations affect all the namespaces that the mount is shared in. A typical use case for shared mounts is to make removable drives available in child namespaces such as chroots. There are more types of relationships (private mounts, unbindable mounts); for more details, see the kernel documentation.
You can check whether a mount is shared by checking /proc/PID/mountinfo: if the line contains shared:NUMBER then the mount is shared, and the number identifies its peer group, that is, the set of mounts that propagate mount and unmount events to one another. If the line contains no such indication, the mount is private.

On your system,
/proc is shared. When you mount a new instance of proc in the child namespace, since you're mounting over the parent's /proc, that new instance is also shared, so it's visible in both the child namespace and the parent namespace. When you exit the child namespace, the second instance of /proc remains mounted, since it's shared with the still-active parent namespace.

Two things complicate your scenario: you're also creating a PID namespace, and you're using
/proc both as the subject of the experiment and as a means of observation. When ps complains about /proc not being mounted, it's actually displaying a misleading error message: the wrong proc is mounted (a proc for the wrong namespace). You can observe this with ls /proc and cat /proc/1/mountinfo. I recommend doing the experiments with a scratch filesystem; it would be easier to understand what is going on.

So far it didn't matter whether
/proc was private or shared, but now it does. If /proc is private, then at this point we're observing the parent's /proc, which was never affected and shows the PID namespace. But if /proc is shared, then the mount command we issued earlier affected both namespaces, thus: