As described in this helpful answer, the effects of -p (unshare(CLONE_NEWPID)) only come into action for the first child process forked.
In particular, man 2 unshare says about CLONE_NEWPID:
Unshare the PID namespace, so that the calling process has a new PID namespace for its children which is not shared with any previously existing process.
The calling process is not moved into the new namespace.
The first child created by the calling process will have the process ID 1 and will assume the role of init(1) in the new namespace.
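A quick way to see this from a shell is sketched below. It assumes unprivileged user namespaces are available, so that -U lets it run without root; the helper name first_child_pid is made up for this illustration. The shell started inside the new namespace should report PID 1, while the calling shell keeps its own PID.

```shell
# first_child_pid: hypothetical helper, not part of util-linux.
# Runs a shell as the first child of a new PID namespace and prints the
# PID that shell sees for itself. Prints nothing if unshare is missing
# or the kernel refuses to create the namespaces.
first_child_pid() {
    unshare -Ufp sh -c 'echo $$' 2>/dev/null
}

echo "calling shell PID: $$"
echo "PID seen by first child: $(first_child_pid)"
```

The -f flag matters here: without it, unshare would exec the command itself instead of forking, and the calling process is never moved into the new namespace.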
Patching unshare to make killing work reliably
Since asking my question, in the last couple hours I've written a patch to util-linux (pull request here) that adds a --kill-child flag to unshare.
Edit: This is now merged and released as part of util-linux v2.32.
You can use it like this:
unshare -fp --kill-child -- bash -c "watch /bin/sleep 10000 && echo hi"
and killing unshare will tear the entire process tree down as expected.
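Because the flag only exists from util-linux v2.32 onwards, a script may want to check for it before relying on it. A minimal sketch, assuming the flag is listed in the output of unshare --help (the helper name has_kill_child is made up here):

```shell
# has_kill_child: hypothetical helper. Reads help text on stdin and
# succeeds if it mentions the --kill-child option.
has_kill_child() {
    grep -q -e '--kill-child'
}

if unshare --help 2>&1 | has_kill_child; then
    echo "unshare supports --kill-child"
else
    echo "unshare lacks --kill-child (older util-linux or another implementation)"
fi
```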
Without root
You can even use it without root privileges if your kernel has user namespaces enabled (CONFIG_USER_NS=y) by passing the -U flag:
unshare -Ufp --kill-child -- bash -c "watch /bin/sleep 10000 && echo hi"
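Whether -U will work can be guessed from /proc before trying. The sketch below is only a heuristic: it assumes that /proc/self/ns/user exists when CONFIG_USER_NS is set and that /proc/sys/user/max_user_namespaces (newer kernels) is 0 when creation is disabled. Some distributions add further restrictions that it does not detect. The function name and its optional root-prefix argument are made up for illustration.

```shell
# userns_hint [PREFIX]: hypothetical helper. PREFIX is prepended to the
# /proc paths so the check can be tried against a fake tree; leave it
# empty to inspect the running system.
userns_hint() {
    uh_root=${1:-}
    if [ ! -e "$uh_root/proc/self/ns/user" ]; then
        echo "no user namespace support (CONFIG_USER_NS not set?)"
        return 1
    fi
    uh_max=$(cat "$uh_root/proc/sys/user/max_user_namespaces" 2>/dev/null || echo unknown)
    if [ "$uh_max" = "0" ]; then
        echo "user namespaces disabled (max_user_namespaces=0)"
        return 1
    fi
    echo "user namespaces appear available"
}

userns_hint || true
```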
Both the current working directory, and the root, are reset to the root filesystem of the entered mount namespace.
For example, I have tested that I can escape chroot by running nsenter -m --target $$.
(Reminder: chroot is easy to escape when you are still root. man chroot documents the well-known way of doing this.)
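To observe the reset yourself, you can print where a process's working directory and root point before and after entering the namespace. A small sketch, assuming /proc is mounted; the helper name fs_context and its optional directory argument are made up for illustration:

```shell
# fs_context [PROCDIR]: hypothetical helper. Prints the targets of the
# cwd and root links found in PROCDIR (default: /proc/self).
fs_context() {
    fc_dir=${1:-/proc/self}
    printf 'cwd=%s root=%s\n' "$(readlink "$fc_dir/cwd")" "$(readlink "$fc_dir/root")"
}

fs_context
# As root inside a chroot, something like
#   nsenter -m --target $$ readlink /proc/self/cwd /proc/self/root
# should then print "/" twice, if the behaviour described above holds.
```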
Source
https://elixir.bootlin.com/linux/latest/source/fs/namespace.c?v=4.17#L3507
static int mntns_install(struct nsproxy *nsproxy, struct ns_common *ns)
{
    struct fs_struct *fs = current->fs;
Note: current means the current task, i.e. the current thread/process. ->fs is the filesystem data of that task, which is shared between tasks that are threads within the same process. You will see below that changing the working directory is an operation on ->fs, so it affects all threads of the same process. POSIX-compatible threads like this are implemented using the CLONE_FS flag of clone().
    struct mnt_namespace *mnt_ns = to_mnt_ns(ns), *old_mnt_ns;
    struct path root;
    int err;
    ...
    /* Find the root */
    err = vfs_path_lookup(mnt_ns->root->mnt.mnt_root, &mnt_ns->root->mnt,
                          "/", LOOKUP_DOWN, &root);
Here are the lines in question:
    /* Update the pwd and root */
    set_fs_pwd(fs, &root);
    set_fs_root(fs, &root);
    ...
}
...
const struct proc_ns_operations mntns_operations = {
    .name    = "mnt",
    .type    = CLONE_NEWNS,
    .get     = mntns_get,
    .put     = mntns_put,
    .install = mntns_install,
    .owner   = mntns_owner,
};
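The note on ->fs has a counterpart that is easy to see from a shell. Threads share this data through CLONE_FS, but a process created by a plain fork() gets its own copy. A subshell is such a forked child, so a cd inside it leaves the parent's working directory alone. A minimal sketch (the function name is made up for illustration):

```shell
# cwd_after_subshell_cd: hypothetical helper. Changes directory inside a
# subshell (a forked child with its own copy of the filesystem data),
# then prints the working directory of the calling shell.
cwd_after_subshell_cd() {
    ( cd / )
    pwd
}

echo "before:            $(pwd)"
echo "after subshell cd: $(cwd_after_subshell_cd)"
```

Both lines should show the same directory. A thread doing the equivalent chdir() would change it for every thread in the process.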
Best Answer
The explanation is given in the “PID namespace” section of man nsenter:
(The manual is somewhat messed up there; I’ve cleaned the quoted section up above, and the next release of util-linux will include a fix.)
Entering a PID namespace doesn’t move the current process to that namespace, it only causes new children to be created in that namespace. As a result, the current process (the one calling setns) isn’t visible to its children in the new namespace. To avoid this, nsenter enters the new namespace, then forks, which results in a new nsenter in the new namespace, and then calls exec; as a result, the exec’ed program is in the new namespace.
See also the description of PID namespaces in man setns:
You’ll see this in action in the /proc namespace entries: /proc/.../ns has two PID entries, pid (the process’ namespace) and pid_for_children (the namespace used for new children).
(exec on its own doesn’t create new processes.)
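The two entries can be compared directly. The sketch below assumes a kernel that provides pid_for_children; the helper name pidns_state and its optional directory argument are made up for illustration. An empty pid_for_children link is treated as "new namespace pending", since the link only gains a value once the first child has been created in the namespace.

```shell
# pidns_state [NSDIR]: hypothetical helper. Compares the pid and
# pid_for_children links in NSDIR (default: /proc/self/ns).
pidns_state() {
    ps_dir=${1:-/proc/self/ns}
    if [ ! -L "$ps_dir/pid_for_children" ]; then
        echo "pid_for_children not available"
        return 0
    fi
    ps_own=$(readlink "$ps_dir/pid" 2>/dev/null || true)
    ps_kids=$(readlink "$ps_dir/pid_for_children" 2>/dev/null || true)
    if [ -z "$ps_kids" ]; then
        echo "new PID namespace pending, no child created in it yet"
    elif [ "$ps_own" = "$ps_kids" ]; then
        echo "children share namespace $ps_own"
    else
        echo "children will be created in $ps_kids (own: $ps_own)"
    fi
}

pidns_state
```

In an ordinary shell both links match. For a process that has called setns or unshare on a PID namespace, they are expected to differ.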