Linux nsenter Fork Before Exec – Why It’s Necessary for PID Namespace

Tags: containers, debian, linux, namespace, process

I assume that nsenter, which runs as a sub-process of bash, utilizes the setns system call to join an existing namespace and then executes the specified program using exec.

But, if nsenter has already called setns before exec'ing, why is the fork system call needed to ensure that child processes will also be in the entered namespace?

man namespaces:

setns(2)
      The setns(2) system call allows the calling process to join an
      existing namespace.  The namespace to join is specified via a
      file descriptor that refers to one of the /proc/[pid]/ns files
      described below.

man nsenter:

...
-F, --no-fork
      Do not fork before exec'ing the specified program.  By
      default, when entering a PID namespace, nsenter calls fork
      before calling exec so that any children will also be in the
      newly entered PID namespace.

Best Answer

The explanation is given in the “PID namespace” section of man nsenter:

Children will have a set of PID to process mappings separate from the nsenter process. nsenter will fork by default if changing the PID namespace, so that the new program and its children share the same PID namespace and are visible to each other. If --no-fork is used, the new program will be exec'ed without forking.

(The manual is somewhat messed up there; I’ve cleaned the quoted section up above, and the next release of util-linux will include a fix.)

Entering a PID namespace doesn’t move the current process to that namespace, it only causes new children to be created in that namespace. As a result, the current process (the one calling setns) isn’t visible to its children in the new namespace. To avoid this, nsenter enters the new namespace, then forks, which results in a new nsenter in the new namespace, and then calls exec; as a result, the exec’ed program is in the new namespace.
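A rough way to watch this happen without needing a target process is to use unshare in place of nsenter. For PID namespaces, unshare(CLONE_NEWPID) treats the caller the same way setns does, so a shell started by unshare --pid without --fork stays in the old PID namespace while its first child lands in the new one. This is an illustrative sketch, not part of the original answer: it assumes unprivileged user namespaces are available (for -U) and that /proc is mounted, and demo_no_fork is a made-up name.

```shell
#!/bin/sh
# Start a shell via unshare --pid WITHOUT --fork, then list the PID namespace
# of the shell itself ($$) next to that of its first child (ls, via /proc/self).
demo_no_fork() {
    # -U: create a user namespace first so that no root is needed (assumption:
    # the kernel permits unprivileged user namespaces).
    unshare -U --pid sh -c '
        ls -l "/proc/$$/ns/pid" /proc/self/ns/pid
        true   # a second command, so the shell forks ls instead of exec-ing it
    ' 2>/dev/null || echo "unshare failed (user namespaces may be disabled here)"
}

demo_no_fork
```

If it works, the two links should point at different pid:[...] identifiers: one for the shell, which never moved, and one for ls, which was created in the new namespace. Only one child is started on purpose, because that child is PID 1 of the new namespace and later forks from the shell fail once it has exited.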

See also the description of PID namespaces in man setns:

If fd refers to a PID namespace, the semantics are somewhat different from other namespace types: reassociating the calling thread with a PID namespace changes only the PID namespace that subsequently created child processes of the caller will be placed in; it does not change the PID namespace of the caller itself.

You’ll see this in action in the /proc namespace entries: /proc/.../ns has two PID entries, pid (the process’ namespace) and pid_for_children (the namespace used for new children).
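The two entries can be read with readlink. The helper below is a small sketch with a hypothetical name (show_pid_ns); it assumes /proc is mounted and a kernel new enough to provide pid_for_children (Linux 4.12 or later), and prints a placeholder when an entry cannot be read.

```shell
#!/bin/sh
# Print the PID namespace of a process and the PID namespace its future
# children will be created in. Defaults to the current shell.
show_pid_ns() {
    target="${1:-$$}"
    for entry in pid pid_for_children; do
        # readlink fails or returns nothing if the entry is missing or unset.
        link=$(readlink "/proc/$target/ns/$entry" 2>/dev/null) || link=""
        printf '%s -> %s\n' "$entry" "${link:-(unavailable)}"
    done
}

show_pid_ns
```

For an ordinary shell both lines should name the same namespace. They only diverge for a process that has called setns or unshare on a PID namespace and not been replaced by a forked child, which is the situation nsenter avoids by forking.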

(exec on its own doesn’t create new processes.)

Patching `unshare` to make killing work reliably

Since asking my question, in the last couple hours I've written a patch to util-linux (pull request here) that adds a --kill-child flag to unshare.

Edit: This is now merged and released as part of util-linux v2.32.

You can use it like this:

unshare -fp --kill-child -- bash -c "watch /bin/sleep 10000 && echo hi"

and killing unshare will tear the entire process tree down as expected.

Without `root`

You can even use it without root privileges if your kernel has user namespaces enabled (CONFIG_USER_NS=y) by passing the -U flag:

unshare -Ufp --kill-child -- bash -c "watch /bin/sleep 10000 && echo hi"

Linux Kernel – Understanding Mount Namespace Functionality

When a process enters a mount namespace, both its current working directory and its root are reset to the root filesystem of the entered mount namespace.

For example, I have tested that I can escape chroot by running nsenter -m --target $$.

(Reminder: chroot is easy to escape when you are still root. man chroot documents the well-known way of doing this).
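The reset of the working directory can be checked with the same nsenter invocation. The sketch below is only an illustration: demo_cwd_reset is a made-up name, it needs root (setns on a mount namespace requires CAP_SYS_ADMIN and CAP_SYS_CHROOT), and it gives up with a message if setns is not permitted in the current environment.

```shell
#!/bin/sh
# Change into /tmp, re-enter our own mount namespace with nsenter, and print
# the working directory that the exec'ed program ends up with.
demo_cwd_reset() {
    if [ "$(id -u)" -ne 0 ]; then
        echo "skipped: needs root (CAP_SYS_ADMIN and CAP_SYS_CHROOT)"
        return 0
    fi
    # $$ still names the main shell inside the subshell; the subshell only
    # keeps the cd from affecting the caller.
    (cd /tmp && nsenter -m --target "$$" pwd) 2>/dev/null \
        || echo "nsenter failed (setns may be blocked here)"
}

demo_cwd_reset
```

If the working directory had been preserved, pwd would report /tmp; with the reset described above it should report the root of the mount namespace instead.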

Source

https://elixir.bootlin.com/linux/latest/source/fs/namespace.c?v=4.17#L3507

static int mntns_install(struct nsproxy *nsproxy, struct ns_common *ns)
{
    struct fs_struct *fs = current->fs;

Note: current means the current task - the current thread/process.

->fs is the filesystem data of that task (its root and working directory). It is shared between tasks that are threads within the same process; you will see below that changing the working directory is an operation on ->fs.

As a result, changing the working directory affects all threads of the same process. POSIX-compatible threads get this sharing from the CLONE_FS flag of clone().

    struct mnt_namespace *mnt_ns = to_mnt_ns(ns), *old_mnt_ns;
    struct path root;
    int err;

...

    /* Find the root */
    err = vfs_path_lookup(mnt_ns->root->mnt.mnt_root, &mnt_ns->root->mnt,
                "/", LOOKUP_DOWN, &root);

Here are the lines in question:

    /* Update the pwd and root */
    set_fs_pwd(fs, &root);
    set_fs_root(fs, &root);

...

}

...

const struct proc_ns_operations mntns_operations = {
    .name       = "mnt",
    .type       = CLONE_NEWNS,
    .get        = mntns_get,
    .put        = mntns_put,
    .install    = mntns_install,
    .owner      = mntns_owner,
};
