Linux nsenter Fork Before Exec – Why It’s Necessary for PID Namespace

Tags: containers, debian, linux, namespace, process

I assume that nsenter, which runs as a sub-process of bash, utilizes the setns system call to join an existing namespace and then executes the specified program using exec.

But, if nsenter has already called setns before exec'ing, why is the fork system call needed to ensure that child processes will also be in the entered namespace?

man namespaces:

setns(2)
      The setns(2) system call allows the calling process to join an
      existing namespace.  The namespace to join is specified via a
      file descriptor that refers to one of the /proc/[pid]/ns files
      described below.

man nsenter:

...
-F, --no-fork
      Do not fork before exec'ing the specified program.  By
      default, when entering a PID namespace, nsenter calls fork
      before calling exec so that any children will also be in the
      newly entered PID namespace.

Best Answer

The explanation is given in the “PID namespace” section of man nsenter:

Children will have a set of PID to process mappings separate from the nsenter process. nsenter will fork by default if changing the PID namespace, so that the new program and its children share the same PID namespace and are visible to each other. If --no-fork is used, the new program will be exec'ed without forking.

(The manual is somewhat messed up there; I’ve cleaned the quoted section up above, and the next release of util-linux will include a fix.)

Entering a PID namespace with setns doesn’t move the calling process into that namespace; it only causes children created afterwards to be placed there. The calling process itself stays in its original PID namespace, so if nsenter called exec directly, the exec’ed program would remain outside the target namespace and would not be visible to its own children inside it. To avoid this, nsenter calls setns and then forks, which creates a new nsenter process inside the target namespace, and that child then calls exec; as a result, the exec’ed program runs in the target namespace.
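The setns, fork, exec order described above can be sketched in Python. This is only an illustration of the sequence, not nsenter’s actual source: the function name run_in_pid_namespace is made up for this example, setns is called through ctypes, and it handles the PID namespace only (nsenter typically enters other namespaces as well). Running it against a real process needs sufficient privileges, usually root.

```python
import ctypes
import os

CLONE_NEWPID = 0x20000000  # value of CLONE_NEWPID in <sched.h>


def run_in_pid_namespace(target_pid, argv):
    """Hypothetical helper: run argv inside target_pid's PID namespace.

    Mirrors the order described above: setns, then fork, then exec.
    """
    # Open the namespace file of the process whose PID namespace we want.
    fd = os.open(f"/proc/{target_pid}/ns/pid", os.O_RDONLY)
    try:
        libc = ctypes.CDLL(None, use_errno=True)
        # For a PID namespace, setns only changes where *future children*
        # of this process are created; this process itself does not move.
        if libc.setns(fd, CLONE_NEWPID) != 0:
            err = ctypes.get_errno()
            raise OSError(err, os.strerror(err))
    finally:
        os.close(fd)

    # The fork is what actually produces a process in the target namespace.
    child = os.fork()
    if child == 0:
        try:
            # exec replaces the program in the child, which is already
            # inside the target PID namespace.
            os.execvp(argv[0], argv)
        finally:
            os._exit(127)  # only reached if exec failed

    # The parent is still in its original PID namespace; it just waits.
    _, status = os.waitpid(child, 0)
    return os.waitstatus_to_exitcode(status)
```

Calling run_in_pid_namespace(1234, ["ps", "ax"]) as root, with 1234 standing in for the PID of a process in the target namespace, would be roughly the PID-namespace part of nsenter’s default behaviour. Skipping the fork and calling exec directly corresponds to --no-fork.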

See also the description of PID namespaces in man setns:

If fd refers to a PID namespace, the semantics are somewhat different from other namespace types: reassociating the calling thread with a PID namespace changes only the PID namespace that subsequently created child processes of the caller will be placed in; it does not change the PID namespace of the caller itself.

You’ll see this in action in the /proc namespace entries: /proc/.../ns has two PID entries, pid (the namespace the process itself is in) and pid_for_children (the namespace in which its new children will be created).
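A small sketch for inspecting those two entries. The helper name pid_namespace_links is made up for this example, and it assumes a Linux /proc; pid_for_children only exists on reasonably recent kernels, so a missing or unreadable entry is reported as None.

```python
import os


def pid_namespace_links(pid="self"):
    """Hypothetical helper: read the two PID namespace links of a process."""
    links = {}
    for name in ("pid", "pid_for_children"):
        try:
            # Each entry is a symlink whose target looks like "pid:[<inode>]".
            links[name] = os.readlink(f"/proc/{pid}/ns/{name}")
        except OSError:
            # Entry missing (older kernel, no /proc) or not readable.
            links[name] = None
    return links


print(pid_namespace_links())
```

For a process that has not called setns or unshare on a PID namespace, both links are expected to name the same namespace. After a successful setns on a PID namespace, only pid_for_children should change.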

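(exec on its own doesn’t create a new process; it replaces the program running in the calling process, which keeps its PID and stays in its PID namespace.)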
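That exec keeps the same process can be checked with a short sketch. The function name pid_across_exec is made up for this example; it forks a child, has the child exec a fresh Python interpreter that prints its own PID, and compares that with the PID fork returned.

```python
import os
import sys


def pid_across_exec():
    """Hypothetical helper: return (PID from fork, PID reported after exec)."""
    read_end, write_end = os.pipe()
    child = os.fork()
    if child == 0:
        try:
            os.close(read_end)
            os.dup2(write_end, 1)  # send the new program's stdout into the pipe
            # exec replaces the program image; no new process is created here.
            os.execv(sys.executable,
                     [sys.executable, "-c", "import os; print(os.getpid())"])
        finally:
            os._exit(127)  # only reached if something above failed

    os.close(write_end)
    with os.fdopen(read_end) as pipe:
        reported = int(pipe.read())
    os.waitpid(child, 0)
    return child, reported


forked_pid, pid_after_exec = pid_across_exec()
print(forked_pid == pid_after_exec)
```

The two values match because the exec’ed program is the same process that fork created, which is why a fork is needed to get a process into the newly entered PID namespace.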
