Reliable Way to Jail Child Processes Using Nsenter

Tags: namespace, process

I know that Linux namespaces, among many other things, can be leveraged to handle restricting and jailing child processes securely without any chance of their being zombied and dumped on init. But I'm fuzzy on implementation details. How might I use the tools provided by util-linux such as mount and nsenter to watch, monitor, and ensure that all processes launched are the direct namespace descendants of another process?

Best Answer

Create a PID namespace

The correct command to use here is unshare. Note that the necessary options to do this are only available from util-linux 2.23. The idea is to create a new PID namespace for the program you are running such that all its children are also created in this namespace. You can run a command in a new PID namespace simply by doing:

sudo unshare -fp some_command

To run a shell, just omit the command. This will create a process which, along with any of its children, will have a PID as usual within the parent (system) namespace. However, within the new namespace, it will have a PID of 1 along with some of the special characteristics of the init process. Perhaps the most relevant characteristic from a monitoring perspective is that if any of its descendants are orphaned, they will be re-parented to this process rather than to the real init process.

Simply doing this may be enough for most monitoring cases. As previously mentioned, the processes within the namespace all have PIDs within the parent namespace, so regular commands can be used to monitor their activity. We are also assured that if any process in the namespace becomes orphaned, it will not fall out of the process tree beneath the PID of the top-level program, meaning that it can still easily be kept track of.
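Tracking that branch of the process tree from the parent namespace can be sketched as follows. This is a minimal sketch, assuming a Linux /proc filesystem; the function name list_descendants is hypothetical and not part of util-linux. It reads the Pid and PPid fields of each /proc/<pid>/status file and prints every process whose ancestry leads back to the given PID.

```shell
#!/bin/sh
# Hypothetical helper: print every descendant PID of $1, one per line.
# Assumes Linux, where /proc/<pid>/status lists "Pid:" before "PPid:".
list_descendants() {
    for f in /proc/[0-9]*/status; do
        # Emit "pid ppid"; a process may exit mid-scan, so ignore errors.
        awk '/^Pid:/ {pid = $2} /^PPid:/ {print pid, $2; exit}' "$f" 2>/dev/null
    done | awk -v root="$1" '
        { parent[$1] = $2 }
        END {
            for (p in parent) {
                # Climb from p towards the top of the tree.
                a = parent[p]
                while ((a in parent) && a != root) a = parent[a]
                if (a == root) print p
            }
        }' | sort -n
}
```

Calling list_descendants with the PID of the unshare process, as seen in the parent namespace, would list everything running inside the new namespace. The snapshot is not atomic, so short-lived processes may be missed.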

Combine with a mount namespace

However, what we can't do is monitor the process with respect to the PID that it thinks it has. To do this, and in particular to be able to use the ps command within the new namespace, you need to mount a separate procfs filesystem for the namespace. This in turn leads to another problem, since the only location that ps accepts for procfs is /proc. One solution would be to create a chroot jail and mount the new procfs there. But this is a cumbersome approach, as at a minimum we would need to copy (or at least hard link) any binaries that we intend to use, along with any libraries they depend on, to the new root.

The solution is to also use a new mount namespace. Within this we can mount the new procfs in a way that uses the true root /proc directory, is usable within the PID namespace, and doesn't interfere with anything else. To make this process very simple, the unshare command provides the --mount-proc option:

sudo unshare -fp --mount-proc some_command

Now running ps within the combined namespaces will show only the processes within the PID namespace, and it will show the top-level process as having a PID of 1.
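Since these options depend on the util-linux version, a wrapper script might check the version before calling unshare. The following is only a sketch: the names version_ge and run_jailed are hypothetical, and it assumes that unshare --version prints the version number as the last word of its output, as util-linux builds usually do.

```shell
#!/bin/sh
# Succeed if dotted version $1 is at least $2 (compares major.minor only).
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        split(a, x, "."); split(b, y, ".")
        if (x[1] + 0 != y[1] + 0) exit !(x[1] + 0 > y[1] + 0)
        exit !(x[2] + 0 >= y[2] + 0)
    }'
}

# Hypothetical wrapper: run "$@" as PID 1 of a new PID and mount namespace.
run_jailed() {
    # Assumption: the version is the last word of `unshare --version`.
    ver=$(unshare --version 2>/dev/null | awk '{print $NF}')
    if ! version_ge "$ver" 2.23; then
        echo "unshare from util-linux 2.23 or later is required" >&2
        return 1
    fi
    # Long-option form of: sudo unshare -fp --mount-proc some_command
    sudo unshare --fork --pid --mount-proc -- "$@"
}
```

For example, run_jailed ps -e should list only the processes inside the new namespace, with ps itself running as PID 1.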

What about `nsenter`?

As the name suggests, nsenter can be used to enter a namespace that has already been created with unshare. This is useful if we want to get information only available from inside the namespace from an otherwise unrelated script. The simplest way is to give the PID of any program running within the namespace. To be clear, this must be the PID of the target program as seen from the namespace in which nsenter is being run (since namespaces can be nested, it is possible for a single process to have many PIDs). To run a shell in the target PID/mount namespace, simply do:

sudo nsenter -t $PID -m -p

If this namespace is set up as above, ps will now list only processes within that namespace.
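Before entering, it can help to confirm which namespace a process belongs to and which PIDs it has at each nesting level. The sketch below assumes a Linux /proc filesystem; the helper names are hypothetical, and the NSpid field of /proc/<pid>/status is only present on Linux 4.1 or later. Inspecting processes owned by other users may require root.

```shell
#!/bin/sh
# Print the PID-namespace identifier of process $1 (a pid:[inode] link).
pid_ns_of() {
    readlink "/proc/$1/ns/pid"
}

# Succeed if processes $1 and $2 are in the same PID namespace.
same_pid_ns() {
    a=$(pid_ns_of "$1") && b=$(pid_ns_of "$2") && [ "$a" = "$b" ]
}

# Print the PIDs that process $1 has in each nested PID namespace,
# outermost first (NSpid field; absent on kernels older than 4.1).
ns_pids() {
    awk '/^NSpid:/ {$1 = ""; sub(/^ +/, ""); print}' "/proc/$1/status"
}
```

A process started with unshare -fp would typically show two values from ns_pids: its PID in the parent namespace followed by 1. Comparing same_pid_ns for a suspect process against the namespace's top-level process is one way to check that it has not escaped the jail.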

Related Solutions

Linux – Nice and Child Processes

A child process inherits whatever nice value is held by the parent at the time that it is forked (in your example, 5).

However, if the nice value of the parent process changes after forking the child processes, the child processes do not inherit the new nice value.
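The inheritance at fork time can be checked from a shell. This is a minimal sketch, assuming a nice implementation that prints the current niceness when run without arguments (GNU coreutils and BusyBox behave this way); the function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical demo: raise the nice value by 5 for an intermediate shell,
# then report the value seen by a process that shell starts.
nice_in_grandchild() {
    # The inner `nice` has no arguments, so it prints its own niceness,
    # which it inherited from the shell started by `nice -n 5`.
    nice -n 5 sh -c 'sh -c nice'
}

nice_in_grandchild
```

The printed value should be the calling shell's own nice value plus 5, as long as the result does not exceed the maximum of 19.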

You can easily observe this with the monitoring tool top. If the nice field (NI) is not shown by default, you can add it by pressing f and choosing I. This will add the NI column to the top display.

* I: NI = Nice value

 PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
1937 root      20   0  206m  66m  45m S  6.2  1.7  11:03.67 X

Good information from man 2 fork

fork() creates a new process by duplicating the calling process. The new process, referred to as the child, is an exact duplicate of the calling process, referred to as the parent, except for the following points:

The child has its own unique process ID, and this PID does not match the ID of any existing process group (setpgid(2)).

The child's parent process ID is the same as the parent's process ID.

The child does not inherit its parent's memory locks (mlock(2), mlockall(2)).

Process resource utilizations (getrusage(2)) and CPU time counters (times(2)) are reset to zero in the child.

The child's set of pending signals is initially empty (sigpending(2)).

The child does not inherit semaphore adjustments from its parent (semop(2)).

The child does not inherit record locks from its parent (fcntl(2)).

The child does not inherit timers from its parent (setitimer(2), alarm(2), timer_create(2)).

The child does not inherit outstanding asynchronous I/O operations from its parent (aio_read(3), aio_write(3)), nor does it inherit any asynchronous I/O contexts from its parent (see io_setup(2)).

Bash – Finding number of child processes of a particular process

ps -eo ppid= | grep -Fwc $pid

If your grep does not support -w:

ps -eo ppid= | tr -d '[:blank:]' | grep -Fxc $pid

or

ps -eo ppid= | awk '$1==ppid {++i} END {print i+0}' ppid=$pid

or (clobbering the positional parameters)

set -- $(ps -eo ppid= | grep -Fw "$pid"); echo $#

Note that this is not atomic, so the count may be wrong if some processes die and others get spawned in the short span of time it takes to gather the data.
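On Linux, the same count can also be taken from /proc without parsing ps output. This is a swapped-in technique rather than one of the commands above, and the function name count_children is hypothetical. It is subject to the same race: processes may appear or exit while the scan is running.

```shell
#!/bin/sh
# Hypothetical helper: count the direct children of PID $1 by reading
# the PPid field of every /proc/<pid>/status file (Linux only).
count_children() {
    n=0
    for f in /proc/[0-9]*/status; do
        # A process may exit between the glob and the read; skip it.
        ppid=$(awk '/^PPid:/ {print $2; exit}' "$f" 2>/dev/null) || continue
        if [ "$ppid" = "$1" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}
```

Note that calling count_children "$$" inside a command substitution may include the substitution's own subshell in the count, just as the ps pipelines above may count their own processes when $pid is the current shell.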