Linux – Did the pivot_root() documentation anticipate the feature of mount namespaces

initrd, linux-kernel, namespace

pivot_root() moves the root filesystem of the calling process to the directory put_old and makes new_root the new root filesystem of the
calling process.

The typical use of pivot_root() is during system startup, when the system mounts a temporary root filesystem (e.g., an initrd), then mounts
the real root filesystem, and eventually turns the latter into the current root of all relevant processes or threads.

pivot_root() may or may not change the current root and the current
working directory of any processes or threads which use the old root
directory. The caller of pivot_root() must ensure that processes with
root or current working directory at the old root operate correctly in
either case. An easy way to ensure this is to change their root and
current working directory to new_root before invoking pivot_root().

The paragraph above is intentionally vague because the implementation
of pivot_root() may change in the future. At the time of writing,
pivot_root() changes root and current working directory of each process
or thread to new_root if they point to the old root directory. This is
necessary in order to prevent kernel threads from keeping the old root
directory busy with their root and current working directory, even if
they never access the filesystem in any way. In the future, there may
be a mechanism for kernel threads to explicitly relinquish any access
to the filesystem, such that this fairly intrusive mechanism can be
removed from pivot_root().

…

BUGS

pivot_root() should not have to change root and current working
directory of all other processes in the system.

Some of the more obscure uses of pivot_root() may quickly lead to
insanity.

— man pivot_root, Linux man-pages 4.15

I'm working on a case where there are multiple processes running when pivot_root() is called.

The manpage doesn't seem very clear about how both possible implementations of pivot_root() can handle the case with multiple processes. Let's say we have two processes, S(ystemd) and P(lymouth). Currently, both P and S change their root and working directory to new_root, and then S calls pivot_root(). With the current implementation, this works fine.

Say both S and P "change their root directory" before pivot_root(), using chroot(). But, as man chroot tells us, it is possible to leave a chroot() jail if you are root (mkdir foo; chroot foo; cd ..; chroot .). It seems clear that processes have two associated roots:

their current chroot
the root of their mount namespace

After pivot_root(), S must observe that the root of its mount namespace is equal to its current chroot. If there were a deeper root filesystem that it could escape to at a future point, then that root filesystem would be busy and could not be unmounted. I think allowing the old root filesystem to be unmounted was the main purpose of pivot_root().

Currently, P observes the same thing – because it is in the same mount namespace as S.

It sounds like the alternative implementation of pivot_root() would put the calling process in a new, altered mount namespace. Is that a valid reading?

(I note this alternative implementation would make /sbin/pivot_root mostly pointless).

I believe the original pivot_root() actually predates mount namespaces. Do we know whether this plan for an alternative implementation of pivot_root() anticipated the need for some of the features of mount namespaces, or was this requirement overlooked?

(I note that mount namespaces also sound very much like a "mechanism for kernel threads to explicitly relinquish any access to the filesystem", e.g. kernel threads could do the equivalent of pivot_root() into an empty tmpfs).

Best Answer

It sounds like the alternative implementation of pivot_root() would put the calling process in a new, altered mount namespace. Is that a valid reading?

No. IMO this is not very clear, but there is a much more consistent and correct reading.

The essential part of pivot_root(), which must be the same in either implementation, is:

pivot_root() moves the root filesystem of the calling process to the directory put_old and makes new_root the new root filesystem of the calling process.

The essential part of pivot_root() is not limited only to the calling process. The operation described in this quote works on the mount namespace of the calling process. It will affect the view of all the processes in the same mount namespace.
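Whether two processes share a mount namespace can be checked by comparing the targets of their /proc/<pid>/ns/mnt symlinks. The following is only a minimal sketch, assuming a Linux procfs; the helper names and the `proc_root` parameter are made up for illustration, and inspecting another user's process may need extra privileges.

```python
import os


def mount_namespace_id(pid, proc_root="/proc"):
    """Return the mount namespace identifier of `pid`, e.g. 'mnt:[4026531840]'.

    `proc_root` is a hypothetical parameter so the helper can be pointed
    at a procfs mounted somewhere other than /proc.
    """
    return os.readlink(os.path.join(proc_root, str(pid), "ns", "mnt"))


def share_mount_namespace(pid_a, pid_b, proc_root="/proc"):
    """True if both processes are in the same mount namespace."""
    return mount_namespace_id(pid_a, proc_root) == mount_namespace_id(pid_b, proc_root)
```

If S and P from the question report the same identifier, a pivot_root() performed by S changes the mount tree that P sees as well.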

Consider the effect the essential change has on such a second process - or kernel thread - whose working directory was the old root filesystem. Its current directory will still be the old root filesystem. This will keep the /put_old mount point busy, and so it will not be possible to unmount the old root filesystem.

If you control this second process, you resolve this, as per the manpage, by setting its working directory to new_root before pivot_root() is called. After pivot_root() is called, its current directory will still be the new root filesystem.

So process S(ystemd) has been configured to signal process P(lymouth), to change working directory before S calls pivot_root(). No problem. But, we also have kernel threads, which start in /. The current implementation of pivot_root() takes care of the kernel threads for us; it is equivalent to setting the working directories of kernel threads and any other process to new_root before the essential part of pivot_root().

Except, the current implementation of pivot_root() only changes the working directory of a process if the old working directory was /. So it's actually quite easy to see the difference this makes:

$ unshare -rm
# cd /tmp    # work in a subdir instead of '/', and pivot_root() will not change it
# /bin/pwd
/tmp
# mount --bind /new-root /new-root
# pivot_root /new-root /new-root/mnt
# /bin/pwd
/mnt/tmp    # see below: if pivot_root had not updated our current chroot, this would still show /tmp

vs.

$ unshare -rm
# cd /
# /bin/pwd
/
# ls -lid .
2 dr-xr-xr-x. 19 nfsnobody nfsnobody 4096 Jun 13 01:17 .
# ls -lid /new-root
6424395 dr-xr-xr-x. 20 nfsnobody nfsnobody 4096 May 10 12:53 /new-root
# mount --bind /new-root /new-root
# pivot_root /new-root /new-root/mnt
# /bin/pwd
/
# ls -lid .
6424395 dr-xr-xr-x. 20 nobody nobody 4096 May 10 12:53 .
# ls -lid /
6424395 dr-xr-xr-x. 20 nobody nobody 4096 May 10 12:53 /
# ls -lid /mnt
2 dr-xr-xr-x. 19 nobody nobody 4096 Jun 13 01:17 /mnt
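The difference between the two transcripts can be summarised with a toy model. This is only a sketch of the rule described above, not kernel code: the `Process` class, the function name and the "filesystem:path" strings are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Process:
    name: str
    root: str  # directory the process treats as "/"
    cwd: str   # current working directory


def pivot_root_refs(processes, old_root, new_root):
    """Toy model of the reference fix-up described above.

    Only references that point exactly at `old_root` are switched to
    `new_root`. Anything else (for example a cwd in a subdirectory)
    is left alone and keeps the old root filesystem busy.
    """
    for proc in processes:
        if proc.root == old_root:
            proc.root = new_root
        if proc.cwd == old_root:
            proc.cwd = new_root
    return processes


procs = [
    Process("shell-in-slash", root="old:/", cwd="old:/"),
    Process("shell-in-tmp", root="old:/", cwd="old:/tmp"),
]
pivot_root_refs(procs, old_root="old:/", new_root="new:/")
for proc in procs:
    print(proc.name, proc.root, proc.cwd)
```

In this model the shell that stayed in / ends up with both references on the new filesystem, while the shell in /tmp gets a new root but keeps a working directory on the old one, which matches the /mnt/tmp result in the first transcript.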

Now that I understand what's happening with the working directory, I find it easier to understand what's happening with chroot(). The current chroot of the process which calls pivot_root() may be a reference to the original root filesystem, just as its current working directory may be.

Note, if you do chdir()+pivot_root() but forget to chroot(), your current directory will be outside your current chroot. When your current directory is outside your current chroot, things get quite confusing. You probably don't want to run your program in this state.

# cd /
# python
>>> import os
>>> os.chroot("/new-root")
>>> os.system("/bin/pwd")
(unreachable)/
0
>>> os.getcwd()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OSError: [Errno 2] No such file or directory
>>> os.system("ls -l ./proc/self/cwd")
lrwxrwxrwx. 1 root root 0 Jun 17 13:46 ./proc/self/cwd -> /
0
>>> os.system("ls -lid ./proc/self/cwd/")
2 dr-xr-xr-x. 19 root root 4096 Jun 13 01:17 ./proc/self/cwd/
0
>>> os.system("ls -lid /")
6424395 dr-xr-xr-x. 20 root root 4096 May 10 12:53 /
0

POSIX does not specify the result of pwd or getcwd() in this situation :). POSIX gives no warning that you might get a "No such file or directory" (ENOENT) error from getcwd(). The Linux manpages point out that this error is possible if the working directory was unlinked (e.g. with rm). I think this is a very good parallel.
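The unlinked-directory case is easy to reproduce without root or chroot(). The following is a small sketch, assuming Linux behaviour; the function name is made up for illustration.

```python
import errno
import os
import tempfile


def getcwd_errno_after_unlink():
    """Remove the current directory, then return the errno from getcwd().

    Returns None if getcwd() unexpectedly succeeds.
    """
    saved = os.getcwd()
    doomed = tempfile.mkdtemp()
    try:
        os.chdir(doomed)
        os.rmdir(doomed)  # the cwd is now unlinked, as after `rm -r`
        try:
            os.getcwd()
        except OSError as exc:
            return exc.errno
        return None
    finally:
        os.chdir(saved)  # go back so the caller is not left stranded


print(getcwd_errno_after_unlink() == errno.ENOENT)
```

This is the same error that the Python session above runs into, reached by a different route.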

Related Solutions

Reliable Way to Jail Child Processes Using Nsenter

Create a PID namespace

The correct command to use here is unshare. Note that the necessary options to do this are only available from util-linux 2.23. The idea is to create a new PID namespace for the program you are running such that all its children are also created in this namespace. You can run a command in a new PID namespace simply by doing:

sudo unshare -fp some_command

To run a shell, just omit the command. This will create a process which, along with any of its children, will have a PID as usual within the parent (system) namespace. However, within the new namespace, it will have a PID of 1 along with some of the special characteristics of the init process. Perhaps the most relevant characteristic from a monitoring perspective is that if any of its descendants are orphaned, they will be re-parented to this process rather than the real init process.

Simply doing this may be enough for most monitoring cases. As previously mentioned, the processes within the namespace all have PIDs within the parent namespace, so regular commands can be used to monitor their activity. We are also assured that if any process in the namespace becomes orphaned, it will not fall out of the process tree branches beneath the PID of the top-level program, meaning that it can still easily be kept track of.

Combine with a mount namespace

However, what we can't do is monitor the process with respect to the PID that it thinks it has. To do this, and in particular to be able to use the ps command within the new namespace, you need to mount a separate procfs filesystem for the namespace. This in turn leads to another problem, since the only location that ps accepts for procfs is /proc. One solution would be to create a chroot jail and mount the new procfs there. But this is a cumbersome approach, as at a minimum we would need to copy (or at least hard link) any binaries that we intend to use, along with any libraries they depend on, to the new root.

The solution is to also use a new mount namespace. Within this we can mount the new procfs in a way that uses the true root /proc directory, is usable within the PID namespace and doesn't interfere with anything else. To make this process very simple, the unshare command provides the --mount-proc option:

sudo unshare -fp --mount-proc some_command

Now running ps within the combined namespaces will show only the processes within the PID namespace, and it will show the top-level process as having a PID of 1.

What about `nsenter`?

As the name suggests, nsenter can be used to enter a namespace that has already been created with unshare. This is useful if we want to get information only available from inside the namespace from an otherwise unrelated script. The simplest way is to give the PID of any program running within the namespace. To be clear, this must be the PID of the target program within the namespace from which nsenter is being run (since namespaces can be nested, it is possible for a single process to have many PIDs). To run a shell in the target PID/mount namespace, simply do:

sudo nsenter -t $PID -m -p

If this namespace is set up as above, ps will now list only processes within that namespace.
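The remark that a single process can have many PIDs can be checked from the outside: on kernels 4.1 and later, /proc/<pid>/status carries an NSpid line listing the PID in each nested namespace. A minimal sketch of reading it follows; the function names, the `proc_root` parameter and the sample text are made up for illustration.

```python
def parse_nspid(status_text):
    """Return the PIDs from the 'NSpid:' line of a /proc/<pid>/status dump.

    The list runs from the outermost PID namespace visible through that
    procfs to the innermost one. Returns [] if the field is missing
    (kernels older than 4.1 do not provide it).
    """
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(field) for field in line.split()[1:]]
    return []


def nspid_of(pid, proc_root="/proc"):
    """Read and parse the status file of `pid` (proc_root is illustrative)."""
    with open(f"{proc_root}/{pid}/status") as status:
        return parse_nspid(status.read())


# Made-up sample: PID 4242 on the host, PID 1 inside the new namespace.
sample = "Name:\tbash\nPid:\t4242\nNSpid:\t4242\t1\n"
print(parse_nspid(sample))
```

The first number is the one to pass to nsenter -t when running it from the outer namespace; the last one is what ps shows inside.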

Linux – What happens if the last process in a namespace exits

I was wondering the same thing, and so I ran a little test (on kernel 4.20.0, using unshare from util-linux 2.33; the manpage for unshare in that version has some notes on shared/private mounts that are worth reading, and YMMV if you are using an older version).

TL;DR: Yes, the filesystem is unmounted when the last process in the namespace exits.

In my case, the device I'm testing with is dm-6, and it is not mounted in the "outer" namespace.

Window 1:

cd /sys/fs/ext4
ls -d dm-6
# No such file or directory

Window 2:

unshare -m
mount /dev/dm-6 /mnt/tmp
# don't exit yet, keep the namespace active

Window 3: Do the same thing as window 2.

Window 1:

ls -d dm-6
# exists now

Window 2: Exit the unshare environment

Window 1: Check again, dm-6 is still there

Window 3: Exit the unshare environment

Window 1: Check again, dm-6 is gone again

Another useful demo / test: Similar idea, but instead of having 3 windows, enter and exit Window 2 twice. Check dmesg or the logs, and verify that the kernel message saying it mounted the filesystem appears twice in this case.
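The "Window 1" check can be scripted instead of repeated by hand. This is a rough sketch that only looks for the per-device directory under /sys/fs/ext4, as in the test above; the function names and the `sysfs_ext4` parameter are made up for illustration.

```python
import os
import time


def ext4_instance_present(device, sysfs_ext4="/sys/fs/ext4"):
    """True if ext4 currently has an active instance for `device`.

    `device` is the kernel name, such as 'dm-6' in the test above.
    """
    return os.path.isdir(os.path.join(sysfs_ext4, device))


def wait_until_gone(device, timeout=5.0, interval=0.1, sysfs_ext4="/sys/fs/ext4"):
    """Poll until the device directory disappears; False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if not ext4_instance_present(device, sysfs_ext4):
            return True
        time.sleep(interval)
    return not ext4_instance_present(device, sysfs_ext4)
```

Calling wait_until_gone("dm-6") right after the last unshare shell exits would show whether the filesystem was released.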
