Linux – Did the pivot_root() documentation anticipate the feature of mount namespaces?

initrd linux-kernel namespace

pivot_root() moves the root filesystem of the calling process to the directory put_old and makes new_root the new root filesystem of the
calling process.

The typical use of pivot_root() is during system startup, when the system mounts a temporary root filesystem (e.g., an initrd), then mounts
the real root filesystem, and eventually turns the latter into the current root of all relevant processes or threads.

pivot_root() may or may not change the current root and the current
working directory of any processes or threads which use the old root
directory. The caller of pivot_root() must ensure that processes with
root or current working directory at the old root operate correctly in
either case. An easy way to ensure this is to change their root and
current working directory to new_root before invoking pivot_root().

The paragraph above is intentionally vague because the implementation
of pivot_root() may change in the future. At the time of writing,
pivot_root() changes root and current working directory of each process
or thread to new_root if they point to the old root directory. This is
necessary in order to prevent kernel threads from keeping the old root
directory busy with their root and current working directory, even if
they never access the filesystem in any way. In the future, there may
be a mechanism for kernel threads to explicitly relinquish any access
to the filesystem, such that this fairly intrusive mechanism can be
removed from pivot_root().

BUGS

pivot_root() should not have to change root and current working directory
of all other processes in the system.

Some of the more obscure uses of pivot_root() may quickly lead to
insanity.

man pivot_root, Linux man-pages 4.15

I'm working on a case where there are multiple processes running when pivot_root() is called.

The manpage doesn't seem very clear about how both possible implementations of pivot_root() can handle the case with multiple processes. Let's say we have two processes, S(ystemd) and P(lymouth). Currently, both P and S change their root and working directory to new_root, and then S calls pivot_root(). With the current implementation, this works fine.

Say both S and P "change their root directory" before pivot_root(), using chroot(). But, as man chroot tells us, it is possible to leave a chroot() jail if you are root (mkdir foo; chroot foo; cd ..; chroot .). It seems clear that processes have two associated roots:

  1. their current chroot
  2. the root of their mount namespace

After pivot_root(), S must observe that the root of its mount namespace is equal to its current chroot. If there were a deeper root filesystem that it could escape to at a future point, then that root filesystem would be busy and could not be unmounted. I think allowing the old root filesystem to be unmounted was the main purpose of pivot_root().

Currently, P observes the same thing – because it is in the same mount namespace as S.

It sounds like the alternative implementation of pivot_root() would put the calling process in a new, altered mount namespace. Is that a valid reading?

(I note this alternative implementation would make /sbin/pivot_root mostly pointless).

I believe the original pivot_root() actually predates mount namespaces. Do we know whether this plan for an alternative implementation of pivot_root() anticipated the need for some of the features of mount namespaces, or was this requirement overlooked?

(I note that mount namespaces also sound very much like a "mechanism for kernel threads to explicitly relinquish any access to the filesystem", e.g. kernel threads could do the equivalent of pivot_root() into an empty tmpfs).

Best Answer

It sounds like the alternative implementation of pivot_root() would put the calling process in a new, altered mount namespace. Is that a valid reading?

No. IMO this is not very clear, but there is a much more consistent and correct reading.

The essential part of pivot_root(), which must be the same in either implementation, is:

pivot_root() moves the root filesystem of the calling process to the directory put_old and makes new_root the new root filesystem of the calling process.

The essential part of pivot_root() is not limited only to the calling process. The operation described in this quote works on the mount namespace of the calling process. It will affect the view of all the processes in the same mount namespace.

Consider the effect the essential change has on such a second process - or kernel thread - whose working directory was the old root filesystem. Its current directory will still be the old root filesystem. This will keep the /put_old mount point busy, and so it will not be possible to unmount the old root filesystem.

If you control this second process, you resolve this, as per the manpage, by setting its working directory to new_root before pivot_root() is called. After pivot_root() is called, its current directory will still be the new root filesystem.

So process S(ystemd) has been configured to signal process P(lymouth) to change its working directory before S calls pivot_root(). No problem. But we also have kernel threads, which start in /. The current implementation of pivot_root() takes care of the kernel threads for us; it is equivalent to setting the working directories of kernel threads and any other process to new_root before the essential part of pivot_root().

However, the current implementation of pivot_root() only changes the working directory of a process if the old working directory was /. So it's actually quite easy to see the difference this makes:

$ unshare -rm
# cd /tmp    # work in a subdir instead of '/', and pivot_root() will not change it
# /bin/pwd
/tmp
# mount --bind /new-root /new-root
# pivot_root /new-root /new-root/mnt
# /bin/pwd
/mnt/tmp    # see below: if pivot_root had not updated our current chroot, this would still show /tmp

vs.

$ unshare -rm
# cd /
# /bin/pwd
/
# ls -lid .
2 dr-xr-xr-x. 19 nfsnobody nfsnobody 4096 Jun 13 01:17 .
# ls -lid /new-root
6424395 dr-xr-xr-x. 20 nfsnobody nfsnobody 4096 May 10 12:53 /new-root
# mount --bind /new-root /new-root
# pivot_root /new-root /new-root/mnt
# /bin/pwd
/
# ls -lid .
6424395 dr-xr-xr-x. 20 nobody nobody 4096 May 10 12:53 .
# ls -lid /
6424395 dr-xr-xr-x. 20 nobody nobody 4096 May 10 12:53 /
# ls -lid /mnt
2 dr-xr-xr-x. 19 nobody nobody 4096 Jun 13 01:17 /mnt
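
The difference between the two transcripts above can be mimicked with a toy model. This is only a sketch of the behaviour the manpage describes ("changes root and current working directory of each process or thread to new_root if they point to the old root directory"); it is not kernel code, and the `Task` class, the `("old", "/")` style labels and the function names are all made up for illustration.

```python
from dataclasses import dataclass

# A location is modelled as (filesystem label, path inside that filesystem).
OLD_ROOT = ("old", "/")
NEW_ROOT = ("new", "/")


@dataclass
class Task:
    name: str
    root: tuple  # the task's current root (what chroot() would change)
    cwd: tuple   # the task's current working directory


def toy_pivot_root(tasks, old_root=OLD_ROOT, new_root=NEW_ROOT):
    """Rewrite root/cwd only where they point exactly at the old root."""
    for task in tasks:
        if task.root == old_root:
            task.root = new_root
        if task.cwd == old_root:
            task.cwd = new_root
    return tasks


def demo():
    tasks = [
        # First transcript: the shell did `cd /tmp` before pivot_root.
        Task("shell-in-subdir", root=OLD_ROOT, cwd=("old", "/tmp")),
        # Second transcript: the shell stayed in `/`.
        Task("shell-in-root", root=OLD_ROOT, cwd=OLD_ROOT),
    ]
    toy_pivot_root(tasks)
    return {t.name: (t.root, t.cwd) for t in tasks}


if __name__ == "__main__":
    for name, (root, cwd) in demo().items():
        print(name, "root:", root, "cwd:", cwd)
```

In the model, the task sitting in a subdirectory keeps its working directory on the old filesystem (which is why the first transcript prints /mnt/tmp), while the task sitting in / is moved to the new root.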

Now that I understand what's happening with the working directory, I find it easier to understand what's happening with chroot(). The current chroot of the process which calls pivot_root() may be a reference to the original root filesystem, just as its current working directory may be.

Note that if you do chdir()+pivot_root() but forget to chroot(), your current directory will be outside your current chroot. When your current directory is outside your current chroot, things get quite confusing. You probably don't want to run your program in this state.

# cd /
# python
>>> import os
>>> os.chroot("/new-root")
>>> os.system("/bin/pwd")
(unreachable)/
0
>>> os.getcwd()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OSError: [Errno 2] No such file or directory
>>> os.system("ls -l ./proc/self/cwd")
lrwxrwxrwx. 1 root root 0 Jun 17 13:46 ./proc/self/cwd -> /
0
>>> os.system("ls -lid ./proc/self/cwd/")
2 dr-xr-xr-x. 19 root root 4096 Jun 13 01:17 ./proc/self/cwd/
0
>>> os.system("ls -lid /")
6424395 dr-xr-xr-x. 20 root root 4096 May 10 12:53 /
0
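
The inode comparisons done with `ls -lid` above can also be done from inside a program. The following is a minimal sketch, not an established API: `same_directory` and `cwd_is_under_root` are hypothetical helper names, and the check assumes that (st_dev, st_ino) identifies a directory, which bind mounts can defeat.

```python
import os


def same_directory(a, b):
    """True if both paths resolve to the same (device, inode) pair."""
    sa, sb = os.stat(a), os.stat(b)
    return (sa.st_dev, sa.st_ino) == (sb.st_dev, sb.st_ino)


def cwd_is_under_root(max_depth=256):
    """Walk '..' upwards from the cwd and report whether we meet '/'.

    If the walk reaches a directory that is its own parent without ever
    matching the process root, the cwd lies outside the current root.
    """
    root = os.stat("/")
    root_key = (root.st_dev, root.st_ino)
    path = "."
    previous = None
    for _ in range(max_depth):
        st = os.stat(path)
        key = (st.st_dev, st.st_ino)
        if key == root_key:
            return True
        if key == previous:
            # '..' no longer moves: top of the tree, and it was not '/'.
            return False
        previous = key
        path = os.path.join(path, "..")
    return False


if __name__ == "__main__":
    print("cwd is root:", same_directory(".", "/"))
    print("cwd is under root:", cwd_is_under_root())
```

In the Python session above, `same_directory(".", "/")` would be expected to return False after the chroot(), since `.` is inode 2 and `/` is inode 6424395.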

POSIX does not specify the result of pwd or getcwd() in this situation :). POSIX gives no warning that you might get a "No such file or directory" (ENOENT) error from getcwd(). Linux manpages point out this error as being possible if the working directory was unlinked (e.g. with rm). I think this is a very good parallel.
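
The unlinked-directory case is easy to reproduce without root privileges. Here is a minimal sketch, assuming Linux behaviour of getcwd(); the helper names `cwd_is_resolvable` and `demo_unlinked_cwd` are invented for this example.

```python
import os
import tempfile


def cwd_is_resolvable():
    """True if os.getcwd() can produce a path for the current directory."""
    try:
        os.getcwd()
    except FileNotFoundError:  # ENOENT, as in the traceback above
        return False
    return True


def demo_unlinked_cwd():
    """Enter a temporary directory, remove it, then ask for the cwd."""
    saved = os.getcwd()
    scratch = tempfile.mkdtemp()
    try:
        os.chdir(scratch)
        os.rmdir(scratch)  # the cwd still exists as an inode, but has no name
        return cwd_is_resolvable()
    finally:
        os.chdir(saved)  # always return to a sane working directory


if __name__ == "__main__":
    print("normal cwd resolvable:", cwd_is_resolvable())
    print("unlinked cwd resolvable:", demo_unlinked_cwd())
```

A program that may be started in either state (unlinked cwd, or cwd outside its chroot) could use a check like this before relying on os.getcwd().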
