It sounds like the alternative implementation of pivot_root() would put the calling process in a new, altered mount namespace. Is that a valid reading?
No. IMO this is not very clear, but there is a much more consistent and correct reading.
The essential part of pivot_root(), which must be the same in either implementation, is:
pivot_root() moves the root filesystem of the calling process to the directory put_old and makes new_root the new root filesystem of the calling process.
The essential part of pivot_root() is not limited only to the calling process. The operation described in this quote works on the mount namespace of the calling process. It will affect the view of all the processes in the same mount namespace.
Consider the effect this essential change has on a second process (or kernel thread) whose working directory was the old root filesystem. Its current directory will still be the old root filesystem. This will keep the /put_old mount point busy, and so it will not be possible to unmount the old root filesystem.
If you control this second process, you resolve this, as per the manpage, by setting its working directory to new_root before pivot_root() is called. After pivot_root() is called, its current directory will still be the new root filesystem.
So process S(ystemd) has been configured to signal process P(lymouth) to change its working directory before S calls pivot_root(). No problem. But we also have kernel threads, which start in /. The current implementation of pivot_root() takes care of the kernel threads for us; it is equivalent to setting the working directories of kernel threads and any other process to new_root before the essential part of pivot_root().
Except, the current implementation of pivot_root() only changes the working directory of a process if the old working directory was /. So it's actually quite easy to see the difference this makes:
$ unshare -rm
# cd /tmp # work in a subdir instead of '/', and pivot_root() will not change it
# /bin/pwd
/tmp
# mount --bind /new-root /new-root
# pivot_root /new-root /new-root/mnt
# /bin/pwd
/mnt/tmp # see below: if pivot_root had not updated our current chroot, this would still show /tmp
vs.
$ unshare -rm
# cd /
# /bin/pwd
/
# ls -lid .
2 dr-xr-xr-x. 19 nfsnobody nfsnobody 4096 Jun 13 01:17 .
# ls -lid /new-root
6424395 dr-xr-xr-x. 20 nfsnobody nfsnobody 4096 May 10 12:53 /new-root
# mount --bind /new-root /new-root
# pivot_root /new-root /new-root/mnt
# /bin/pwd
/
# ls -lid .
6424395 dr-xr-xr-x. 20 nobody nobody 4096 May 10 12:53 .
# ls -lid /
6424395 dr-xr-xr-x. 20 nobody nobody 4096 May 10 12:53 /
# ls -lid /mnt
2 dr-xr-xr-x. 19 nobody nobody 4096 Jun 13 01:17 /mnt
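The ls -lid comparisons above identify a directory by its inode number. The same check can be sketched in Python. The helper names dir_identity and cwd_is_root are illustrative and not part of any library, and the device number is compared as well, since inode numbers are only unique within one filesystem.

```python
import os


def dir_identity(path):
    """Return the (device, inode) pair that identifies a directory."""
    st = os.stat(path)
    return (st.st_dev, st.st_ino)


def cwd_is_root():
    """True if '.' and '/' currently refer to the same directory."""
    return dir_identity(".") == dir_identity("/")
```

Calling cwd_is_root() before and after pivot_root in the two transcripts would be one way to tell the cases apart without reading inode numbers by eye.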
Now that I understand what's happening with the working directory, I find it easier to understand what's happening with chroot(). The current chroot of the process which calls pivot_root() may be a reference to the original root filesystem, just as its current working directory may be.
Note, if you do chdir()+pivot_root() but forget to chroot(), your current directory will be outside your current chroot. When your current directory is outside your current chroot, things get quite confusing. You probably don't want to run your program in this state.
# cd /
# python
>>> import os
>>> os.chroot("/newroot")
>>> os.system("/bin/pwd")
(unreachable)/
0
>>> os.getcwd()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
OSError: [Errno 2] No such file or directory
>>> os.system("ls -l ./proc/self/cwd")
lrwxrwxrwx. 1 root root 0 Jun 17 13:46 ./proc/self/cwd -> /
0
>>> os.system("ls -lid ./proc/self/cwd/")
2 dr-xr-xr-x. 19 root root 4096 Jun 13 01:17 ./proc/self/cwd/
0
>>> os.system("ls -lid /")
6424395 dr-xr-xr-x. 20 root root 4096 May 10 12:53 /
0
POSIX does not specify the result of pwd or getcwd() in this situation :). POSIX gives no warning that you might get a "No such file or directory" (ENOENT) error from getcwd(). Linux manpages point out this error as being possible, if the working directory was unlinked (e.g. with rm). I think this is a very good parallel.
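The unlinked-directory parallel can be tried without any privileges. This is a minimal sketch, assuming Linux behaviour; the function name getcwd_after_unlink is made up for illustration. It enters a temporary directory, removes it, and reports what getcwd() does next.

```python
import errno
import os
import tempfile


def getcwd_after_unlink():
    """Remove the current directory, then report what getcwd() does."""
    saved = os.getcwd()
    doomed = tempfile.mkdtemp()
    os.chdir(doomed)
    os.rmdir(doomed)  # the working directory no longer has a name
    try:
        return os.getcwd()
    except OSError as exc:
        # Return the symbolic errno name, e.g. "ENOENT"
        return errno.errorcode[exc.errno]
    finally:
        os.chdir(saved)  # restore a sane working directory
```

On Linux this is expected to return "ENOENT", the same error the Python session above hit while its working directory was outside the chroot.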
Best Answer
As described in the kernel commit log linked to by jiliagre above, the nsfs filesystem is a virtual filesystem making Linux-kernel namespaces available. It is separate from the /proc "proc" filesystem, where some process directory entries reference inodes in the nsfs filesystem in order to show which namespaces a certain process (or thread) is currently using.

The nsfs doesn't get listed in /proc/filesystems (while proc does), so it cannot be explicitly mounted. Running mount -t nsfs ./namespaces fails with "unknown filesystem type". This appears to be because nsfs is tightly interwoven with the proc filesystem.

The filesystem type nsfs only becomes visible via /proc/$PID/mountinfo when bind-mounting an existing(!) namespace filesystem link to another target. As Stephen Kitt rightly suggests above, this is to keep namespaces existing even if no process is using them anymore.

For example, create a new user namespace with a new network namespace, then bind-mount it, then exit: the namespace still exists, but lsns won't find it, since it's not listed in /proc/$PID/ns anymore, but exists as a (bind) mount point.
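Bind-mounted namespaces of this kind can be located by scanning mountinfo for entries of type nsfs. The sketch below assumes the field layout documented in proc(5): the root is the fourth field, the mount point is the fifth, and the filesystem type follows the " - " separator. The function names and the SAMPLE line are hypothetical, not captured output.

```python
# Hypothetical mountinfo line for a bind-mounted network namespace.
SAMPLE = "254 27 0:4 net:[4026532371] /tmp/netns rw shared:120 - nsfs nsfs rw"


def nsfs_mounts(mountinfo_text):
    """Return (mount_point, root) pairs for every nsfs entry in the text."""
    found = []
    for line in mountinfo_text.splitlines():
        left, sep, right = line.partition(" - ")
        if not sep:
            continue  # not a well-formed mountinfo line
        fields = left.split()
        tail = right.split()
        if len(fields) >= 5 and tail and tail[0] == "nsfs":
            found.append((fields[4], fields[3]))
    return found


def current_nsfs_mounts(pid="self"):
    """Read /proc/<pid>/mountinfo and list its nsfs bind mounts."""
    with open(f"/proc/{pid}/mountinfo") as handle:
        return nsfs_mounts(handle.read())
```

The root field of such an entry names the namespace type and its inode, which is how the mount can be matched to a namespace that lsns no longer shows.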
Please note that it is not possible to create namespaces via the nsfs filesystem, only via the syscalls clone() (CLONE_NEW...) and unshare. The nsfs only reflects the current kernel status w.r.t. namespaces, but it cannot create or destroy them.

Namespaces automatically get destroyed whenever there isn't any reference to them left, no processes (so no /proc/$PID/ns/...) AND no bind-mounts either, as we've explored in the above example.
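Since the /proc/$PID/ns entries are one of the references that keep a namespace alive, listing them is a read-only way to see which namespaces a process holds. This is a rough sketch, assuming the link targets have the type:[inode] form described in namespaces(7); parse_ns_link and namespaces_of are illustrative names.

```python
import os
import re

# Link targets look like "net:[4026531992]" (form per namespaces(7)).
_NS_LINK = re.compile(r"^(\w+):\[(\d+)\]$")


def parse_ns_link(target):
    """Split a namespace link target into (type, inode number)."""
    match = _NS_LINK.match(target)
    if match is None:
        raise ValueError(f"unexpected namespace link target: {target!r}")
    return match.group(1), int(match.group(2))


def namespaces_of(pid="self"):
    """Map each entry in /proc/<pid>/ns to its (type, inode) pair."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    for name in sorted(os.listdir(ns_dir)):
        target = os.readlink(os.path.join(ns_dir, name))
        result[name] = parse_ns_link(target)
    return result
```

Two processes share a namespace when the inode numbers for the same entry match. Reading the links of another user's process may fail with a permission error, which this sketch does not handle.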