linux – What is the NSFS Filesystem

filesystems linux namespace snap

The kernel contains a filesystem, nsfs. snapd creates an nsfs mount under /run/snapd/ns/<snapname>.mnt for each installed snap. ls shows it as a 0-byte file.

The kernel source code does not seem to contain any documentation or comments about it. The main implementation seems to be here and the header file here.

From that, it seems to be namespace related.

A search of the repo does not even find Kconfig entries to enable or disable it…

What is the purpose of this filesystem and what is it used for?

Best Answer

As described in the kernel commit log linked to by jiliagre above, the nsfs filesystem is a virtual filesystem making Linux-kernel namespaces available. It is separate from the /proc "proc" filesystem, where some process directory entries reference inodes in the nsfs filesystem in order to show which namespaces a certain process (or thread) is currently using.
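To make the link between /proc and nsfs concrete, here is a minimal Python sketch that reads the symlinks under /proc/<pid>/ns and extracts the inode number from each target. The helper names parse_ns_link and namespaces_of are made up for this illustration; the only assumption about the kernel is that link targets have the form type:[inode], as in the net:[4026532992] output shown further down.

```python
import os
import re

# Link targets under /proc/<pid>/ns/ look like "net:[4026532992]".
_NS_LINK = re.compile(r"^(?P<kind>[a-z_]+):\[(?P<inode>\d+)\]$")


def parse_ns_link(target: str) -> tuple[str, int]:
    """Split a namespace link target into (namespace type, inode number)."""
    match = _NS_LINK.match(target)
    if match is None:
        raise ValueError(f"not a namespace link target: {target!r}")
    return match.group("kind"), int(match.group("inode"))


def namespaces_of(pid: str = "self") -> dict[str, int]:
    """Map each entry in /proc/<pid>/ns to the inode number in its link target."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    for name in sorted(os.listdir(ns_dir)):
        _kind, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, name)))
        result[name] = inode
    return result


if __name__ == "__main__":
    try:
        for name, inode in namespaces_of().items():
            print(f"{name:20} {inode}")
    except OSError as err:
        # e.g. not running on Linux, or /proc is not mounted
        print(f"could not read namespace links: {err}")
```

Two processes that print the same number for an entry share that namespace.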

The nsfs doesn't get listed in /proc/filesystems (while proc does), so it cannot be explicitly mounted. mount -t nsfs ./namespaces fails with "unknown filesystem type". This is because nsfs is tightly interwoven with the proc filesystem.
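The /proc/filesystems claim can be checked with a few lines of Python. This is only a sketch: listed_filesystems is a hypothetical helper, and the sample text in the usage note is made up to mirror the file's two-column layout (an optional "nodev" marker followed by the type name).

```python
def listed_filesystems(text: str) -> set[str]:
    """Return the filesystem type names found in /proc/filesystems content."""
    names = set()
    for line in text.splitlines():
        fields = line.split()
        if fields:
            # The type name is the last column; "nodev" may precede it.
            names.add(fields[-1])
    return names


if __name__ == "__main__":
    try:
        with open("/proc/filesystems") as handle:
            names = listed_filesystems(handle.read())
        print("proc listed:", "proc" in names)
        print("nsfs listed:", "nsfs" in names)
    except OSError as err:
        print(f"could not read /proc/filesystems: {err}")
```

For a quick check without /proc, listed_filesystems("nodev\tproc\n\text4\n") yields the set containing proc and ext4.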

The filesystem type nsfs only becomes visible via /proc/$PID/mountinfo when bind-mounting an existing(!) namespace filesystem link to another target. As Stephen Kitt rightly suggests above, this is to keep namespaces existing even if no process is using them anymore.

For example, create a new network namespace, bind-mount it, then exit the shell that created it: the namespace still exists, but lsns won't find it, because it is no longer referenced from any /proc/$PID/ns and exists only as a (bind) mount point.

# bind mount only needs an inode, not necessarily a directory ;)
touch mynetns
# create new network namespace, show its id and then bind-mount it, so it
# is kept existing after the unshare'd bash has terminated.
# output: net:[##########]
NS=$(sudo unshare -n bash -c "readlink /proc/self/ns/net && mount --bind /proc/self/ns/net mynetns") && echo $NS
# notice how lsns cannot see this namespace anymore: no match!
lsns -t net | grep ${NS:5:-1} || echo "lsns: no match for net:[${NS:5:-1}]"
# however, findmnt does locate it on the nsfs...
findmnt -t nsfs | grep ${NS:5:-1} || echo "no match for net:[${NS:5:-1}]"
# output: /home/.../mynetns nsfs[net:[##########]] nsfs rw
# let the namespace go...
echo "unbinding + releasing network namespace"
sudo umount mynetns
findmnt -t nsfs | grep ${NS:5:-1} || echo "findmnt: no match for net:[${NS:5:-1}]"
# clean up
rm mynetns

Output should be similar to this one:

net:[4026532992]
lsns: no match for net:[4026532992]
/home/.../mynetns nsfs[net:[4026532992]] nsfs   rw
unbinding + releasing network namespace
findmnt: no match for net:[4026532992]
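The same lookup that findmnt performs can be sketched in Python by parsing /proc/self/mountinfo directly. The helper name nsfs_mounts is made up for this example, and the sample line in the usage note is a hand-written illustration (hypothetical mount IDs and path), not captured output. The sketch assumes the mountinfo layout documented in proc(5): the fourth field is the root of the mount within its filesystem, the fifth is the mount point, and the filesystem type follows a lone "-" separator.

```python
def nsfs_mounts(mountinfo_text: str) -> list[tuple[str, str]]:
    """Return (namespace, mount point) pairs for nsfs entries in mountinfo content."""
    found = []
    for line in mountinfo_text.splitlines():
        fields = line.split()
        try:
            # Six fixed fields come first, then optional fields, then "-".
            sep = fields.index("-", 6)
        except ValueError:
            continue  # not a well-formed mountinfo line
        if len(fields) > sep + 1 and fields[sep + 1] == "nsfs":
            # fields[3] is the root within nsfs, e.g. "net:[4026532992]"
            found.append((fields[3], fields[4]))
    return found


if __name__ == "__main__":
    try:
        with open("/proc/self/mountinfo") as handle:
            for namespace, mount_point in nsfs_mounts(handle.read()):
                print(f"{mount_point} keeps {namespace} alive")
    except OSError as err:
        print(f"could not read mountinfo: {err}")
```

Fed a line such as "225 29 0:4 net:[4026532992] /home/user/mynetns rw shared:1 - nsfs nsfs rw", the function returns one pair naming the namespace and the file it is bound to.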

Please note that it is not possible to create namespaces via the nsfs filesystem, only via the syscalls clone() (CLONE_NEW...) and unshare(). The nsfs only reflects the current kernel status w.r.t. namespaces, but it cannot create or destroy them.

Namespaces automatically get destroyed whenever there isn't any reference to them left, no processes (so no /proc/$PID/ns/...) AND no bind-mounts either, as we've explored in the above example.

Related Solutions

Linux – Did the pivot_root() documentation anticipate the feature of mount namespaces

It sounds like the alternative implementation of pivot_root() would put the calling process in a new, altered mount namespace. Is that a valid reading?

No. IMO this is not very clear, but there is a much more consistent and correct reading.

The essential part of pivot_root(), which must be the same in either implementation, is:

pivot_root() moves the root filesystem of the calling process to the directory put_old and makes new_root the new root filesystem of the calling process.

The essential part of pivot_root() is not limited only to the calling process. The operation described in this quote works on the mount namespace of the calling process. It will affect the view of all the processes in the same mount namespace.

Consider the effect the essential change has on such a second process - or kernel thread - whose working directory was the old root filesystem. Its current directory will still be the old root filesystem. This will keep the /put_old mount point busy, and so it will not be possible to unmount the old root filesystem.

If you control this second process, you resolve this, as per the manpage, by setting its working directory to new_root before pivot_root() is called. After pivot_root() is called, its current directory will still be the new root filesystem.

So process S(ystemd) has been configured to signal process P(lymouth), to change working directory before S calls pivot_root(). No problem. But, we also have kernel threads, which start in /. The current implementation of pivot_root() takes care of the kernel threads for us; it is equivalent to setting the working directories of kernel threads and any other process to new_root before the essential part of pivot_root().

Except, the current implementation of pivot_root() only changes the working directory of a process if the old working directory was /. So it's actually quite easy to see the difference this makes:

$ unshare -rm
# cd /tmp    # work in a subdir instead of '/', and pivot_root() will not change it
# /bin/pwd
/tmp
# mount --bind /new-root /new-root
# pivot_root /new-root /new-root/mnt
# /bin/pwd
/mnt/tmp    # see below: if pivot_root had not updated our current chroot, this would still show /tmp

vs.

$ unshare -rm
# cd /
# /bin/pwd
/
# ls -lid .
2 dr-xr-xr-x. 19 nfsnobody nfsnobody 4096 Jun 13 01:17 .
# ls -lid /new-root
6424395 dr-xr-xr-x. 20 nfsnobody nfsnobody 4096 May 10 12:53 /new-root
# mount --bind /new-root /new-root
# pivot_root /new-root /new-root/mnt
# /bin/pwd
/
# ls -lid .
6424395 dr-xr-xr-x. 20 nobody nobody 4096 May 10 12:53 .
# ls -lid /
6424395 dr-xr-xr-x. 20 nobody nobody 4096 May 10 12:53 /
# ls -lid /mnt
2 dr-xr-xr-x. 19 nobody nobody 4096 Jun 13 01:17 /mnt

Now that I understand what's happening with the working directory, I find it easier to understand what's happening with chroot(). The current chroot of the process which calls pivot_root() may be a reference to the original root filesystem, just as its current working directory may be.

Note, if you do chdir()+pivot_root() but forget to chroot(), your current directory will be outside your current chroot. When your current directory is outside your current chroot, things get quite confusing. You probably don't want to run your program in this state.

# cd /
# python
>>> import os
>>> os.chroot("/newroot")
>>> os.system("/bin/pwd")
(unreachable)/
0
>>> os.getcwd()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OSError: [Errno 2] No such file or directory
>>> os.system("ls -l ./proc/self/cwd")
lrwxrwxrwx. 1 root root 0 Jun 17 13:46 ./proc/self/cwd -> /
0
>>> os.system("ls -lid ./proc/self/cwd/")
2 dr-xr-xr-x. 19 root root 4096 Jun 13 01:17 ./proc/self/cwd/
0
>>> os.system("ls -lid /")
6424395 dr-xr-xr-x. 20 root root 4096 May 10 12:53 /
0

POSIX does not specify the result of pwd or getcwd() in this situation :). POSIX gives no warning that you might get a "No such file or directory" (ENOENT) error from getcwd(). Linux manpages point out this error as being possible if the working directory was unlinked (e.g. with rm). I think this is a very good parallel.
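A program that wants to detect this state before relying on its working directory could guard against it along these lines. This is a minimal sketch: cwd_is_resolvable is a made-up helper name, and it assumes only the behaviour shown in the transcript above, namely that os.getcwd() raises an ENOENT error when the working directory cannot be expressed as a path.

```python
import os


def cwd_is_resolvable() -> bool:
    """Return False if the current working directory has no resolvable path.

    That is the case both when the directory lies outside the current
    chroot and when it has been unlinked; this check cannot tell the
    two situations apart.
    """
    try:
        os.getcwd()
    except FileNotFoundError:  # ENOENT, as in the transcript above
        return False
    return True


if __name__ == "__main__":
    if cwd_is_resolvable():
        print("working directory:", os.getcwd())
    else:
        print("working directory is unreachable or has been removed")
```

Calling os.chdir("/") after chroot() is the usual way to get back into a well-defined state.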

Linux – file that associates a thread to its network namespace

There is a file that associates a thread to its network namespace:

/proc/[PID]/task/[TID]/ns/net

where TID is the thread ID. This solved my issue.
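As an illustration, the path can be built and read from Python. The function names here are hypothetical; threading.get_native_id() (Python 3.8+) is assumed to return the kernel thread ID that appears under /proc/[PID]/task/.

```python
import os
import threading


def thread_ns_path(pid: int, tid: int, ns: str = "net") -> str:
    """Build the /proc path of a namespace link for one thread of a process."""
    return f"/proc/{pid}/task/{tid}/ns/{ns}"


def current_thread_netns() -> str:
    """Return the link target (e.g. "net:[...]") for the calling thread."""
    return os.readlink(thread_ns_path(os.getpid(), threading.get_native_id()))


if __name__ == "__main__":
    try:
        print(current_thread_netns())
    except OSError as err:
        # e.g. not running on Linux, or /proc is not mounted
        print(f"could not read the thread's namespace link: {err}")
```

Threads of one process normally report the same value; it differs only for a thread that has moved itself into another namespace, for example with unshare() or setns().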
