Linux – Understanding how mount namespaces work in Linux

Tags: filesystems, linux, mount, namespace, virtual-file-system

I am reading about mount namespaces and see:

in a mount namespace you can mount and unmount filesystems without it affecting the host filesystem. So you can have a totally different set of devices mounted (usually less).

I am trying to understand Linux namespaces, LXC and the like, but I don't quite understand what the statement above means.

What I'm trying to understand is how a container (1) can have files like this:

/foo/a.txt
/foo/bar/b.txt

And another container (2) can have files like this:

/foo/a.txt
/foo/x.txt
/foo/bar/b.txt
/foo/bar/y.txt

Where /foo/a.txt and /foo/bar/b.txt on containers (1) and (2) are the same path, but perhaps they have different content:

# container (1)
cat /foo/a.txt #=> Hello from (1)

# container (2)
cat /foo/a.txt #=> Hello from (2)

This would mean that the files on the physical system (which I don't know anything about) are stored in one way, perhaps scattered all around. But then there is a centralized database of "virtual" files in the operating system, like this:

db:
  container1:
    foo:
      a.txt: Hello from a from (1)
      bar:
        b.txt: Hello from b from (1)
  container2:
    foo:
      a.txt: Hello from a from (2)
      x.txt: Hello from x from (2)
      bar:
        b.txt: Hello from b from (2)
        y.txt: Hello from y from (2)

Then there is another OS database for the physical files which might look like this:

drive1:
  dir1:
    foo:
      a.txt
      bar:
        b.txt
  dir2:
    foo:
      a.txt
      x.txt
      bar:
        b.txt
        y.txt

So when you create a file in the container, you are actually creating 2 new records:

1 for the drive-level physical files map
1 for the container-level virtual files map

This is how I imagine it to work. This is how I can see there being a way to (1) present the user (in an LXC container or cgroup (which I don't know much about)) with what feels like a complete "file system", in which they can (2) create their own fully-customizable directory structure (that may have the same named files/directories/paths as a completely different virtual file system), such that (3) the files from multiple different virtual file systems / containers don't override each other.

Wondering if this is how it works, or if not, how it actually works (or an outline of how it works).

Best Answer

Mount namespaces differ from one another in the arrangement of their mounted filesystems.

This is very flexible, because mounts can be bind mounts of a sub-directory within a filesystem.

# unshare --mount  # run a shell in a new mount namespace

# mount --bind /usr/bin/ /mnt/
# ls /mnt/cp
/mnt/cp

# exit  # exit the shell, and hence the mount namespace

# ls /mnt/cp
ls: cannot access '/mnt/cp': No such file or directory

You can list your current set of mounts with the findmnt command.
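The kernel also exposes each process's mount table under /proc, which gives a way to compare the view from two namespaces without findmnt. The following is a minimal sketch; the function names mount_ns_of and mount_points_of are made up for illustration, and reading another user's /proc/PID entries may need root.

```shell
# Print the mount namespace identifier of a process, e.g. "mnt:[4026531840]".
# Two processes share a mount namespace exactly when these strings match.
mount_ns_of() {
    readlink "/proc/${1:-self}/ns/mnt"
}

# Print the mount points visible to a process, one per line.
# This is field 2 of /proc/PID/mounts; special characters stay
# octal-escaped there (e.g. \040 for a space).
mount_points_of() {
    awk '{ print $2 }' "/proc/${1:-self}/mounts"
}
```

Running mount_points_of $$ before and after the mount --bind in the example above should show /mnt appearing only inside the new namespace, while mount_ns_of $$ prints a different identifier inside the unshare shell than outside it.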

In a full container, the root mount is replaced and you work with an entirely separate tree of mounts. This involves some extra details, such as the pivot_root() system call. You probably don't need to know exactly how to do that. Some details are available here: How to perform chroot with Linux namespaces?

How to use pivot_root

Immediately after entering the mount namespace, you also need mount --make-rslave / or equivalent. Otherwise, all your mount changes propagate to the mounts in the original namespace, including the pivot_root. You don't want that :).

If you used the unshare --mount command, note it is documented to apply mount --make-rprivate by default. AFAICS this is a bad default and you don't want this in production code. E.g. at this point, it would stop eject from working on a mounted DVD or USB in the host namespace. The DVD or USB would remain mounted inside the private mount tree, and the kernel would not let you eject the DVD.
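To see which propagation mode a mount actually ended up with, you can read the optional fields of /proc/self/mountinfo: shared:N marks a shared mount, master:N marks a slave, and no optional field means private. A rough sketch follows; propagation_of is a hypothetical helper name, and the second argument exists only so the parser can be tried on a saved copy of mountinfo.

```shell
# Print the propagation tags of a mount point, or "private" if it has none.
#   $1: mount point, as it appears in field 5 of mountinfo
#   $2: mountinfo file to read (defaults to this process's own view)
propagation_of() {
    awk -v target="$1" '
        $5 == target {
            flags = ""
            # Optional fields run from field 7 up to the "-" separator.
            for (i = 7; i <= NF && $i != "-"; i++)
                flags = flags (flags ? "," : "") $i
            # If several mounts are stacked on the target, the last line wins.
            result = (flags ? flags : "private")
        }
        END { if (result) print result }
    ' "${2:-/proc/self/mountinfo}"
}
```

After mount --make-rslave /, a root mount that used to be shared on the host would be expected to report a master:N tag here instead of shared:N.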

Once propagation is configured, you can mount whatever the new root needs, e.g. the /proc directory you will be using, the same way you would for chroot.

Unlike when you use chroot, pivot_root requires that your new root filesystem is a mount point. If it is not one already, you can satisfy this by simply applying a bind mount: mount --rbind new_root new_root.

Use pivot_root, and then umount the old root filesystem with the -l / MNT_DETACH option. (You don't need umount -R, which can take longer.)

Technically, using pivot_root generally needs to involve using chroot as well; it's not "either-or".

As per man 2 pivot_root, it's only defined as swapping the root of the mount namespace. It isn't defined to change which physical directory the process root is pointing to. Or the current working directory (/proc/self/cwd). It happens that it does do so, but this is a hack to handle kernel threads. The manpage says that could change in future.

Usually you want this sequence:

chdir(new_root);            // cd new_root
pivot_root(".", put_old);   // pivot_root . put_old
chroot(".");                // chroot .

The position of the chroot in this sequence is yet another subtle detail. Although the point of pivot_root is to rearrange the mount namespace, the kernel code seems to find the root filesystem to move by looking at the per-process root, which is what chroot sets.
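Put together, the steps described in this section could look roughly like the shell sketch below. It is an illustration under assumptions, not a hardened implementation: enter_new_root, run, DRY_RUN and the old_root directory name are invented for this example, the function is meant to be called as root from inside a fresh mount namespace (for instance a shell started by unshare --mount), and the new root is assumed to contain sh and umount.

```shell
# Execute a command, or only print it when DRY_RUN=1 (handy for reviewing).
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Switch the current mount namespace over to a new root directory.
enter_new_root() {
    new_root=${1:?usage: enter_new_root NEW_ROOT}
    put_old=old_root    # hypothetical name; created inside new_root

    run mount --make-rslave /                    # keep our changes out of the host
    run mount --rbind "$new_root" "$new_root"    # make sure new_root is a mount point
    run cd "$new_root"
    run mkdir -p "$put_old"
    run pivot_root . "$put_old"                  # swap the namespace's root mount
    # chroot after pivot_root, then lazily detach the old root from inside.
    run chroot . sh -c "umount -l /$put_old && exec sh"
}
```

Calling DRY_RUN=1 enter_new_root /some/rootfs prints the command sequence without touching any mounts, which is a safe way to check the ordering before running it for real.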

Why to use pivot_root

In principle, it makes sense to use pivot_root for security and isolation. I like to think about the theory of capability-based security. You pass in a list of the specific resources needed, and the process can access no other resources. In this case we are talking about the filesystems passed in to a mount namespace. This idea applies generally to the Linux "namespaces" feature, though I'm probably not expressing it very well.

chroot only sets the process root, but the process still refers to the full mount namespace. If a process retains the privilege to perform chroot, then it can traverse back up the filesystem namespace. As detailed in man 2 chroot, "the superuser can escape from a 'chroot jail' by...".

Another thought-provoking way to undo chroot is nsenter --mount=/proc/self/ns/mnt. This is perhaps a stronger argument for the principle. nsenter / setns() necessarily re-loads the process root, from the root of the mount namespace... although the fact that this works when the two refer to different physical directories, might be considered a kernel bug. (Technical note: there could be multiple filesystems mounted on top of each other at the root; setns() uses the top, most recently mounted one).

This illustrates one advantage of combining a mount namespace with a "PID namespace". Being inside a PID namespace would prevent you from entering the mount namespace of an unconfined process. It also prevents you entering the root of an unconfined process (/proc/$PID/root). And of course a PID namespace also prevents you from killing any process which is outside it :-).

Linux Network Interfaces – How to Find All Configured Interfaces Including Containers

An interface belongs, at any given time, to exactly one network namespace. The init (initial) network namespace has no special ability over other network namespaces, except that it inherits the physical interfaces of destroyed network namespaces: it cannot see their interfaces directly. As long as you are still in init's pid and mount namespaces, you can still find the network namespaces using various pieces of information available from /proc, and then display their interfaces by entering those network namespaces.

I'll provide examples in shell.

enumerate the network namespaces

For this you have to know how namespaces stay alive: a namespace exists as long as some resource keeps it up. A resource here can be a process (actually a process's thread), a mount point or an open file descriptor (fd). Those resources are all referenced in /proc/ and point to an abstract pseudo-file in the nsfs pseudo-filesystem, which enumerates all namespaces. This file's only meaningful information is its inode, which represents the network namespace, but the inode can't be used on its own: you need the file. That's why we can't keep only the inode value (given by stat -c %i /proc/some/file) for later: we'll keep the inode to be able to remove duplicates, and a filename to still have a usable reference for nsenter.
- process (actually thread)
  
  The most common case: for usual containers. Each thread's network namespace can be known via the reference /proc/pid/ns/net: just stat them and enumerate all unique namespaces. The 2>/dev/null is to hide when stat can't find ephemeral processes anymore.
```
# Print "inode reference" for the network namespace of every process.
find /proc/ -mindepth 1 -maxdepth 1 -name '[1-9]*' | while read -r procpid; do
        stat -L -c '%20i %n' "$procpid/ns/net"
done 2>/dev/null
```
  This can be done faster with the specialized lsns command which deals with namespaces, but appears to handle only processes (not mount points nor open fd as seen later):
```
lsns -n -u -t net -o NS,PATH
```
  (which would have to be reformatted for later as lsns -n -u -t net -o NS,PATH | while read inode path; do printf '%20u %s\n' $inode "$path"; done)
- mount point
  
  Those are mostly used by the ip netns add command, which creates permanent network namespaces by mounting them. This keeps them from disappearing when no process or fd holds them, and makes it possible, for example, to run a router, firewall or bridge in a network namespace without any attached process.
  
  Mounted namespaces (handling of mount and perhaps pid namespaces is probably more complex, but we're only interested in network namespaces anyway) appear like any other mount point in /proc/mounts, with the filesystem type nsfs. There's no easy way in shell to distinguish a network namespace from another type of namespace, but since two pseudo-files from the same filesystem (here nsfs) won't share the same inode, just select them all and ignore the errors later, in the interface step, when a non-network namespace reference is used as a network namespace. The loop below does not correctly handle mount points containing special characters, including spaces, because they are escaped in /proc/mounts's output (it would be easier in any other language), so I won't bother using null-terminated lines either.
```
awk '$3 == "nsfs" { print $2 }' /proc/mounts | while read -r mount; do
        stat -c '%20i %n' "$mount"
done
```
- open file descriptor
  
  Those are probably even rarer than mount points, except temporarily at namespace creation, but they might be held and used by some specialized application handling multiple namespaces, possibly including some containerization technology.
  
  I couldn't devise a better method than searching every fd available in every /proc/pid/fd/, using stat to verify that it points to an nsfs namespace and, again, not caring for now whether it's really a network namespace. I'm sure there's a more optimized loop, but this one at least won't wander everywhere or assume any maximum process limit.
```
find /proc/ -mindepth 1 -maxdepth 1 -name '[1-9]*' | while read -r procpid; do
        find "$procpid/fd" -mindepth 1 | while read -r procfd; do
                # stat -f reports the type of the filesystem the fd points into
                if [ "$(stat -f -c %T "$procfd")" = nsfs ]; then
                        stat -L -c '%20i %n' "$procfd"
                fi
        done
done 2>/dev/null
```
Now remove all duplicate network namespace references from the previous results, e.g. by applying this filter to the combined output of the three previous commands (duplicates come especially from the open file descriptor part):
```
sort -k 1n | uniq -w 20
```
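The filter relies on every line starting with the inode right-aligned in a 20-column field, which is what the '%20i %n' stat format and the lsns reformatting loop produce. As a small sketch, both steps can be wrapped in one function that accepts plain "inode reference" lines; dedupe_ns_refs is a hypothetical name, and uniq -w is a GNU extension.

```shell
# Read "INODE REFERENCE" lines on stdin, pad the inode to 20 columns,
# then keep a single reference per inode.
dedupe_ns_refs() {
    while read -r inode reference; do
        printf '%20u %s\n' "$inode" "$reference"
    done | sort -k 1n | uniq -w 20
}
```

For example, feeding it /proc/1/ns/net and /proc/2/ns/net with the same inode leaves just one of the two lines; which reference survives does not matter, since both lead to the same namespace.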
in each namespace enumerate the interfaces

Now that we have references to all the existing network namespaces (and also to some non-network namespaces, which we'll just ignore), simply enter each of them using its reference and display the interfaces.

Take the previous commands' output as input to this loop to enumerate interfaces (and as per OP's question, choose to display their addresses), while ignoring errors caused by non-network namespaces as previously explained:
```
while read -r inode reference; do
    if nsenter --net="$reference" ip -br address show 2>/dev/null; then
            printf 'end of network %d\n\n' $inode
    fi
done
```

The init network's inode can be printed with pid 1 as reference:

echo -n 'INIT NETWORK: ' ; stat -L -c %i /proc/1/ns/net
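
Since the inode is what identifies a namespace, the same stat call can also tell whether two references lead to the same namespace, for instance whether some reference is the init network namespace. A minimal sketch, with same_namespace as a made-up helper name, and assuming both arguments are nsfs references such as /proc/PID/ns/net or a mounted namespace file:

```shell
# Succeed if both references resolve to the same namespace inode.
# Returns 2 if either reference cannot be read.
same_namespace() {
    a=$(stat -L -c %i "$1") || return 2
    b=$(stat -L -c %i "$2") || return 2
    [ "$a" = "$b" ]
}
```

As root, same_namespace /proc/self/ns/net /proc/1/ns/net would then succeed only when the current shell is still in the init network namespace.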

Example (real but redacted) output of the loop and of the init inode command, on a system with: a running LXC container; an empty "mounted" network namespace created with ip netns add ... and holding an unconnected bridge interface; and a network namespace with another dummy0 interface, kept alive by a process that is not in this network namespace but keeps an open fd on it, created with:

unshare --net sh -c 'ip link add dummy0 type dummy; ip address add dev dummy0 10.11.12.13/24; sleep 3' & sleep 1; sleep 999 < /proc/$!/ns/net &

and a running Firefox which isolates each of its "Web Content" threads in an unconnected network namespace (all those down lo interfaces):

lo               UNKNOWN        127.0.0.1/8 ::1/128 
eth0             UP             192.0.2.2/24 2001:db8:0:1:bc5c:95c7:4ea6:f94f/64 fe80::b4f0:7aff:fe76:76a8/64 
wlan0            DOWN           
dummy0           UNKNOWN        198.51.100.2/24 fe80::108a:83ff:fe05:e0da/64 
lxcbr0           UP             10.0.3.1/24 2001:db8:0:4::1/64 fe80::216:3eff:fe00:0/64 
virbr0           DOWN           192.168.122.1/24 
virbr0-nic       DOWN           
vethSOEPSH@if9   UP             fe80::fc8e:ff:fe85:476f/64 
end of network 4026531992

lo               DOWN           
end of network 4026532418

lo               DOWN           
end of network 4026532518

lo               DOWN           
end of network 4026532618

lo               DOWN           
end of network 4026532718

lo               UNKNOWN        127.0.0.1/8 ::1/128 
eth0@if10        UP             10.0.3.66/24 fe80::216:3eff:fe6a:c1e9/64 
end of network 4026532822

lo               DOWN           
bridge0          UNKNOWN        fe80::b884:44ff:feaf:dca3/64 
end of network 4026532923

lo               DOWN           
dummy0           DOWN           10.11.12.13/24 
end of network 4026533021

INIT NETWORK: 4026531992
