Entering a mount namespace before setting up a chroot lets you avoid cluttering the host namespace with additional mounts, e.g. for /proc. You can use chroot inside a mount namespace as a nice and simple hack.
I think there are advantages to understanding pivot_root, but it has a bit of a learning curve. The documentation does not quite explain everything, although there is a usage example in man 8 pivot_root (for the shell command). man 2 pivot_root (for the system call) might be clearer if it did the same and included an example C program.
How to use pivot_root
Immediately after entering the mount namespace, you also need mount --make-rslave / or equivalent. Otherwise, all your mount changes propagate to the mounts in the original namespace, including the pivot_root. You don't want that :).
If you used the unshare --mount command, note it is documented to apply mount --make-rprivate by default. AFAICS this is a bad default and you don't want this in production code. E.g. at this point, it would stop eject from working on a mounted DVD or USB in the host namespace. The DVD or USB would remain mounted inside the private mount tree, and the kernel would not let you eject the DVD.
Once you've done that, you can mount e.g. the /proc directory you will be using, the same way you would for chroot.
Unlike when you use chroot, pivot_root requires that your new root filesystem is a mount point. If it is not one already, you can satisfy this by simply applying a bind mount: mount --rbind new_root new_root.
Use pivot_root, and then umount the old root filesystem with the -l / MNT_DETACH option. (You don't need umount -R, which can take longer.)
Technically, using pivot_root generally needs to involve using chroot as well; it's not "either-or".
As per man 2 pivot_root, it's only defined as swapping the root of the mount namespace. It isn't defined to change which physical directory the process root is pointing to, or the current working directory (/proc/self/cwd). It happens that it does do so, but this is a hack to handle kernel threads. The manpage says that could change in future.
Usually you want this sequence:
chdir(new_root); // cd new_root
pivot_root(".", put_old); // pivot_root . put_old
chroot("."); // chroot .
The position of the chroot in this sequence is yet another subtle detail. Although the point of pivot_root is to rearrange the mount namespace, the kernel code seems to find the root filesystem to move by looking at the per-process root, which is what chroot sets.
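The steps above can be strung together in one shell function. This is only a sketch under stated assumptions, not a hardened implementation: the function name enter_new_root, the run helper with its DRY_RUN switch, and the oldroot directory name are all made up for illustration. It assumes you are already root inside a new mount namespace, and that the new root contains sh, umount, chroot and an empty proc directory.

```shell
#!/bin/sh
# Sketch only: must run as root inside a fresh mount namespace.

run() {
    # With DRY_RUN set, print the command instead of executing it.
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

enter_new_root() {
    new_root=$1
    # Stop our mount changes from propagating back to the host namespace.
    run mount --make-rslave /
    # pivot_root needs the new root to be a mount point.
    run mount --rbind "$new_root" "$new_root"
    # Mount whatever the new root needs, e.g. /proc.
    run mount -t proc proc "$new_root/proc"
    # Directory that will receive the old root ("oldroot" is an arbitrary name).
    run mkdir -p "$new_root/oldroot"
    run cd "$new_root"
    run pivot_root . oldroot
    # chroot afterwards, then lazily detach the old root from inside the new one.
    run exec chroot . sh -c 'umount -l /oldroot && exec sh'
}
```

One possible way to try it, assuming the function was saved in a hypothetical file enter_new_root.sh: unshare --mount --propagation unchanged sh -c '. ./enter_new_root.sh; enter_new_root /srv/newroot'. Setting DRY_RUN=1 first prints the commands without running them.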
Why to use pivot_root
In principle, it makes sense to use pivot_root for security and isolation. I like to think about the theory of capability-based security. You pass in a list of the specific resources needed, and the process can access no other resources. In this case we are talking about the filesystems passed in to a mount namespace. This idea applies generally to the Linux "namespaces" feature, though I'm probably not expressing it very well.
chroot only sets the process root, but the process still refers to the full mount namespace. If a process retains the privilege to perform chroot, then it can traverse back up the filesystem namespace. As detailed in man 2 chroot, "the superuser can escape from a 'chroot jail' by...".
Another thought-provoking way to undo chroot is nsenter --mount=/proc/self/ns/mnt. This is perhaps a stronger argument for the principle. nsenter / setns() necessarily re-loads the process root from the root of the mount namespace, although the fact that this works when the two refer to different physical directories might be considered a kernel bug. (Technical note: there could be multiple filesystems mounted on top of each other at the root; setns() uses the top, most recently mounted one.)
This illustrates one advantage of combining a mount namespace with a "PID namespace". Being inside a PID namespace would prevent you from entering the mount namespace of an unconfined process. It also prevents you from entering the root of an unconfined process (/proc/$PID/root). And of course a PID namespace also prevents you from killing any process which is outside it :-).
An interface, at a given time, belongs to one and only one network namespace. The init (initial) network namespace, except for inheriting physical interfaces of destroyed network namespaces, has no special ability over other network namespaces: it can't directly see their interfaces. As long as you are still in init's pid and mount namespaces, you can still find the network namespaces by using the various pieces of information available from /proc, and finally display their interfaces by entering those network namespaces.
I'll provide examples in shell.
enumerate the network namespaces
For this you have to know how those namespaces exist: only as long as a resource keeps them up. A resource here can be a process (actually a process's thread), a mount point or an open file descriptor (fd). Those resources are all referenced in /proc/ and point to an abstract pseudo-file in the nsfs pseudo-filesystem enumerating all namespaces. This file's only meaningful information is its inode, representing the network namespace, but the inode can't be manipulated alone; it has to be the file. That's why later we can't keep only the inode value (given by stat -c %i /proc/some/file): we'll keep the inode to be able to remove duplicates, and a filename to still have a usable reference for nsenter later.
process (actually thread)
The most common case, for usual containers. Each thread's network namespace can be known via the reference /proc/pid/ns/net: just stat them and enumerate all unique namespaces. The 2>/dev/null is to hide errors when stat can't find ephemeral processes anymore.
find /proc/ -mindepth 1 -maxdepth 1 -name '[1-9]*' | while read -r procpid; do
    stat -L -c '%20i %n' "$procpid/ns/net"
done 2>/dev/null
This can be done faster with the specialized lsns command, which deals with namespaces but appears to handle only processes (not mount points nor open fds, as seen later):
lsns -n -u -t net -o NS,PATH
(which would have to be reformatted for later as lsns -n -u -t net -o NS,PATH | while read inode path; do printf '%20u %s\n' $inode "$path"; done)
mount point
Those are mostly used by the ip netns add command, which creates permanent network namespaces by mounting them. This keeps them from disappearing when no process or fd resource keeps them up, which also allows, for example, running a router, firewall or bridge in a network namespace without any linked process.
Mounted namespaces (handling of mount and perhaps pid namespaces is probably more complex, but we're only interested in network namespaces anyway) appear like any other mount point in /proc/mounts, with the filesystem type nsfs. There's no easy way in shell to distinguish a network namespace from another type of namespace, but since two pseudo-files from the same filesystem (here nsfs) won't share the same inode, just select them all and ignore errors later, in the interface step, when trying to use a non-network namespace reference as a network namespace. Sorry, below I won't correctly handle mount points with special characters in them, including spaces, because they are already escaped in /proc/mounts's output (it would be easier in any other language), so I won't bother to use null-terminated lines either.
awk '$3 == "nsfs" { print $2 }' /proc/mounts | while read -r mount; do
    stat -c '%20i %n' "$mount"
done
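If mount points containing spaces do matter, the octal escapes used in /proc/mounts (\040 for space, \011 for tab, \012 for newline, \134 for backslash) can be decoded before calling stat. A hedged sketch: decode_mount_path is a made-up helper name, and it relies on printf's %b understanding \0ddd escapes, as POSIX specifies. A mount point containing a newline would still break the one-reference-per-line output.

```shell
decode_mount_path() {
    # Rewrite each \NNN octal escape as \0NNN, the form that printf %b decodes.
    printf '%b\n' "$(printf '%s\n' "$1" | sed 's/\\\([0-7]\{3\}\)/\\0\1/g')"
}

# Same loop as above, but stat the decoded path instead of the escaped one.
awk '$3 == "nsfs" { print $2 }' /proc/mounts | while read -r mount; do
    stat -c '%20i %n' "$(decode_mount_path "$mount")"
done
```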
open file descriptor
Those are probably even rarer than mount points, except temporarily at namespace creation, but might be held and used by some specialized application handling multiple namespaces, including possibly some containerization technology.
I couldn't devise a better method than searching every fd available in every /proc/pid/fd/, using stat to verify that it points to an nsfs namespace and, again, not caring for now whether it's really a network namespace. I'm sure there's a more optimized loop, but this one at least won't wander everywhere nor assume any maximum process limit.
find /proc/ -mindepth 1 -maxdepth 1 -name '[1-9]*' | while read -r procpid; do
    find "$procpid/fd" -mindepth 1 | while read -r procfd; do
        if [ "$(stat -f -c %T "$procfd")" = nsfs ]; then
            stat -L -c '%20i %n' "$procfd"
        fi
    done
done 2>/dev/null
Now remove all duplicate network namespace references from the previous results, e.g. by using this filter on the combined output of the three previous results (especially from the open file descriptor part):
sort -k 1n | uniq -w 20
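To see why the earlier commands pad the inode to 20 characters, here is a small sketch on made-up sample lines (the inode numbers and paths are invented, and dedup_namespaces is just an illustrative wrapper name). uniq -w 20 compares only the first 20 characters, i.e. the padded inode column, so several references to the same namespace collapse into one line. Note that -w is a GNU uniq option.

```shell
dedup_namespaces() {
    # Sort numerically on the inode column, then keep one line per inode.
    sort -k 1n | uniq -w 20
}

# Three references, two of which point at the same (invented) inode:
# the output has one line per distinct inode.
printf '%20u %s\n' \
    4026532418 /proc/1234/ns/net \
    4026531992 /proc/1/ns/net \
    4026532418 /proc/5678/fd/3 | dedup_namespaces
```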
in each namespace enumerate the interfaces
Now that we have the references to all the existing network namespaces (and also some non-network namespaces, which we'll just ignore), simply enter each of them using the reference and display the interfaces.
Take the previous commands' output as input to this loop to enumerate interfaces (and as per OP's question, choose to display their addresses), while ignoring errors caused by non-network namespaces as previously explained:
while read -r inode reference; do
    if nsenter --net="$reference" ip -br address show 2>/dev/null; then
        printf 'end of network %d\n\n' "$inode"
    fi
done
The init network's inode can be printed with pid 1 as reference:
echo -n 'INIT NETWORK: ' ; stat -L -c %i /proc/1/ns/net
Example (real but redacted) output with a running LXC container, an empty "mounted" network namespace created with ip netns add ... having an unconnected bridge interface, a network namespace with another dummy0 interface, kept alive by a process not in this network namespace but keeping an open fd on it, created with:
unshare --net sh -c 'ip link add dummy0 type dummy; ip address add dev dummy0 10.11.12.13/24; sleep 3' & sleep 1; sleep 999 < /proc/$!/ns/net &
and a running Firefox which isolates each of its "Web Content" threads in an unconnected network namespace (all those down lo interfaces):
lo UNKNOWN 127.0.0.1/8 ::1/128
eth0 UP 192.0.2.2/24 2001:db8:0:1:bc5c:95c7:4ea6:f94f/64 fe80::b4f0:7aff:fe76:76a8/64
wlan0 DOWN
dummy0 UNKNOWN 198.51.100.2/24 fe80::108a:83ff:fe05:e0da/64
lxcbr0 UP 10.0.3.1/24 2001:db8:0:4::1/64 fe80::216:3eff:fe00:0/64
virbr0 DOWN 192.168.122.1/24
virbr0-nic DOWN
vethSOEPSH@if9 UP fe80::fc8e:ff:fe85:476f/64
end of network 4026531992
lo DOWN
end of network 4026532418
lo DOWN
end of network 4026532518
lo DOWN
end of network 4026532618
lo DOWN
end of network 4026532718
lo UNKNOWN 127.0.0.1/8 ::1/128
eth0@if10 UP 10.0.3.66/24 fe80::216:3eff:fe6a:c1e9/64
end of network 4026532822
lo DOWN
bridge0 UNKNOWN fe80::b884:44ff:feaf:dca3/64
end of network 4026532923
lo DOWN
dummy0 DOWN 10.11.12.13/24
end of network 4026533021
INIT NETWORK: 4026531992
Best Answer
Mount namespaces differ in the arrangement of mounted filesystems.
This is very flexible, because mounts can be bind mounts of a sub-directory within a filesystem.
You can list your current set of mounts with the findmnt command.
In a full container, the root mount is replaced and you work with an entirely separate tree of mounts. This involves some extra details, such as the pivot_root() system call. You probably don't need to know exactly how to do that. Some details are available here: How to perform chroot with Linux namespaces?
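To check whether two processes see the same arrangement of mounts, you can compare the inodes of their /proc/PID/ns/mnt references. A small sketch: same_mount_ns is a made-up helper name, and inspecting a process you do not own (such as PID 1) may require root.

```shell
same_mount_ns() {
    # Succeeds when both PIDs refer to the same mount namespace inode;
    # returns 2 when one of the references cannot be read.
    a=$(stat -L -c %i "/proc/$1/ns/mnt") || return 2
    b=$(stat -L -c %i "/proc/$2/ns/mnt") || return 2
    [ "$a" = "$b" ]
}

if same_mount_ns "$$" 1 2>/dev/null; then
    echo "this shell shares the mount namespace of PID 1"
else
    echo "different mount namespace (or PID 1 is not readable)"
fi
```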