Why does `systemd-nspawn -n` network namespace not show in `ip netns list`


tl;dr Linux has namespaces, in particular network namespaces. The namespace supposedly created by the -n flag of systemd-nspawn does not show up in ip netns list (neither on the host nor inside the supposedly created namespace). Is either systemd-nspawn or ip netns not actually dealing with Linux namespaces (which I understood to be this: https://lwn.net/Articles/531114/#series_index)?

longer story:
I use the following command to run a "light-weight container" of Arch Linux from within my Arch Linux:

systemd-nspawn -nbUD /mntpointArchLinuxSysFs

The data at /mntpointArchLinuxSysFs has been bootstrapped and boots fine. man systemd-nspawn tells me that the -n flag means:

-n, --network-veth

Create a virtual Ethernet link ("veth") between host and container. The host side of the Ethernet link will be available as a
network interface named after the container's name (as specified with
--machine=), prefixed with "ve-". The container side of the Ethernet link will be named "host0". The --network-veth option implies
--private-network.

In turn, the implied --private-network is explained thus

--private-network
Disconnect networking of the container from the host. This makes all network interfaces unavailable in the container, with the
exception of the loopback device and those specified with
--network-interface= and configured with --network-veth. If this option is specified, the CAP_NET_ADMIN capability will be added to the
set of capabilities the container retains. The latter may be disabled
by using --drop-capability=. If this option is not specified (or
implied by one of the options listed below), the container will have
full access to the host network.

This seems to be achieved via Linux namespaces, in particular network namespaces: the started processes (i.e. the container's init at /mntpointArchLinuxSysFs/bin/init and all its child processes) are in a different network namespace, i.e. they are on a --private-network and have only the veth (virtual Ethernet pair) as a remaining connection to the host namespace/system.

Using lsns shows that indeed systemd-nspawn created a namespace

root@host$> lsns | grep net
4026531992 net       183     1 root     /sbin/init
4026532332 net         1   824 rtkit    /usr/lib/rtkit-daemon
4026532406 net         7  4697 vu-mnt-0 /usr/lib/systemd/systemd

However, ip netns list refuses to "play along":

root@host$> ip netns list
root@host$>

If, for the sake of understanding, I then create a dummy namespace via ip netns like this

root@host$> ip netns add dummy_netns
root@host$> ip netns list
dummy_netns
root@host$>

A network namespace is displayed; ironically, however, it is missing from the lsns output.

In conclusion, it is unclear to me how the term "network namespace" is used by systemd-nspawn and ip netns, as my tests suggest they might not really be the same thing. Maybe the term is ambiguous?

update

This part of the systemd-nspawn man page suggests, IMHO, that both iproute and systemd-nspawn do refer to the same thing when they speak of network namespaces:

--network-namespace-path=
Takes the path to a file representing a kernel network namespace
that the container shall run in. The specified path should refer to
a (possibly bind-mounted) network namespace file, as exposed by the
kernel below /proc/$PID/ns/net. This makes the container enter the
given network namespace. One of the typical use cases is to give a
network namespace under /run/netns created by ip-netns(8), for
example, --network-namespace-path=/run/netns/foo. Note that this
option cannot be used together with other network-related options,
such as --private-network or --network-interface=.

However, the last part, stating that it cannot be used with the --private-network option, again seems to suggest some sort of distinction. What is going on here?

Best Answer

Both systemd-nspawn and ip-netns use namespaces, specifically network namespaces. The difference, as explained in the ip-netns manual, is that ip-netns deals with named network namespaces.

By convention a named network namespace is an object at /var/run/netns/NAME that can be opened. The file descriptor resulting from opening /var/run/netns/NAME refers to the specified network namespace. Holding that file descriptor open keeps the network namespace alive.

Anonymous network namespaces

The namespaces(7) manual explains that in general, a namespace is an abstraction associated with the lifetime of the processes in it:

Each process has a /proc/[pid]/ns/ subdirectory containing one entry for each namespace that supports being manipulated by setns(2) ... Opening one of the files in this directory (or a file that is bind mounted to one of these files) returns a file handle for the corresponding namespace of the process specified by pid. As long as this file descriptor remains open, the namespace will remain alive, even if all processes in the namespace terminate.

On my system, the most recently launched systemd process (pgrep -f -n systemd\$) is the init process of a container started using the default systemd-nspawn@.service template unit, which enables --network-veth and thus --private-network (it also adds --private-users). This command shows that the container's anonymous network namespace is different to the root network namespace, and owned by the container's root user:

# ls -l /proc/1/ns/net /proc/$(pgrep -f -n systemd\$)/ns/net
lrwxrwxrwx 0 root           /proc/1/ns/net -> net:[4026532008]
lrwxrwxrwx 0 vu-container-0 /proc/700/ns/net -> net:[4026532656]
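The same comparison can be scripted. Below is a minimal sketch, assuming only that /proc/PID/ns/net is a symlink whose target looks like net:[INODE], as in the listing above. The function name same_netns and its optional third argument (an alternative proc root, handy for dry runs) are made up for illustration:

```shell
# Return 0 if two PIDs are in the same network namespace, 1 if they are
# not, and 2 if either ns link cannot be read.
# Usage: same_netns PID1 PID2 [PROC_ROOT]
same_netns() {
    proc_root=${3:-/proc}
    # readlink prints the link target, e.g. net:[4026532656]
    a=$(readlink "$proc_root/$1/ns/net") || return 2
    b=$(readlink "$proc_root/$2/ns/net") || return 2
    [ "$a" = "$b" ]
}

# Example (reading another user's process usually requires root):
#   same_netns 1 700 && echo "same namespace" || echo "different namespace"
```

For the container above, same_netns 1 700 would be expected to report a difference, since the two link targets differ.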

This anonymous network namespace disappears when the container is terminated. However, if I want to make it a named network namespace that can be managed with ip-netns during the life of the container, I can bind mount it under /run/netns:

# mount --bind /proc/$(pgrep -f -n systemd\$)/ns/net /run/netns/container
# ip netns list
container (id: 1)
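To check whether a process's namespace already has a name, one option is to compare device and inode numbers between its ns file and the entries under /run/netns. This is only a sketch: it assumes GNU stat (for -c), and the function name netns_names_for and its optional directory argument are hypothetical helpers, not part of iproute2:

```shell
# Print every name under a netns directory (default: /run/netns) that
# refers to the same object as the given ns file, e.g. /proc/700/ns/net.
# Usage: netns_names_for NS_FILE [NETNS_DIR]
netns_names_for() {
    ns_file=$1
    netns_dir=${2:-/run/netns}
    # -L follows the /proc/PID/ns/net symlink to the namespace object itself
    target=$(stat -L -c '%d:%i' "$ns_file") || return 1
    for entry in "$netns_dir"/*; do
        [ -e "$entry" ] || continue   # directory empty or missing
        if [ "$(stat -L -c '%d:%i' "$entry")" = "$target" ]; then
            basename "$entry"
        fi
    done
}

# Example:
#   netns_names_for /proc/700/ns/net
```

With the bind mount above in place, this would be expected to print container. Before the bind mount it should print nothing, which matches the empty ip netns list from the question.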

Creating named network namespaces with systemd

You've also pointed out systemd-nspawn's --network-namespace-path option, which is equivalent to the NetworkNamespacePath= setting documented in systemd.unit(5). It can only assign containers and units to a network namespace that already exists. Because a process can only be in one namespace, --network-namespace-path is incompatible with options like --private-network which create an anonymous network namespace and isolate the container in it.

It seems that systemd will get a Namespace= setting in some future release of systemd after v246 (v245 was released in March 2020). This will allow units to create their own named network namespaces, rather than being assigned to an existing namespace with NetworkNamespacePath= or creating a new anonymous namespace with PrivateNetwork=. When this feature is merged, it would make sense for Namespace=%i to be added to the systemd-nspawn@.service template, so that containers' network namespaces are named by default.

DNS resolution when using systemd-networkd as DHCP client

Using systemd-networkd as a DHCP client might accidentally work on its own if you have a left-over /etc/resolv.conf from a previous container boot. But you can't rely on this working in general. It's really designed to be run together with systemd-resolved.service.

In turn, systemd-resolved is intended to be used with nss-resolve. However this is not essential AIUI.

Network Namespaces – How to Find the Network Namespace of a veth Peer ifindex

Here's the method I followed to understand this problem. The available tools appear usable (with some convolution) for the namespace part, and (UPDATED) /sys/ makes it easy to get the peer's index. It's quite long, so bear with me. It comes in two parts (not in the logical order, but covering the namespace first helps explain the index naming), using common tools rather than any custom program:

Network namespace
Interface index

Network namespace

This information is available from the link-netnsid property in the output of ip link and can be matched with the id in the output of ip netns. It's possible to "associate" a container's network namespace with ip netns, thus using ip netns as a specialized tool. Of course, writing a dedicated program for this would be better (some information about the syscalls is at the end of each part).

About the nsid's description, here's what man ip netns tells (emphasis mine):

ip netns set NAME NETNSID - assign an id to a peer network namespace

This command assigns a id to a peer network namespace. This id is valid only in the current network namespace. This id will be used by the kernel in some netlink messages. If no id is assigned when the kernel needs it, it will be automatically assigned by the kernel. Once it is assigned, it's not possible to change it.

While creating a namespace with ip netns won't immediately create a netnsid, one will be created (in the current namespace, probably the "host") whenever one half of a veth pair is moved to another namespace. So it's always set for a typical container.

Here's an example using an LXC container:

# lxc-start -n stretch-amd64

A new veth link veth9RPX4M appeared (this can be tracked with ip monitor link). Here is the detailed information:

# ip -o link show veth9RPX4M
44: veth9RPX4M@if43: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue master lxcbr0 state LOWERLAYERDOWN mode DEFAULT group default qlen 1000
link/ether fe:25:13:8a:00:f8 brd ff:ff:ff:ff:ff:ff link-netnsid 4

This link has the property link-netnsid 4, which says the other side is in the network namespace with nsid 4. How can we verify it's the LXC container? The easiest way to get this information is to make ip netns believe it created the container's network namespace, by performing the operations hinted at in the manpage.

# mkdir -p /var/run/netns
# touch /var/run/netns/stretch-amd64
# mount -o bind /proc/$(lxc-info -H -p -n stretch-amd64)/ns/net /var/run/netns/stretch-amd64

UPDATE3: I hadn't understood that recovering the global name was a problem. Here it is:

# ls -l /proc/$(lxc-info -H -p -n stretch-amd64)/ns/net
lrwxrwxrwx. 1 root root 0 mai    5 20:40 /proc/17855/ns/net -> net:[4026532831]

# stat -c %i /var/run/netns/stretch-amd64 
4026532831

Now the information is retrieved with:

# ip netns | grep stretch-amd64
stretch-amd64 (id: 4)

It confirms the veth's peer is in the network namespace with the same nsid = 4 = link-netnsid.
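The matching done by eye above can be sketched as two small text filters. They only parse the output formats shown in this answer, so treat them as an illustration rather than a robust tool. The names link_netnsid and netns_name_by_id are made up:

```shell
# Read `ip -o link show DEV` output on stdin and print the value that
# follows "link-netnsid" (nothing is printed if the property is absent).
link_netnsid() {
    sed -n 's/.*link-netnsid \([0-9][0-9]*\).*/\1/p'
}

# Read `ip netns list` output on stdin and print the name whose
# "(id: N)" suffix equals the id given as $1.
netns_name_by_id() {
    sed -n "s/^\([^ ]*\) (id: $1)\$/\1/p"
}

# Example (needs the veth device and the bind mount described above):
#   nsid=$(ip -o link show veth9RPX4M | link_netnsid)
#   ip netns list | netns_name_by_id "$nsid"
```

Note that the nsid is only meaningful in the namespace where both commands are run, so both must be run from the same one.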

The container/ip netns "association" can be removed (without removing the namespace as long as the container is running):

# ip netns del stretch-amd64

Note: the nsid naming is per network namespace, usually starts with 0 for the first container, and the lowest value available is recycled with new namespaces.

About using syscalls, here is some information guessed from strace:

for the link part: it requires an AF_NETLINK socket (opened with socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE)), requesting (sendmsg()) the link's information with message type RTM_GETLINK and retrieving (recvmsg()) the reply with message type RTM_NEWLINK.
for the netns nsid part: same method, the query message is type RTM_GETNSID with reply type RTM_NEWNSID.

I think there are slightly higher-level libraries to handle this, such as libnl. Anyway, it's a topic for SO.

Interface index

Now it will be easier to follow why the indexes appear to behave randomly. Let's do an experiment:

First enter a new net namespace to have a clean (index) slate:

# ip netns add test
# ip netns exec test bash
# ip netns id
test
# ip -o link 
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000\    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

As OP noted, lo begins with index 1.

Let's add 5 net namespaces, create veth pairs, then put a veth end on them:

# for i in {0..4}; do ip netns add test$i; ip link add type veth peer netns test$i ; done
# ip -o link|sed 's/^/    /'
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000\    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: veth0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000\    link/ether e2:83:4f:60:5a:30 brd ff:ff:ff:ff:ff:ff link-netnsid 0
3: veth1@if2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000\    link/ether 22:a7:75:8e:3c:95 brd ff:ff:ff:ff:ff:ff link-netnsid 1
4: veth2@if2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000\    link/ether 72:94:6e:e4:2c:fc brd ff:ff:ff:ff:ff:ff link-netnsid 2
5: veth3@if2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000\    link/ether ee:b5:96:63:62:de brd ff:ff:ff:ff:ff:ff link-netnsid 3
6: veth4@if2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000\    link/ether e2:7d:e2:9a:3f:6d brd ff:ff:ff:ff:ff:ff link-netnsid 4

Since it displays @if2 for each of them, it becomes quite clear that this is the interface index in the peer's namespace, and that indexes are not global but per namespace. When an actual interface name is displayed, it refers to an interface in the same namespace (be it the veth's peer, a bridge, a bond ...). So why doesn't veth0 have a peer displayed? I believe it's an ip link bug when the peer's index is the same as the interface's own. Moving the peer link twice "solves" it here, because it forces an index change. I'm also sure that ip link sometimes gets confused in other ways and, instead of displaying @ifXX, displays an interface in the current namespace with the same index.

# ip -n test0 link set veth0 name veth0b netns test
# ip link set veth0b netns test0
# ip -o link
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000\    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: veth0@if7: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000\    link/ether e2:83:4f:60:5a:30 brd ff:ff:ff:ff:ff:ff link-netnsid 0
3: veth1@if2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000\    link/ether 22:a7:75:8e:3c:95 brd ff:ff:ff:ff:ff:ff link-netnsid 1
4: veth2@if2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000\    link/ether 72:94:6e:e4:2c:fc brd ff:ff:ff:ff:ff:ff link-netnsid 2
5: veth3@if2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000\    link/ether ee:b5:96:63:62:de brd ff:ff:ff:ff:ff:ff link-netnsid 3
6: veth4@if2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000\    link/ether e2:7d:e2:9a:3f:6d brd ff:ff:ff:ff:ff:ff link-netnsid 4

UPDATE: rereading the information in OP's question, the peer's index (but not the nsid) is easily and unambiguously available with cat /sys/class/net/<interface>/iflink.
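As a small illustration of the sysfs lookup, the sketch below prints an interface's own index next to its iflink value. The function name link_indexes and its optional second argument (an alternative sysfs directory, useful for trying it without real interfaces) are assumptions made for this example:

```shell
# Print "IFACE ifindex=N iflink=M" from the sysfs attributes of IFACE.
# For a veth, iflink is the index of the peer inside the peer's namespace.
# Usage: link_indexes IFACE [SYSFS_NET_DIR]
link_indexes() {
    sysfs_net=${2:-/sys/class/net}
    ifindex=$(cat "$sysfs_net/$1/ifindex") || return 1
    iflink=$(cat "$sysfs_net/$1/iflink") || return 1
    printf '%s ifindex=%s iflink=%s\n' "$1" "$ifindex" "$iflink"
}

# Example:
#   link_indexes veth0
```

In the experiment above, veth0 after the double move would be expected to show ifindex=2 iflink=7.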

UPDATE2:

All those iflink 2 may appear ambiguous, but what is unique is the combination of nsid and iflink, not iflink alone. For the above example that is:

interface    nsid:iflink
veth0        0:7
veth1        1:2
veth2        2:2
veth3        3:2
veth4        4:2

In this namespace (namely namespace test) there will never be two identical nsid:iflink pairs.
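A table like the one above can be derived from ip -o link output with a single sed expression. This is a rough sketch that only handles lines of the form shown in this answer (NAME@ifN plus a link-netnsid property). The function name nsid_iflink_pairs is invented for the example:

```shell
# Read `ip -o link` output on stdin and print "NAME NSID:IFLINK" for every
# interface shown as NAME@ifN that also carries a link-netnsid property.
# Lines without both pieces (e.g. lo) are skipped.
nsid_iflink_pairs() {
    sed -n 's/^[0-9]*: \([^@:]*\)@if\([0-9]*\):.* link-netnsid \([0-9]*\).*$/\1 \3:\2/p'
}

# Example:
#   ip -o link | nsid_iflink_pairs
```

An interface whose peer is not displayed as @ifN (like veth0 before the double move) would be missing from the output, so the sysfs iflink value is the more reliable source for the index.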

Looking at the opposite information from each peer namespace:

namespace    interface    nsid:iflink
test0        veth0        0:2
test1        veth0        0:3
test2        veth0        0:4
test3        veth0        0:5
test4        veth0        0:6

But bear in mind that each 0: here is a separate 0 belonging to its own namespace, which happens to map to the same peer namespace (namely namespace test, not even the host). They can't be compared directly because they're tied to their namespace. So the complete, comparable and unique information should be:

test0:0:2
test1:0:3
test2:0:4
test3:0:5
test4:0:6

Once it's confirmed that "test0:0" == "test1:0" etc. (true in this example, all map to the net namespace called test by ip netns) then they can be really compared.

About syscalls, still looking at strace results, the information is retrieved as above with RTM_GETLINK. Now all the information should be available:

local: interface index with SIOCGIFINDEX / if_nametoindex
peer: both nsid and interface index with RTM_GETLINK.

All this should probably be used with libnl.