Iptables and cgroups v2 (netfilter’s xt_cgroup)

Tags: cgroups, iptables, networking

I can't seem to match processes running in cgroup v2 hierarchies with the cgroup module of iptables. I am running Linux 4.13.0 with all required modules:

$ grep CGROUP <kernel_config>
CONFIG_CGROUPS=y
CONFIG_BLK_CGROUP=y
# CONFIG_DEBUG_BLK_CGROUP is not set
CONFIG_CGROUP_WRITEBACK=y
CONFIG_CGROUP_SCHED=y
CONFIG_CGROUP_PIDS=y
# CONFIG_CGROUP_RDMA is not set
CONFIG_CGROUP_FREEZER=y
# CONFIG_CGROUP_HUGETLB is not set
CONFIG_CGROUP_DEVICE=y
CONFIG_CGROUP_CPUACCT=y
CONFIG_CGROUP_PERF=y
# CONFIG_CGROUP_BPF is not set
# CONFIG_CGROUP_DEBUG is not set
CONFIG_SOCK_CGROUP_DATA=y
CONFIG_NETFILTER_XT_MATCH_CGROUP=m
CONFIG_NET_CLS_CGROUP=m
CONFIG_CGROUP_NET_PRIO=y
CONFIG_CGROUP_NET_CLASSID=y

$ lsmod | grep cgroup
xt_cgroup              16384  2
x_tables               36864  7 xt_LOG,xt_cgroup,iptable_mangle,ip_tables,iptable_filter,xt_mark,ipt_MASQUERADE

It's a Debian-based distro with systemd 235, which mounts the following cgroups:

$ mount | grep cgroup
tmpfs on /sys/fs/cgroup type tmpfs (rw,mode=755)
cgroup on /sys/fs/cgroup/unified type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,name=systemd)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)

If I work with cgroup v1 and net_cls, all is fine:

$ cd /sys/fs/cgroup/net_cls,net_prio/
$ mkdir test
$ echo 1 > test/net_cls.classid
$ iptables -A OUTPUT -m cgroup --cgroup 1 -j LOG
$ ping -i 2 google.com &>/dev/null &
$ pgrep ping > test/tasks

I can see the packets in the log. Doing the same with cgroup v2 successfully adds the iptables rules but does not match:

$ cd /sys/fs/cgroup/unified/
$ mkdir test
$ iptables -A OUTPUT -m cgroup --path test -j LOG
$ ping -i 2 google.com &>/dev/null &
$ pgrep ping > test/cgroup.procs

The process is running inside this cgroup:

$ cat /proc/<pid>/cgroup
0::/test

and iptables did not complain about an invalid cgroup path, but nothing shows up in the log.

Background

I need to run a Tor relay outside the VPN that carries all packets leaving my LAN. I followed the approach outlined in this answer and it works great (with cgroup v1). The problem is that I couldn't find a straightforward way to create a custom cgroup at boot (cgmanager fails to start, apparently because it lacks cgroup v2 support) or to assign the tor process to it (how would I do that inside a systemd service?). However, systemd does create a separate cgroup inside the unified cgroup v2 hierarchy for every service, so the tor process lives in system.slice/system-tor.slice. As the simple example above shows, iptables can't seem to match this traffic.

Best Answer

Part of the answer to your question is in my answer that you linked:

if you want to move an already running process to the cgroup, well... you can't! (...) iptables (...) doesn't match when the cgroup is switched

That said, iptables does sometimes match in this case, as with your cgroup v1 LOG rule.

Still, iptables seems to always match for the children of a moved process, because they are created in the right cgroup from the start. So a solution is to start a new shell, move that shell into the cgroup, and run the desired command from it:

sh -c "echo \$$ > /sys/fs/cgroup/unified/test/cgroup.procs && ping 8.8.8.8"

That's indeed what this cgexec replacement script for cgroup v2 does. You may need to edit the script and set the CGBASE variable to /sys/fs/cgroup/unified (find the correct path for your environment with mount -t cgroup2).
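The same idea can be wrapped in a small helper. This is only a minimal sketch and not the linked script: the function name run_in_cgroup2 is hypothetical, and CGBASE is assumed to default to the unified mount point shown in the question.

```shell
# Hypothetical helper: start a command inside a cgroup v2 group so that
# the process is created in the cgroup instead of being moved into it later.
# Assumption: CGBASE points at the cgroup2 mount (see `mount -t cgroup2`).
run_in_cgroup2() {
    cgbase="${CGBASE:-/sys/fs/cgroup/unified}"
    cgname="$1"
    shift
    # Create the group if it does not exist yet.
    mkdir -p "$cgbase/$cgname" || return 1
    # A fresh sh writes its own PID ($$) to cgroup.procs, then exec replaces
    # that shell with the command, which therefore starts inside the cgroup.
    sh -c 'echo "$$" > "$1/cgroup.procs" && shift && exec "$@"' \
        sh "$cgbase/$cgname" "$@"
}
```

With root privileges, something like run_in_cgroup2 test ping 8.8.8.8 would then start ping directly inside the test group, so a rule using --path test should see its packets.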

EDIT: Updated novpn.sh to support cgroup v2 with the -2 flag.

But is it supposed to work?

I'm a bit surprised that this answer for cgroup v2 actually works given this issue (more in the Notes of this page):

cgroup controllers can only be mounted in one hierarchy (v1 or v2).

$ cat /proc/cgroups
#subsys_name    hierarchy   num_cgroups enabled
...
net_cls         3           1           1

Which means the net_cls controller is bound to cgroup v1 (otherwise the hierarchy would be 0), yet iptables still works with the cgroup v2 parameter. As I understand it, net_cls is purely a cgroup v1 controller with no cgroup v2 counterpart: with --path, xt_cgroup matches on the socket's cgroup v2 membership directly, so no controller is involved. It therefore seems that iptables cgroup v1 and cgroup v2 rules can be used at the same time if the OS supports both cgroup v1 and v2.
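To check where a given controller is bound without reading the table by eye, the hierarchy column can be extracted directly. A minimal sketch, assuming the /proc/cgroups layout shown above; the function name controller_hierarchy is hypothetical, and the optional second argument exists only so the function can be pointed at a saved copy of the file.

```shell
# Hypothetical helper: print the hierarchy ID of a cgroup controller.
# $1 = controller name (e.g. net_cls)
# $2 = file in /proc/cgroups format (optional, defaults to /proc/cgroups)
# A non-zero ID means the controller is attached to a cgroup v1 hierarchy.
controller_hierarchy() {
    awk -v name="$1" '$1 == name { print $2 }' "${2:-/proc/cgroups}"
}
```

On the system described in the question, controller_hierarchy net_cls would print 3, matching the table above.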

Background notes on running services in a network control group:

Except for Fedora 31, which switched to cgroup v2 by default, most distributions still use cgroup v1 by default at this time. cgmanager is indeed not needed, and I recently removed it from the requirements in the answer you linked.

cgmanager is deprecated and was dropped in Ubuntu bionic in favor of systemd's own cgroup management implementation. Unfortunately, the systemd maintainers have dropped the NetClass option for cgroup v1 because they focus on cgroup v2.

So with cgroup v1, running services in a network control group becomes tricky: all of the following steps must happen BEFORE the service's main process (e.g. a tor relay, the apache executable, whatever) is executed, without any help from systemd, which is the service launcher:

  1. Create the cgroup
  2. Create the iptables rules (same issue for cgroup v2)
  3. Start (and not move!) the service main process in the proper cgroup, which is tricky for e.g. apache2 as its direct parent is normally systemd (PID=1) and not some random subshell that you can move to the cgroup

This might be possible with the systemd unit service initialization script. Otherwise, cgconfig could be used (see this question/answer for Ubuntu), but I'd stay away from cgrulesengd, as it may interfere with systemd.
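For cgroup v1, steps 1 to 3 of the list above could look roughly like the following when done by hand. This is a sketch under stated assumptions, not a tested service setup: the function name start_in_net_cls and the CGROOT and IPTABLES override variables are hypothetical, the mount point is the one from the question, and the LOG rule simply mirrors the question's example.

```shell
# Hypothetical helper for cgroup v1 (net_cls): create the group, add the
# iptables rule, then START the command inside the group (not move it).
# $1 = cgroup name, $2 = classid, remaining arguments = command to run.
# Assumptions: CGROOT is the net_cls mount point; IPTABLES can be overridden
# (e.g. IPTABLES="echo iptables") to print the rule instead of applying it.
start_in_net_cls() {
    cgdir="${CGROOT:-/sys/fs/cgroup/net_cls,net_prio}/$1"
    classid="$2"
    shift 2
    # Step 1: create the cgroup and tag it with the classid.
    mkdir -p "$cgdir" || return 1
    echo "$classid" > "$cgdir/net_cls.classid" || return 1
    # Step 2: create the iptables rule matching that classid.
    ${IPTABLES:-iptables} -A OUTPUT -m cgroup --cgroup "$classid" -j LOG || return 1
    # Step 3: a fresh sh joins the cgroup, then exec starts the command in it.
    sh -c 'echo "$$" > "$1/tasks" && shift && exec "$@"' sh "$cgdir" "$@"
}
```

Calling it repeatedly appends the same rule each time, so a real setup would add the rule once, separately from launching the service.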
