What’s “broken” about cpuset cgroup inheritance semantics in the Linux kernel

Tags: cgroups, cpu usage, systemd

To quote the 2013 systemd announcement of the new control group interface (with emphasis added):

Note that the number of cgroup attributes currently exposed as unit properties is limited. This will be extended later on, as their kernel interfaces are cleaned up. For example cpuset or freezer are currently not exposed at all due to the broken inheritance semantics of the kernel logic. Also, migrating units to a different slice at runtime is not supported (i.e. altering the Slice= property for running units) as the kernel currently lacks atomic cgroup subtree moves.

So, what's broken about the inheritance semantics of the kernel logic for cpuset (and how does this brokenness not apply to other cgroup controllers such as cpu)?

There is an article on Red Hat's website giving an unverified solution for how to use cgroup cpusets in RHEL 7 despite their lack of support as easy-to-manage systemd unit properties… but is this even a good idea? The bolded quotation above is concerning.

To put it another way, what are the "gotchas" (pitfalls) that could apply to using cgroup v1 cpuset which are being referenced here?


I'm starting a bounty on this.

Possible sources of information to answer this question (in no special order) include:

  1. cgroup v1 documentation;
  2. kernel source code;
  3. test results;
  4. real-world experience.

One possible meaning of the bolded line in the quote above would be that when a new process is forked it does not stay in the same cpuset cgroup as its parent, or that it is in the same cgroup but in some sort of "unenforced" status whereby it may actually be running on a different CPU than the cgroup allows. However, this is pure speculation on my part and I need a definitive answer.

Best Answer

I'm not nearly well-versed enough in cgroups to give a definitive answer (and I certainly don't have experience with cgroups going back to 2013!), but on a vanilla Ubuntu 16.04, cgroups v1 seems to have its act together:

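I devised a small test that forks a child running as a different user: a sudo /bin/bash is spun off in the background with &, and both the parent and the child write their user name and cgroup membership to a file so the two can be diffed. The -H flag is extra paranoia to force sudo to execute with root's home environment.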

cat <(whoami) /proc/self/cgroup >me.cgroup && \
sudo -H /bin/bash -c 'cat <(whoami) /proc/self/cgroup >you.cgroup' & \
sleep 2 && diff me.cgroup you.cgroup

This yields:

1c1
< admlocal
---
> root

For reference, this is the structure of cgroup mounts on my system:

$ mount | grep group
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
lxcfs on /var/lib/lxcfs type fuse.lxcfs (rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other)
$