How to create user cgroups with systemd

Tags: cgroups, container, lxc, systemd, virtualization

I use unprivileged lxc containers on Arch Linux. Here is the basic system information:

[chb@conventiont ~]$ uname -a
Linux conventiont 3.17.4-Chb #1 SMP PREEMPT Fri Nov 28 12:39:54 UTC 2014 x86_64 GNU/Linux

It's a custom-compiled kernel with user namespaces enabled:

[chb@conventiont ~]$ lxc-checkconfig 
--- Namespaces ---
Namespaces: enabled
Utsname namespace: enabled
Ipc namespace: enabled
Pid namespace: enabled
User namespace: enabled
Network namespace: enabled
Multiple /dev/pts instances: enabled

--- Control groups ---
Cgroup: enabled
Cgroup clone_children flag: enabled
Cgroup device: enabled
Cgroup sched: enabled
Cgroup cpu account: enabled
Cgroup memory controller: enabled
Cgroup cpuset: enabled

--- Misc ---
Veth pair device: enabled
Macvlan: enabled
Vlan: enabled
File capabilities: enabled

Note : Before booting a new kernel, you can check its configuration
usage : CONFIG=/path/to/config /usr/bin/lxc-checkconfig

[chb@conventiont ~]$ systemctl --version
systemd 217
+PAM -AUDIT -SELINUX -IMA -APPARMOR +SMACK -SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID -ELFUTILS +KMOD +IDN 

Unfortunately, systemd does not currently play well with lxc. In particular, setting up cgroups for a non-root user does not seem to work well, or I am just too unfamiliar with how to do it. lxc will only start a container in unprivileged mode when it can create the necessary cgroups in /sys/fs/cgroup/XXX/*. This is not possible for lxc, however, because systemd mounts the root cgroup hierarchy in /sys/fs/cgroup/*, where an unprivileged user cannot create cgroups. A workaround seems to be the following:

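for d in /sys/fs/cgroup/*; do
        f=$(basename "$d")
        echo "looking at $f"
        if [ "$f" = "cpuset" ]; then
                # let child cgroups inherit the parent's cpuset settings
                echo 1 | sudo tee -a "$d/cgroup.clone_children"
        elif [ "$f" = "memory" ]; then
                # enable hierarchical accounting for the memory controller
                echo 1 | sudo tee -a "$d/memory.use_hierarchy"
        fi
        # create a cgroup owned by the current user in this hierarchy
        sudo mkdir -p "$d/$USER"
        sudo chown -R "$USER" "$d/$USER"
        # move the shell running this code into it
        echo $$ > "$d/$USER/tasks"
done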

This code creates the corresponding cgroup directories in the cgroup hierarchy for an unprivileged user. However, something happens that I don't understand. Before executing the code above, I see this:

[chb@conventiont ~]$ cat /proc/self/cgroup 
8:blkio:/
7:net_cls:/
6:freezer:/
5:devices:/
4:memory:/
3:cpu,cpuacct:/
2:cpuset:/
1:name=systemd:/user.slice/user-1000.slice/session-c1.scope

After executing it, I see the following in the shell I ran it in:

[chb@conventiont ~]$ cat /proc/self/cgroup 
8:blkio:/chb
7:net_cls:/chb
6:freezer:/chb
5:devices:/chb
4:memory:/chb
3:cpu,cpuacct:/chb
2:cpuset:/chb
1:name=systemd:/chb

But in any other shell I still see:

[chb@conventiont ~]$ cat /proc/self/cgroup 
8:blkio:/
7:net_cls:/
6:freezer:/
5:devices:/
4:memory:/
3:cpu,cpuacct:/
2:cpuset:/
1:name=systemd:/user.slice/user-1000.slice/session-c1.scope

Hence, I can start my unprivileged lxc container in the shell in which I executed the code above, but not in any other.

  1. Can someone explain this behaviour?

  2. Has someone found a better way to set up the required cgroups with a current version of systemd (>= 217)?

Best Answer

A better and safer solution is to install cgmanager and run it with systemctl start cgmanager (on a systemd-based distro). You can then have your root user (or yourself, if you have sudo rights on the host) create cgroups for your unprivileged user in all controllers with:

sudo cgm create all $USER
sudo cgm chown all $USER $(id -u $USER) $(id -g $USER)

Once they have been created, the unprivileged user can move any process they have access to into their cgroup in every controller with:

cgm movepid all $USER $PPID

This is safer, faster, and more reliable than the shell script I posted.
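The three cgm commands above could be wrapped in a single function, roughly as sketched here. The function name setup_user_cgroups, the DRY_RUN switch and the cg_* variables are made up for illustration and are not part of cgmanager. Note one deliberate difference: the command above passes $PPID, while this sketch takes the PID as an argument and defaults to $$, which assumes the function is defined and called in the shell you want to move.

```shell
# Hypothetical helper (not part of cgmanager): wraps the three cgm calls above.
# Usage: setup_user_cgroups [user] [pid]
# Set DRY_RUN=1 to print the commands instead of running them.
setup_user_cgroups() {
    cg_user=${1:-$USER}
    cg_pid=${2:-$$}     # assumption: default to the shell that calls the function
    cg_uid=$(id -u "$cg_user") || return 1
    cg_gid=$(id -g "$cg_user") || return 1

    cg_run=
    if [ "${DRY_RUN:-0}" = 1 ]; then
        cg_run=echo
    fi

    # Create the cgroup in all controllers and hand ownership to the user (needs root).
    $cg_run sudo cgm create all "$cg_user" || return 1
    $cg_run sudo cgm chown all "$cg_user" "$cg_uid" "$cg_gid" || return 1
    # Move the chosen process into the new cgroup; this step runs as the user.
    $cg_run cgm movepid all "$cg_user" "$cg_pid"
}
```

Running it once with DRY_RUN=1 set shows exactly which commands would be issued before anything is changed on the host.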

Manual solution:

To answer 1., here is the script from the question again:

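for d in /sys/fs/cgroup/*; do
        f=$(basename "$d")
        echo "looking at $f"
        if [ "$f" = "cpuset" ]; then
                # let child cgroups inherit the parent's cpuset settings
                echo 1 | sudo tee -a "$d/cgroup.clone_children"
        elif [ "$f" = "memory" ]; then
                # enable hierarchical accounting for the memory controller
                echo 1 | sudo tee -a "$d/memory.use_hierarchy"
        fi
        # create a cgroup owned by the current user in this hierarchy
        sudo mkdir -p "$d/$USER"
        sudo chown -R "$USER" "$d/$USER"
        # move the shell running this code into it
        echo $$ > "$d/$USER/tasks"
done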

I did not know exactly what was going on when I wrote that script, but reading the cgroups documentation and experimenting a bit helped me understand it. The script creates a new cgroup for the current user in every hierarchy, as I already stated above. When I run these commands in the current shell, or put them in a script and have it evaluated in the current shell rather than in a subshell (via . script; the . is important for this to work!), I not only create the new cgroup for the user but also add the current shell as a process that runs in it. I can achieve the same effect by running the script in a subshell, then descending into the chb cgroup in each hierarchy and using echo $$ > tasks to add the current shell to the chb cgroup there.
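The difference between sourcing and executing a script can be seen by printing $$ both ways. The following is only a small sketch; the function names pid_when_sourced and pid_when_executed are made up for the demonstration.

```shell
# Print the PID a one-line script sees when it is sourced vs. executed.
pid_when_sourced() {
    demo_script=$(mktemp) || return 1
    echo 'echo $$' > "$demo_script"
    . "$demo_script"        # evaluated by the current shell: prints this shell's PID
    rm -f "$demo_script"
}

pid_when_executed() {
    demo_script=$(mktemp) || return 1
    echo 'echo $$' > "$demo_script"
    sh "$demo_script"       # evaluated by a child process: prints the child's PID
    rm -f "$demo_script"
}
```

Only the sourced variant reports the PID of the interactive shell, which is why echo $$ > tasks moves that shell into the cgroup only when the script is sourced.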

Hence, when I run lxc in that shell, my container also becomes a member of all the chb cgroups that the shell is a member of. In other words, my container inherits the cgroup membership of my shell. This also explains why it doesn't work in any other shell that is not part of the chb cgroups.
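Whether a given shell is already in the user's cgroup in every hierarchy can be checked by parsing /proc/self/cgroup, for example with a helper along these lines. The name in_user_cgroup is hypothetical, and the sketch assumes the id:controllers:path line format shown in the question.

```shell
# Hypothetical helper: succeeds only if every hierarchy listed in the file
# places the process in /USER or below it; prints the hierarchies that do not.
# Usage: in_user_cgroup USER [FILE]   (FILE defaults to /proc/self/cgroup)
in_user_cgroup() {
    cg_user=$1
    cg_file=${2:-/proc/self/cgroup}
    awk -F: -v want="/$cg_user" '
        # Field 2 is the controller list, field 3 the cgroup path.
        # Paths that themselves contain ":" are not handled by this sketch.
        $3 != want && index($3, want "/") != 1 { bad = 1; print $2 ": " $3 }
        END { exit bad + 0 }
    ' "$cg_file"
}
```

Calling in_user_cgroup "$USER" before starting the container gives a quick indication of whether lxc would run inside the user's cgroups from that shell.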

I still pass on 2. We'll probably need to wait for either a systemd update or further kernel developments before systemd adopts a consistent behaviour, but I prefer the manual setup anyway, as it forces you to understand what you're doing.
