Unprivileged LXC containers are the ones making use of user namespaces (userns), i.e. a kernel feature that allows a range of UIDs on the host to be mapped into a namespace inside which a user with UID 0 can exist again.
Contrary to my initial perception of unprivileged LXC containers, this does not mean that the container has to be owned by an unprivileged host user. That is only one possibility.
What is relevant is:

- that a range of subordinate UIDs and GIDs is defined for the host user (`usermod [-v|--add-subuids] [-w|--add-subgids]`)
- ... and that this range is mapped in the container configuration (`lxc.id_map = ...`)
So even `root` can own unprivileged containers, since the effective UIDs of container processes on the host will end up inside the range defined by the mapping.
However, for `root` you have to define the subordinate IDs first. Unlike users created via `adduser`, `root` will not have a range of subordinate IDs defined by default.
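For example, to give `root` such a range you could run something like the following (a sketch; the range values are purely illustrative, pick ones that are free on your system):

```shell
# allocate subordinate UIDs and GIDs to root (illustrative range)
usermod -v 100000-399999 root   # -v / --add-subuids
usermod -w 100000-399999 root   # -w / --add-subgids

# the result is recorded in /etc/subuid and /etc/subgid, e.g.:
# root:100000:300000
grep '^root:' /etc/subuid /etc/subgid
```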
Also keep in mind that the full range you give is at your disposal, so you could have 3 containers with the following configuration lines (only UID mapping shown):

```
lxc.id_map = u 0 100000 100000
lxc.id_map = u 0 200000 100000
lxc.id_map = u 0 300000 100000
```

NB: as per a comment, recent versions call this `lxc.idmap`!

This assumes that `root` owns the subordinate UIDs between 100000 and 400000. All documentation I found suggests using 65536 subordinate IDs per container; some use 100000 to make it more human-readable, though.
In other words: You don't have to assign the same range to each container.
With over 4 billion (~ 2^32) possible subordinate IDs, that means you can be generous when dealing out subordinate ranges to your host users.
Unprivileged container owned and run by root
To drive the point home: an unprivileged LXC guest does not have to be run by an unprivileged user on the host.
Configuring your container with a subordinate UID/GID mapping like this:

```
lxc.id_map = u 0 100000 100000
lxc.id_map = g 0 100000 100000
```

where the user `root` on the host owns that given subordinate ID range, will allow you to confine guests even better.
However, there is one important additional advantage in such a scenario (and yes, I have verified that it works): you can auto-start your container at system startup.
Usually when scouring the web for information about LXC you will be told that it is not possible to autostart an unprivileged LXC guest. However, that is only true by default for those containers which are not in the system-wide storage for containers (usually something like `/var/lib/lxc`). If they are (which usually means they were created by root and are started by root), it's a whole different story.

```
lxc.start.auto = 1
```

will do the job quite nicely, once you put it into your container config.
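There are related start-up keys that can be useful alongside it (key names as in current LXC releases; check lxc.container.conf(5) on your system, since older versions may differ):

```
lxc.start.auto = 1    # start the container at host boot
lxc.start.delay = 5   # seconds to wait before starting the next container
lxc.start.order = 10  # relative start-up priority among autostarted guests
```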
Getting permissions and configuration right
I struggled with this myself a bit, so I'm adding a section here.
In addition to the configuration snippet included via `lxc.include`, which usually goes by the name `/usr/share/lxc/config/$distro.common.conf` (where `$distro` is the name of a distro), you should check whether there is also a `/usr/share/lxc/config/$distro.userns.conf` on your system and include that as well. E.g.:

```
lxc.include = /usr/share/lxc/config/ubuntu.common.conf
lxc.include = /usr/share/lxc/config/ubuntu.userns.conf
```
Furthermore, add the subordinate ID mappings:

```
lxc.id_map = u 0 100000 65536
lxc.id_map = g 0 100000 65536
```

which means that host UID 100000 is `root` inside the user namespace of the LXC guest.
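The mapping is just a fixed offset, so you can compute the host-side UID of any container UID; a quick sketch (the values assume the mapping above):

```shell
# host UID = mapping base + container UID (for IDs inside the mapped range)
BASE=100000        # third field of "lxc.id_map = u 0 100000 65536"
CONTAINER_UID=1000 # e.g. the first regular user inside the guest
echo $(( BASE + CONTAINER_UID ))   # → 101000
```

This is why files created by the guest's first regular user show up on the host as owned by UID 101000.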
Now make sure that the permissions are correct. If the name of your guest is stored in the environment variable `$lxcguest`, you would run the following:

```
# Directory for the container
chown root:root $(lxc-config lxc.lxcpath)/$lxcguest
chmod ug=rwX,o=rX $(lxc-config lxc.lxcpath)/$lxcguest
# Container config
chown root:root $(lxc-config lxc.lxcpath)/$lxcguest/config
chmod u=rw,go=r $(lxc-config lxc.lxcpath)/$lxcguest/config
# Container rootfs
chown 100000:100000 $(lxc-config lxc.lxcpath)/$lxcguest/rootfs
chmod u=rwX,go=rX $(lxc-config lxc.lxcpath)/$lxcguest/rootfs
```
This should allow you to run the container after your first attempt may have given some permission-related errors.
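With the permissions in place, a start attempt plus a status check might look like this (a sketch; these are standard LXC 1.x/2.x commands):

```shell
lxc-start -n "$lxcguest" -d   # start the guest detached
lxc-ls -f                     # list containers with state and IPs
lxc-info -n "$lxcguest"       # show state, PID and IP of the guest
```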
Best Answer
I wanted to post the answer to this question in case anyone else sees a similarly confusing result. It looks like I had two problems:

1. I needed to use the number of CPUs on the host, not the number of CPUs available in the cgroup's cpuset, to estimate CPU bandwidth: (# of CPUs on the host) * (cpu.cfs_period_us) * (0.25), so 40 * 100000 * 0.25 = 1000000.
2. My run of `stress-ng` inside the container was using the `cpu` and `cpuset` controllers of the `/lxc/foo` cgroup, while the run of `stress-ng` outside of the container was using the `/system/sshd.service` cgroup.

To better model my real-world application, I should have specified which controllers to use by using `cgexec`:
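For instance (a sketch: the cgroup path `/lxc/foo` and the `stress-ng` options mirror the setup above, and `cgexec` comes from the libcgroup tools):

```shell
# 25% of a 40-CPU host, with cpu.cfs_period_us = 100000:
NCPUS=40
PERIOD=100000
echo $(( NCPUS * PERIOD / 4 ))   # → 1000000, the cpu.cfs_quota_us to set

# run the load in the same cgroup the container uses, e.g.:
# cgexec -g cpu,cpuset:/lxc/foo stress-ng --cpu 4 --timeout 60s
```

Pinning both runs to the same `cpu`/`cpuset` controllers makes the inside-vs-outside comparison meaningful.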