Why is conmon in a different cgroup when podman is started with systemd?

cgroups, containers, docker, systemd

Given that podman is installed on a Linux system and a systemd unit named baz.service exists:

# /etc/systemd/system/baz.service
[Service]
ExecStart=/usr/bin/podman run --rm --tty --name baz alpine sh -c 'while true; do date; sleep 1; done'
ExecStop=/usr/bin/podman stop baz

And baz.service is started:

# systemctl daemon-reload
# systemctl start baz.service

Then when I check the status of the unit, I don't see the sh or sleep processes in the /system.slice/baz.service cgroup:

# systemctl status baz
● baz.service
   Loaded: loaded (/etc/systemd/system/baz.service; static; vendor preset: enabl
   Active: active (running) since Sat 2019-08-10 05:50:18 UTC; 14s ago
 Main PID: 16910 (podman)
    Tasks: 9
   Memory: 7.3M
      CPU: 68ms
   CGroup: /system.slice/baz.service
           └─16910 /usr/bin/podman run --rm --tty --name baz alpine sh -c while
# ...

I was expecting to see the sh and sleep children in my baz.service status, because I've heard people from Red Hat say podman uses a traditional fork-exec model.

If podman did fork and exec, then wouldn't my sh and sleep processes be children of podman and be in the same cgroup as the original podman process?

I was expecting to be able to use systemd and podman to manage my containers without the children going off to a different parent and escaping from my baz.service systemd unit.

Looking at the output of ps, I can see that sh and sleep are actually children of a different process called conmon. I'm not sure where conmon came from or how it was started, but systemd didn't capture it.

# ps -Heo user,pid,ppid,comm
# ...
root     17254     1   podman
root     17331     1   conmon
root     17345 17331     sh
root     17380 17345       sleep

From the output it's clear that my baz.service unit is not managing the conmon -> sh -> sleep chain.
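
Cgroup membership can also be read directly from /proc to double-check this (a quick illustrative check; the pgrep patterns assume nothing else named conmon or sleep is running on the host):

# cat /proc/$(pgrep -x conmon)/cgroup
# cat /proc/$(pgrep -x sleep)/cgroup

Neither of them reports /system.slice/baz.service; only the original podman run process does.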

  • How is podman different from the docker client server model?
  • How is podman's conmon different from docker's containerd?

Maybe they are both container runtimes, and the dockerd daemon is what people want to get rid of.

So maybe docker is like:

  • dockerd daemon
  • docker cli
  • containerd container runtime

And podman is like:

  • podman cli
  • conmon container runtime

So maybe podman uses a traditional fork-exec model, but it's not the podman cli that's forking and execing; it's the conmon process.

I feel confused.

Best Answer

The whole idea behind podman is to move away from the centralized architecture with a super-powerful overseer (e.g. dockerd), where the centralized daemon is a single point of failure. There is even a hashtag about this - "#nobigfatdaemons".

How do you avoid centralized container management? You remove the single main daemon (again, dockerd) and start the containers independently (at the end of the day, containers are just processes, so you don't need a daemon to spawn them).

However, you still need a way to

  • collect the container's logs - someone has to hold the container's stdout and stderr;
  • collect the container's exit code - someone has to wait(2) on the container's PID 1.

For this purpose, each podman container is still supervised by a small daemon, called conmon (from "container monitor"). The difference with the Docker daemon is that this daemon is as small as possible (check the size of the source code), and it is spawned per-container. If conmon for one container crashes, the rest of the system stays unaffected.
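
Conceptually, the job this per-container monitor has to do can be sketched in a few lines of shell (a rough illustration only, not how conmon is actually implemented; the log path and exit status are made up):

$ sh -c 'date; sleep 1; exit 3' >/tmp/ctr.log 2>&1 &   # a stand-in "container": produce output, exit 3
$ wait $!                                              # hold its stdio destination and wait(2) for it
$ echo "container exited with status $?"               # $? is now 3, the container's exit code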

Next, how does the container get spawned?

Considering that the user may want to run the container in the background, like with Docker, the podman run process forks twice and only then executes conmon:

$ strace -fe trace=fork,vfork,clone,execve -qq podman run alpine
execve("/usr/bin/podman", ["podman", "run", "alpine"], 0x7ffeceb01518 /* 30 vars */) = 0
...
[pid  8480] clone(child_stack=0x7fac6bffeef0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tid=[8484], tls=0x7fac6bfff700, child_tidptr=0x7fac6bfff9d0) = 8484
...
[pid  8484] clone(child_stack=NULL, flags=CLONE_VM|CLONE_VFORK|SIGCHLD <unfinished ...>
[pid  8491] execve("/usr/bin/conmon", ... <unfinished ...>
[pid  8484] <... clone resumed>)        = 8491

The middle process between podman run and conmon (i.e. the direct parent of conmon - in the example above it is PID 8484) will exit, and conmon will be reparented to init, thus becoming a self-managed daemon. After this, conmon also forks off the runtime (e.g. runc) and, finally, the runtime executes the container's entrypoint (e.g. /bin/sh).
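
The reparenting step is easy to reproduce with a plain shell trick (an illustration of the idea only, not podman's actual code): a subshell forks a child and exits immediately, orphaning it.

$ ( sleep 300 & )               # the intermediate "parent" exits right away
$ ps -C sleep -o pid,ppid,comm  # the orphaned sleep now shows PPID 1 (or the nearest subreaper)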

When the container is running, podman run is no longer required and may exit, but in your case it stays around because you did not ask it to detach from the container.
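
For comparison, in detached mode (the -d/--detach flag) podman run prints the container ID and exits right away, while conmon and the container keep running:

$ podman run -d --name baz alpine sh -c 'while true; do date; sleep 1; done'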

Next, podman makes use of cgroups to limit the containers. This means that it creates new cgroups for new containers and moves the processes there. By the rules of cgroups, a process may be a member of only one cgroup per hierarchy at a time, and adding a process to some cgroup removes it from the cgroup where it was previously (within the same hierarchy). So, when the container is started, the final layout of cgroups looks like the following: podman run remains in the cgroup of baz.service, created by systemd, the conmon process is placed into its own cgroups, and the containerized processes are placed into their own cgroups:

$ ps axf
<...>
 1660 ?        Ssl    0:01 /usr/bin/podman run --rm --tty --name baz alpine sh -c while true; do date; sleep 1; done
 1741 ?        Ssl    0:00 /usr/bin/conmon -s -c 2f56e37a0c5ca6f4282cc4c0f4c8e5c899e697303f15c5dc38b2f31d56967ed6 <...>
 1753 pts/0    Ss+    0:02  \_ sh -c while true; do date; sleep 1; done
13043 pts/0    S+     0:00      \_ sleep 1
<...>

$ cd /sys/fs/cgroup/memory/machine.slice
$ ls -d1 libpod*
libpod-2f56e37a0c5ca6f4282cc4c0f4c8e5c899e697303f15c5dc38b2f31d56967ed6.scope
libpod-conmon-2f56e37a0c5ca6f4282cc4c0f4c8e5c899e697303f15c5dc38b2f31d56967ed6.scope

$ cat libpod-2f56e37a0c5ca6f4282cc4c0f4c8e5c899e697303f15c5dc38b2f31d56967ed6.scope/cgroup.procs 
1753
13075

$ cat libpod-conmon-2f56e37a0c5ca6f4282cc4c0f4c8e5c899e697303f15c5dc38b2f31d56967ed6.scope/cgroup.procs 
1741

Note: PID 13075 above is actually a sleep 1 process, spawned after the death of PID 13043.
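
The "one cgroup per hierarchy" rule described above can also be observed by hand (a minimal sketch: it assumes a cgroup v1 memory hierarchy mounted at /sys/fs/cgroup/memory as in the listing above, root privileges, and an arbitrary demo cgroup name):

$ cd /sys/fs/cgroup/memory
$ sudo mkdir demo                        # create a new cgroup in the memory hierarchy
$ sleep 600 &                            # a throwaway process to move around
$ echo $! | sudo tee demo/cgroup.procs   # adding it here removes it from its previous memory cgroup
$ grep memory /proc/$!/cgroup            # its memory line now points at .../demo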

Hope this helps.