Why is conmon in a different cgroup when podman is started with systemd?

cgroups, containers, docker, systemd

Given that podman is installed on a Linux system and a systemd unit named baz.service exists:

# /etc/systemd/system/baz.service
[Service]
ExecStart=/usr/bin/podman run --rm --tty --name baz alpine sh -c 'while true; do date; sleep 1; done'
ExecStop=/usr/bin/podman stop baz

And baz.service is started:

# systemctl daemon-reload
# systemctl start baz.service

Then when I check the status of the unit, I don't see the sh or sleep processes in the /system.slice/baz.service cgroup:

# systemctl status baz
● baz.service
   Loaded: loaded (/etc/systemd/system/baz.service; static; vendor preset: enabl
   Active: active (running) since Sat 2019-08-10 05:50:18 UTC; 14s ago
 Main PID: 16910 (podman)
    Tasks: 9
   Memory: 7.3M
      CPU: 68ms
   CGroup: /system.slice/baz.service
           └─16910 /usr/bin/podman run --rm --tty --name baz alpine sh -c while
# ...

I was expecting to see the sh and sleep children in my baz.service status, because I've heard people from Red Hat say podman uses a traditional fork-exec model.

If podman did fork and exec, then wouldn't my sh and sleep processes be children of podman and be in the same cgroup as the original podman process?

I was expecting to be able to use systemd and podman to manage my containers without the children going off to a different parent and escaping from my baz.service systemd unit.

Looking at the output of ps, I can see that sh and sleep are actually children of a different process called conmon. I'm not sure where conmon came from or how it was started, but systemd didn't capture it.

# ps -Heo user,pid,ppid,comm
# ...
root     17254     1   podman
root     17331     1   conmon
root     17345 17331     sh
root     17380 17345       sleep

From the output it's clear that my baz.service unit is not managing the conmon -> sh -> sleep chain.
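
Cgroup membership can also be read directly from /proc to double-check this (a quick illustrative check; the pgrep patterns assume nothing else named conmon or sleep is running on the host):

# cat /proc/$(pgrep -x conmon)/cgroup
# cat /proc/$(pgrep -x sleep)/cgroup

Neither of them reports /system.slice/baz.service; only the original podman run process does.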

  • How is podman different from the docker client server model?
  • How is podman's conmon different from docker's containerd?

Maybe they are both container runtimes, and the dockerd daemon is what people want to get rid of.

So maybe docker is like:

  • dockerd daemon
  • docker cli
  • containerd container runtime

And podman is like:

  • podman cli
  • conmon container runtime

So maybe podman uses a traditional fork-exec model, but it's not the podman cli that's forking and execing; it's the conmon process.

I feel confused.

Best Answer

The whole idea behind podman is to move away from the centralized architecture with a super-powerful overseer (e.g. dockerd), where the centralized daemon is a single point of failure. There is even a hashtag about this - "#nobigfatdaemons".

How do you avoid centralized container management? You remove the single main daemon (again, dockerd) and start the containers independently (at the end of the day, containers are just processes, so you don't need a daemon to spawn them).

However, you still need a way to

  • collect the container's logs - someone has to hold the container's stdout and stderr;
  • collect the container's exit code - someone has to wait(2) on the container's PID 1.

For this purpose, each podman container is still supervised by a small daemon, called conmon (from "container monitor"). The difference with the Docker daemon is that this daemon is as small as possible (check the size of the source code), and it is spawned per-container. If conmon for one container crashes, the rest of the system stays unaffected.
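
Conceptually, the job this per-container monitor has to do can be sketched in a few lines of shell (a rough illustration only, not how conmon is actually implemented; the log path and exit status are made up):

$ sh -c 'date; sleep 1; exit 3' >/tmp/ctr.log 2>&1 &   # a stand-in "container": produce output, exit 3
$ wait $!                                              # hold its stdio destination and wait(2) for it
$ echo "container exited with status $?"               # $? is now 3, the container's exit code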

Next, how does the container get spawned?

Considering that the user may want to run the container in the background, like with Docker, the podman run process forks twice and only then executes conmon:

$ strace -fe trace=fork,vfork,clone,execve -qq podman run alpine
execve("/usr/bin/podman", ["podman", "run", "alpine"], 0x7ffeceb01518 /* 30 vars */) = 0
...
[pid  8480] clone(child_stack=0x7fac6bffeef0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tid=[8484], tls=0x7fac6bfff700, child_tidptr=0x7fac6bfff9d0) = 8484
...
[pid  8484] clone(child_stack=NULL, flags=CLONE_VM|CLONE_VFORK|SIGCHLD <unfinished ...>
[pid  8491] execve("/usr/bin/conmon", ... <unfinished ...>
[pid  8484] <... clone resumed>)        = 8491

The middle process between podman run and conmon (i.e. the direct parent of conmon - in the example above it is PID 8484) will exit, and conmon will be reparented to init, thus becoming a self-managed daemon. After this, conmon also forks off the runtime (e.g. runc) and, finally, the runtime executes the container's entrypoint (e.g. /bin/sh).
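
The reparenting step is easy to reproduce with a plain shell trick (an illustration of the idea only, not podman's actual code): a subshell forks a child and exits immediately, orphaning it.

$ ( sleep 300 & )               # the intermediate "parent" exits right away
$ ps -C sleep -o pid,ppid,comm  # the orphaned sleep now shows PPID 1 (or the nearest subreaper)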

When the container is running, podman run is no longer required and may exit, but in your case it stays around because you did not ask it to detach from the container.
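
For comparison, in detached mode (the -d/--detach flag) podman run prints the container ID and exits right away, while conmon and the container keep running:

$ podman run -d --name baz alpine sh -c 'while true; do date; sleep 1; done'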

Next, podman makes use of cgroups to limit the containers. This means that it creates new cgroups for new containers and moves the processes there. By the rules of cgroups, a process may be a member of only one cgroup per hierarchy at a time, and adding a process to some cgroup removes it from the cgroup where it was previously (within the same hierarchy). So, when the container is started, the final layout of cgroups looks like the following: podman run remains in the cgroup of baz.service, created by systemd, the conmon process is placed into its own cgroups, and the containerized processes are placed into their own cgroups:

$ ps axf
<...>
 1660 ?        Ssl    0:01 /usr/bin/podman run --rm --tty --name baz alpine sh -c while true; do date; sleep 1; done
 1741 ?        Ssl    0:00 /usr/bin/conmon -s -c 2f56e37a0c5ca6f4282cc4c0f4c8e5c899e697303f15c5dc38b2f31d56967ed6 <...>
 1753 pts/0    Ss+    0:02  \_ sh -c while true; do date; sleep 1; done
13043 pts/0    S+     0:00      \_ sleep 1
<...>

$ cd /sys/fs/cgroup/memory/machine.slice
$ ls -d1 libpod*
libpod-2f56e37a0c5ca6f4282cc4c0f4c8e5c899e697303f15c5dc38b2f31d56967ed6.scope
libpod-conmon-2f56e37a0c5ca6f4282cc4c0f4c8e5c899e697303f15c5dc38b2f31d56967ed6.scope

$ cat libpod-2f56e37a0c5ca6f4282cc4c0f4c8e5c899e697303f15c5dc38b2f31d56967ed6.scope/cgroup.procs 
1753
13075

$ cat libpod-conmon-2f56e37a0c5ca6f4282cc4c0f4c8e5c899e697303f15c5dc38b2f31d56967ed6.scope/cgroup.procs 
1741

Note: PID 13075 above is actually a sleep 1 process, spawned after the death of PID 13043.
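
The "one cgroup per hierarchy" rule described above can also be observed by hand (a minimal sketch: it assumes a cgroup v1 memory hierarchy mounted at /sys/fs/cgroup/memory as in the listing above, root privileges, and an arbitrary demo cgroup name):

$ cd /sys/fs/cgroup/memory
$ sudo mkdir demo                        # create a new cgroup in the memory hierarchy
$ sleep 600 &                            # a throwaway process to move around
$ echo $! | sudo tee demo/cgroup.procs   # adding it here removes it from its previous memory cgroup
$ grep memory /proc/$!/cgroup            # its memory line now points at .../demo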

Hope this helps.