Systemd Shutdown – What Exactly is a Stop Job?

Tags: shutdown, systemd

After a shutdown command is issued, sometimes one gets a status message like this:

A stop job is running for Session 1 of user xy

and then the system hangs for a while, or forever, depending on ???

So what exactly is "a stop job"?

Also, why does it sometimes estimate quite accurately how long it will take, while at other times it can run forever?

Best Answer

systemd operates internally in terms of a queue of "jobs". Each job (simplifying a little bit) is an action to take: stop, check, start, or restart a particular unit.

When (for example) you instruct systemd to start a service unit, it works out a list of stop and start jobs for whatever units (service units, mount units, device units, and so forth) are needed to achieve that goal, according to unit requirements and dependencies. It orders those jobs according to unit ordering relationships, works out and (if possible) fixes up any self-contradictions, and (if that final step succeeds) places them in the queue.

Then it tries to perform the enqueued "jobs".
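As a rough illustration of the "work out the jobs, order them, then enqueue them" idea, here is a toy model in Python. It is only a sketch and is not how systemd is implemented: the unit names, the ORDERING table, and build_queue are all made up for the example, and requirement dependencies and conflict resolution are left out entirely.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical units: each one maps to the units it must be started *after*.
ORDERING = {
    "network.target": [],
    "database.service": ["network.target"],
    "webapp.service": ["database.service", "network.target"],
}


def build_queue(job_type, ordering):
    """Return (job_type, unit) pairs in an order that respects the ordering table."""
    try:
        start_order = list(TopologicalSorter(ordering).static_order())
    except CycleError as exc:
        # A self-contradictory ordering cannot be enqueued in this toy model.
        raise ValueError(f"ordering cycle: {exc.args[1]}") from exc
    # Stop jobs run in the reverse of the start order.
    units = start_order if job_type == "start" else list(reversed(start_order))
    return [(job_type, unit) for unit in units]


start_queue = build_queue("start", ORDERING)
stop_queue = build_queue("stop", ORDERING)

for job in stop_queue:
    print(job)
```

The point of the toy is just that a "stop job" is one entry in such a queue: a (stop, unit) pair that waits its turn and then has to finish before the things ordered after it can proceed.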

A stop job is running for Session 1 of user xy

The unit display name here is Session 1 of user xy. This will be (from the display name) a session scope unit, with a name like session-1.scope, not a service unit. This is the user-space login session abstraction that is maintained by systemd's logind program and its PAM plugins. It is (in essence and in theory) a grouping of all of the processes that the user is running as a "login session" somewhere.

The job that has been enqueued against it is stop. And it's probably taking a long time because the systemd people have erroneously conflated session hangup with session shutdown. They break the former to get the latter to work, and in response some people alter systemd to break the latter to get the former to work. The systemd people really should recognize that they are two different things.

In your login session, you have something that ignores SIGTERM or that takes a long time to terminate once it has seen SIGTERM. Ironically, the former is the long-standing behaviour of some job-control shells. The correct way to terminate login session leaders when they are these particular job-control shells is to tell them that the session has been hung up, whereupon they terminate all of their jobs (a different kind of job to the internal systemd job) and then terminate themselves.
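The difference between the two signals can be seen with a small experiment. The sketch below assumes a POSIX system and uses a stand-in child process, not a real job-control shell: the child ignores SIGTERM and leaves SIGHUP at its default action, which is roughly the situation described above. The CHILD script and the demo function are invented for the example.

```python
import signal
import subprocess
import sys

# Stand-in for a session leader that ignores SIGTERM but not SIGHUP.
CHILD = (
    "import signal, time\n"
    "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
    "signal.signal(signal.SIGHUP, signal.SIG_DFL)\n"
    "print('ready', flush=True)\n"
    "time.sleep(60)\n"
)


def demo():
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD], stdout=subprocess.PIPE, text=True
    )
    try:
        proc.stdout.readline()  # wait until the child has set up its signals
        proc.send_signal(signal.SIGTERM)
        try:
            proc.wait(timeout=1)
            survived_term = False
        except subprocess.TimeoutExpired:
            survived_term = True  # SIGTERM was ignored
        proc.send_signal(signal.SIGHUP)
        proc.wait(timeout=5)
        return survived_term, proc.returncode
    finally:
        if proc.poll() is None:
            proc.kill()
            proc.wait()
        proc.stdout.close()


survived_term, exit_status = demo()
print("survived SIGTERM:", survived_term)
print("exit status after SIGHUP:", exit_status)
```

A negative exit status from subprocess means the child was killed by that signal number, so here the child should outlive the SIGTERM and then be reported as terminated by SIGHUP.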

What's actually happening is that systemd is waiting out the unit's stop timeout before it resorts to SIGKILL. This timeout is configurable per unit, of course, and can be set to never time out. That is why one can see different behaviours.
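The "SIGTERM, wait, then SIGKILL" pattern can be sketched in a few lines of Python. This is an illustration of the idea only, assuming a POSIX system; it is not systemd's code, and spawn and stop_with_timeout are hypothetical helpers. In real service units the corresponding setting is TimeoutStopSec=, with DefaultTimeoutStopSec= as the system-wide default.

```python
import signal
import subprocess
import sys


def spawn(ignore_sigterm):
    """Start a stand-in process that either ignores or honours SIGTERM."""
    action = "signal.SIG_IGN" if ignore_sigterm else "signal.SIG_DFL"
    code = (
        "import signal, time\n"
        f"signal.signal(signal.SIGTERM, {action})\n"
        "print('ready', flush=True)\n"
        "time.sleep(60)\n"
    )
    proc = subprocess.Popen(
        [sys.executable, "-c", code], stdout=subprocess.PIPE, text=True
    )
    proc.stdout.readline()  # wait until the signal disposition is in place
    return proc


def stop_with_timeout(proc, stop_timeout):
    """Send SIGTERM; if the process outlives stop_timeout seconds, send SIGKILL.

    stop_timeout=None waits forever, loosely like a timeout of "infinity".
    Returns the name of the signal that ended the process, or "exited".
    """
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=stop_timeout)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL cannot be caught or ignored
        proc.wait()
    proc.stdout.close()
    if proc.returncode < 0:
        return signal.Signals(-proc.returncode).name
    return "exited"


stubborn = stop_with_timeout(spawn(ignore_sigterm=True), stop_timeout=0.5)
polite = stop_with_timeout(spawn(ignore_sigterm=False), stop_timeout=0.5)
print("ignores SIGTERM ->", stubborn)
print("honours SIGTERM ->", polite)
```

With a finite timeout the stubborn process only holds things up for that long. With stop_timeout=None the same call would block for as long as the process keeps ignoring SIGTERM, which is the "runs forever" case.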
