Linux – systemd reboot threshold limit

linux, reboot, services, systemd

Related:
Limit system reboot burst

I'm working on a commercial product that runs a camera service. This service is critical to the normal functioning of the system. So far it is going well, and I'm able to restart the service if it fails due to low-level protocol/driver issues. Here is a snippet from the .service unit file that handles the service restart and reboot logic.

...
[Service]
Restart=on-failure
StartLimitInterval=2min
StartLimitBurst=5
StartLimitAction=reboot-force
...

Under certain conditions (for example, bus rail faults), it is quite possible that no number of reboots will recover the system. In that situation, we want to stop rebooting the device (it could be annoying to the user) and stop all attempts to recover the camera pipelines. This could be achieved with a monitoring service that keeps track of how many reboots the device has gone through and blocks further reboots once a limit is reached.
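To make the monitoring idea concrete, here is a minimal sketch of the bookkeeping such a service would need: a counter kept on disk so it survives reboots, plus a check against a limit. The state file path, the limit of 3, and the function names are all assumptions made up for illustration; nothing here is provided by systemd.

```python
import json
from pathlib import Path

# Hypothetical values: the path and the limit are assumptions for illustration.
STATE_FILE = Path("/var/lib/camera-guard/reboots.json")
MAX_REBOOTS = 3


def load_count(path: Path) -> int:
    """Return the stored reboot count, or 0 if the file is missing or unreadable."""
    try:
        return int(json.loads(path.read_text())["count"])
    except (OSError, ValueError, KeyError, TypeError):
        return 0


def record_reboot(path: Path, limit: int = MAX_REBOOTS) -> bool:
    """Count one more requested reboot; return True if a reboot is still allowed."""
    count = load_count(path)
    if count >= limit:
        # Limit reached: the caller should give up instead of rebooting.
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps({"count": count + 1}))
    return True


def reset(path: Path) -> None:
    """Clear the counter, e.g. once the camera service has run cleanly for a while."""
    path.unlink(missing_ok=True)
```

One possible wiring, untested here, would be to drop StartLimitAction=reboot-force and instead have a small helper unit (activated through OnFailure= on the camera service) call record_reboot(STATE_FILE) and only run systemctl reboot when it returns True.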

The other option I thought of is to rely on systemd itself, instead of adding another monitoring service for this purpose alone (which would in turn have to be monitored by systemd). I have spent some time looking at systemd options and reading through the documentation and examples to see whether such a reboot threshold exists. I'm looking for a way to restrict the number of reboots to some configurable value, say StartLimitReboot.

tl;dr

I want to achieve something like this

...
[Service]
...
...
...
# hypothetical option: stop rebooting after this limit
StartLimitReboot=3
...

It looks like systemd doesn't support such semantics as of now, but if it did, that would simplify my task substantially.

Best Answer

No, systemd doesn't offer a feature that counts reboots and stops rebooting once a limit is reached.

Consider a case where your app fails twice and triggers two reboots via StartLimitAction, then remains stable for two weeks, and then triggers a third reboot the same way much later. Would you expect the theoretical StartLimitReboot=3 to trigger in this case?

If not, there has to be some timeout after which the "reboot counter" expires. This is unlike the timer that limits how fast a service restarts, because a reboot timer would have to factor in how long the machine takes to boot before it even tries to start the service again.
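The expiry question can be sketched as a sliding window over reboot timestamps: only reboots newer than the window count against the limit. This is just one way a custom monitor might answer it, not something systemd does; the one-day window, the limit of 3, and the function names are assumptions for illustration.

```python
import time
from typing import List, Optional

# Hypothetical policy values, chosen only for illustration.
MAX_REBOOTS = 3
WINDOW_SECONDS = 24 * 60 * 60  # forget reboots older than one day


def prune(timestamps: List[float], now: float,
          window: float = WINDOW_SECONDS) -> List[float]:
    """Keep only the reboot timestamps that are still inside the window."""
    return [t for t in timestamps if now - t < window]


def may_reboot(timestamps: List[float], now: Optional[float] = None,
               limit: int = MAX_REBOOTS,
               window: float = WINDOW_SECONDS) -> bool:
    """Return True if fewer than `limit` reboots happened within the window."""
    if now is None:
        now = time.time()  # wall-clock time, so it stays comparable across boots
    return len(prune(timestamps, now, window)) < limit
```

With this policy, the two early reboots in the scenario above have long expired by the time the third one happens two weeks later, so the third reboot would be allowed. The window has to be much longer than the boot time plus StartLimitInterval, since every boot cycle eats into it, and the sketch assumes the wall clock is reasonably correct after each boot, which may not hold on a device without a battery-backed RTC.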

Also, if a system is stuck in a boot loop due to a critical service failure, does it even make sense to keep the machine on when the critical service isn't working, or should it just give up and power off at that point?

While I can see the interest in having systemd help here, I don't expect this feature to appear too soon.
