Ubuntu – How to prevent so many instances of apt-check running

apt, memory usage

I have an Ubuntu 12.04 server that just crashed from a very obvious cause: 30+ apt-check processes consumed all memory, the OOM killer kicked in, and it killed vital services. I'm not sure where the apt-check processes come from, but I guess my Nagios/Icinga check_apt plugin might use it, and the byobu status line may want to display its output. I guess something locked up and all of the processes were just waiting while still holding memory.

How can I prevent so many instances of apt-check from piling up on the system? It doesn't make sense to me; apt-check should just quit as soon as it can't get a read lock on the dpkg database.

It seems that I'm not the only one running into trouble here. All suggestions for apt-check are pretty negative:

[Screenshot: suggestions for apt-check]

(clean browser, not logged in, no personalised search)

Best Answer

Digging into apt-check gave me some clues that it is a very blunt script that needs fixing. With all due respect to its authors, it is failing on my servers. Here are my findings:

  • apt-check == /usr/lib/update-notifier/apt_check.py
  • forces nicelevel 19 for itself
  • no timeouts set on actions

The combination of the last two lets instances pile up endlessly in a downward spiral. If the system is busy with other, higher-priority work, the number of apt-check processes will just keep growing, because apt-check never gets any priority over that work. The trouble only gets worse once the OOM killer decides to kill your vital system processes.

My assumption is that if either of these two behaviours were different, the system could not end up in such a broken state.

While strings is right that the parent processes share responsibility for this, I believe the points below are flaws in apt-check and have to be reported as a bug to get addressed properly:

  • it should hint the OOM killer to have itself killed first
  • it should not set the nicelevel hardcoded
  • it should exit if it takes an unreasonable amount of time to get pieces of information
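
Until something like that is fixed upstream, the "exit instead of piling up" behaviour can be approximated from the outside. Below is a minimal sketch of a wrapper, not anything apt-check itself does: it takes a non-blocking lock so a second invocation gives up at once, and it enforces a timeout on the command. The names run_guarded, LOCK_PATH and TIMEOUT_SECONDS are made up for illustration, and it assumes Python 3.5+ (for subprocess.run), which is newer than what 12.04 ships by default.

```python
import fcntl
import subprocess

LOCK_PATH = "/tmp/apt-check-wrapper.lock"  # hypothetical lock file location
TIMEOUT_SECONDS = 60                       # hypothetical upper bound per run


def run_guarded(command, lock_path=LOCK_PATH, timeout=TIMEOUT_SECONDS):
    """Run `command` unless another guarded run already holds the lock.

    Returns the command's combined stdout/stderr as text, or None if the
    lock was busy or the command did not finish within `timeout` seconds.
    """
    with open(lock_path, "w") as lock_file:
        try:
            # Non-blocking exclusive lock: fail at once instead of queueing.
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return None
        try:
            result = subprocess.run(
                command,
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,
                timeout=timeout,
                universal_newlines=True,
            )
        except subprocess.TimeoutExpired:
            # subprocess.run() kills the child before raising this.
            return None
        # Leaving the "with" block closes the file and releases the lock.
        return result.stdout
```

A monitoring plugin or status line could then call run_guarded(["/usr/lib/update-notifier/apt-check"]) and treat None as "no fresh data" rather than waiting behind other instances.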

Actually, it seems that the Linux OOM killer already applies some heuristics here: niced processes get an increased score, and long-running processes get a decreased one. (source - thanks to Ulrich Dangel for pointing it out)

Possible solutions I would propose:
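
A process does not have to rely on those heuristics alone. On kernels that provide /proc/<pid>/oom_score_adj (2.6.36 and later), a process can raise its own adjustment, in the range -1000 to 1000, without special privileges, which makes it a more likely OOM victim. The sketch below shows how a script could do that for itself; the function name and the default of 500 are my own choices for illustration, and apt-check does not do this as far as I can tell.

```python
def raise_oom_score_adj(value=500, path="/proc/self/oom_score_adj"):
    """Ask the kernel to prefer this process when the OOM killer runs.

    `value` must be within -1000..1000; higher means "kill me earlier".
    Returns True if the value was written, False if the file could not
    be written (for example on a kernel without oom_score_adj).
    """
    if not -1000 <= value <= 1000:
        raise ValueError("oom_score_adj must be between -1000 and 1000")
    try:
        with open(path, "w") as adj_file:
            adj_file.write(str(value))
    except OSError:
        return False
    return True
```

Calling raise_oom_score_adj() early in a low-priority helper would let the OOM killer pick that helper before it picks a database or web server.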

  • cache results after processing
  • output the cache if it is less than N seconds old, without loading all the Python-APT libraries for every simple (even --help) invocation
  • make the nicelevel configurable - Allow me to change/disable this, please! I believe that setting it to 0 will actually help
  • have it increase the OOM killer score
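
To illustrate the caching idea from the first two points, here is a rough sketch. It is not taken from apt_check.py; cached_result, CACHE_PATH and MAX_AGE_SECONDS are hypothetical names, and compute stands in for whatever expensive work (such as importing Python-APT and opening the package cache) produces the result text.

```python
import os
import time

CACHE_PATH = "/tmp/apt-check.cache"  # hypothetical cache file location
MAX_AGE_SECONDS = 300                # hypothetical value for "N"


def cached_result(compute, cache_path=CACHE_PATH, max_age=MAX_AGE_SECONDS):
    """Return cached text if it is fresh, else call compute() and store it.

    `compute` is only called on a cache miss, so any heavy imports can
    live inside it and are skipped entirely when the cache is fresh.
    """
    try:
        age = time.time() - os.path.getmtime(cache_path)
        if age < max_age:
            with open(cache_path) as cache_file:
                return cache_file.read()
    except OSError:
        pass  # no cache yet, or it is unreadable: fall through and rebuild

    result = compute()
    # Write to a per-process temp file, then rename atomically, so readers
    # never see a half-written cache.
    tmp_path = "{}.{}.tmp".format(cache_path, os.getpid())
    with open(tmp_path, "w") as tmp_file:
        tmp_file.write(result)
    os.replace(tmp_path, cache_path)
    return result
```

With something like this in front, thirty callers within N seconds would cost one real check plus twenty-nine small file reads.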