Debian – Unpredictable memory explosions

Tags: debian, debugging, linux, memory, monitoring

The main server at my company has recently been having a lot of downtime. For reasons that neither I nor the other admins can determine, its memory usage randomly (and VERY suddenly) explodes. It exhausts all the memory, becomes unresponsive, and then we have to reboot it. Very annoying. It's a Debian system; we haven't upgraded to Squeeze or anything, and it's been perfectly stable for a long time.

The problem is that the logs are totally useless: nothing in them indicates that anything is going wrong. I'm guessing that some buggy process is hogging all of the memory, but at the moment I have NO way of proving that. Remote logging is no help either, because the system isn't complaining about anything; as far as it's concerned, everything is peachy.

So my question is: how would you approach this problem? Any insight is appreciated. Thanks.

Best Answer

atop is pretty good at monitoring and logging resource usage. It can be run interactively or as a service; the Debian package sets it up to log to /var/log/atop.log every ten minutes (edit /etc/init.d/atop if you want a shorter interval). You can then replay the log with

atop -r /var/log/atop.log -b hh:mm -mM

where hh:mm is a time a few minutes before the incident. The -m flag selects the memory view and -M sorts processes by memory consumption, which is the combination you want for memory problems. While replaying, press t to step forward to the next sample and T to step back to the previous one. The A sort (most active resource) is also worth a try.
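If atop can't be installed for some reason, the core idea (periodically recording which processes hold the most resident memory, so the log names the culprit just before the box falls over) can be sketched in a few lines of Python. This is not how atop works internally; it is a hypothetical stand-in that assumes Python 3 and a Linux /proc filesystem, reads the VmRSS field from /proc/<pid>/status, and uses made-up function names (parse_vmrss_kb, parse_name, top_rss, log_forever).

```python
import os
import time


def parse_vmrss_kb(status_text):
    """Return VmRSS (resident set size, in kB) from the text of /proc/<pid>/status.

    Kernel threads have no VmRSS line, so they are reported as 0.
    """
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # The line looks like "VmRSS:\t   12345 kB"; field 1 is the number.
            return int(line.split()[1])
    return 0


def parse_name(status_text):
    """Return the command name from the "Name:" line, or "?" if it is missing."""
    for line in status_text.splitlines():
        if line.startswith("Name:"):
            return line.split(":", 1)[1].strip()
    return "?"


def top_rss(proc_root="/proc", n=5):
    """Return up to n (rss_kb, pid, name) tuples, largest resident size first."""
    rows = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            with open(os.path.join(proc_root, entry, "status")) as f:
                text = f.read()
        except OSError:
            continue  # e.g. the process exited between listdir() and open()
        rows.append((parse_vmrss_kb(text), int(entry), parse_name(text)))
    rows.sort(reverse=True)
    return rows[:n]


def log_forever(interval=60, n=5):
    """Print a timestamped top-n snapshot every `interval` seconds, forever."""
    while True:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        for rss_kb, pid, name in top_rss(n=n):
            # flush so each sample goes to the OS at once instead of
            # sitting in Python's output buffer
            print(f"{stamp} pid={pid} rss_kb={rss_kb} name={name}", flush=True)
        time.sleep(interval)
```

Saved as, say, memwatch.py (a hypothetical name), it could be started with nohup python3 -c 'import memwatch; memwatch.log_forever(60)' >> /var/log/memwatch.log & and after the next incident the last few snapshots should show which process was growing. The 60-second interval is an arbitrary choice. atop records far more than this, so the sketch is only worth it as a fallback.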
