Debugging lock-up – systemd loses the logs

debugginglogssystemdsystemd-journald

Ever since I have "upgraded" to systemd on Arch Linux, I keep losing logs when an unexpected lockup happens. I hit the same log-losing issue one month ago and just hit the issue again. There are also independent other confirmations.

Situation:

  • While doing some stuff in Java and with network-related utilities, I saw that KDE (the clock) was frozen. The CPU fan became noisy and the heat was rising. The mouse pointer could still be moved though.
  • I tried to ssh from another machine (failed due to "no route to host")
  • I waited a few minutes, perhaps the NMI watchdog could kill the offending task. No dice.
  • Ctrl + Alt + F1 did not work either, even after SysRq + R
  • Since the above steps did not work, I decided to issue the SysRq sequence R-E-I. After E, the screen became black, but no console either. Not even after SysRq + K
  • So, this session appears to be lost, the only thing that can be done is gathering debugging information. Looking at Wikipedia, I decided to press SysRq + d (display held locks) among some others.
  • After pressing SysRq + S I waited a second and then rebooted with SysRq + B.
  • After rebooting and logging into a console, I saw no traces of any crash. The most recently logged entry was from using Wireshark, but there was still a gap of 45 minutes.

(I was running Linux v3.8-rc5-218-ga56e160 btw)

So, how can I make sure that my logs are retained when abnormally rebooting due to a lockup?

Best Answer

So I asked on the #systemd IRC channel and it turns out that journald (the logging daemon of systemd) does not periodically flush the logs to disk at all. This means that your logs are always at risk at any time.

Sending SIGUSR2 to the journald causes logs to be written to disk, but if you do this multiple times, many files will be created. (the option is actually described as "log rotating").

In the end, I decided to go with another suggestion: using a dedicated syslog daemon for collecting kernel logs. As rsyslog was suggested (and I had already experience with it), I explored that option further. I have written some more details in the Arch Wiki about using rsyslog.

The idea is to run rsyslog, collecting only data from the kernel facility. As rsyslog reads from /proc/kmsg (which allows only a single reader) and journald reads from /dev/kmsg (multiple readers allowed), there is no way that the daemons lose logs (very important to me!). Configure rsyslog to write kernel messages to a file and make sure that this file is rotated to prevent eating your disk space.

This solution is not perfect:

  • Other logs (for example, from NetworkManager) are lost. This could be solved by forwarding more logs from syslog to journald (this means duplication!)
  • Duplication of logs. The kernel messages are written to two files. This is a non-issue, in general the number of logs are small and you would rather have more copies of the logs than none. You can also use fast tools like grep on the single log file or the more slower, but fancier journalctl.

There is a TODO item for flushing logs more frequently, but that is still not reliable enough:

journal: send out marker messages every now and then, and immediately sync with fdatasync() afterwards, in order to have hourly guaranteed syncs.

Now, hopefully systemd/journald will get an option to write the logs to disk, but meanwhile we can combine tools to achieve the goal.

Related Question