Problem: At High Load the System Stops Logging

Christian_Voelker · November 28, 2018, 4:40pm

Problem:

At high log volume the system stops logging.

Environment

UCS 4.2 or newer with systemd.

You see entries like this in the log file, and logging stops after this entry:
Nov 18 13:56:36 mail systemd[1]: systemd-journald.service watchdog timeout (limit 1min)!

Even a restart of rsyslogd does not resume logging. It looks as if only a server reboot reactivates logging.
journald writes its entries to disk every five minutes. Under high load, many entries have to be written at once and the write can take more than a minute. While it is writing, journald does not send watchdog updates, so the systemd internal watchdog kills it. journald is then stopped and no longer logs anything.
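To check whether a host has already hit this, you can search the log for the watchdog message shown above. This is a minimal sketch: the function name is made up for illustration, and the default path /var/log/syslog is an assumption about where rsyslog writes on your system.

```shell
# Count journald watchdog timeouts recorded in a syslog-style file.
# $1: log file to search (default /var/log/syslog, an assumed location)
count_journald_watchdog_timeouts() {
    # grep -c prints the number of matching lines but exits 1 when that
    # number is 0, so "|| true" keeps the function's exit status at 0.
    grep -cF 'systemd-journald.service watchdog timeout' "${1:-/var/log/syslog}" || true
}
```

Calling count_journald_watchdog_timeouts /var/log/syslog prints the number of matching lines; anything above 0 suggests journald was killed by the watchdog at least once.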

Solution

Step 1

First, a reboot of the server is not needed. Just restart journald with:
systemctl restart systemd-journald
This should re-enable logging.
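If you want to confirm that logging really is back, the restart can be followed by a test message. The sketch below is only one possible way to do that: the function name and the journald-check tag are hypothetical, and it assumes logger and journalctl are available and that you run it as root.

```shell
# Restart journald and confirm that a fresh test message reaches the journal.
restart_journald_and_verify() {
    marker="journald-restart-check-$$"
    systemctl restart systemd-journald || return 1
    systemctl is-active --quiet systemd-journald || return 1
    # Send a uniquely marked message through the normal syslog path.
    logger -t journald-check "$marker"
    sleep 1
    # Succeed only if the marker shows up among the most recent entries.
    journalctl --since "1 minute ago" --no-pager | grep -qF "$marker"
}
```

The function returns 0 when the marker is found and a non-zero status otherwise, so it can be used in a condition such as: restart_journald_and_verify && echo "logging works".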

Step 2

The 1 minute timeout is not hardcoded in journald itself. It comes from the WatchdogSec setting in the [Service] section of the systemd-journald.service unit, the same setting that any other service unit can use. It is not an option of /etc/systemd/journald.conf, so setting it there has no effect. You can try raising it in a drop-in for the unit (for example one created with systemctl edit systemd-journald) that contains:
[Service]
WatchdogSec=180
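As a rough sketch of how such a value could be applied: WatchdogSec is a [Service] directive of a unit, so one option is a drop-in file for systemd-journald.service. The function name and the file name watchdog.conf are made up for illustration, and whether a higher limit actually helps on your system is untested here.

```shell
# Write a drop-in that raises the watchdog limit for systemd-journald.service.
# $1: timeout in seconds (default 180)
# $2: drop-in directory (default is the usual override location)
write_journald_watchdog_dropin() {
    timeout="${1:-180}"
    dir="${2:-/etc/systemd/system/systemd-journald.service.d}"
    mkdir -p "$dir" || return 1
    printf '[Service]\nWatchdogSec=%s\n' "$timeout" > "$dir/watchdog.conf"
}

# To apply it on a real system (as root):
#   write_journald_watchdog_dropin 180
#   systemctl daemon-reload
#   systemctl restart systemd-journald
#   systemctl show systemd-journald -p WatchdogUSec   # shows the effective limit
```

The second parameter exists mainly so the function can be tried against a scratch directory before touching /etc.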

Step 3

To improve performance, consider optimizing the filesystem with any combination of the following:

store the journal directory (/var/log/journal) on separate physical storage
put the filesystem on a very fast disk (e.g. an SSD)
optimize filesystem speed with the ext4 mount options noatime,data=writeback. Be aware of the risks: with data=writeback, files can contain stale data after a crash!
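Before changing mount options it helps to see what is currently in effect. The following sketch assumes findmnt from util-linux is installed and that the journal lives in /var/log/journal, journald's persistent storage directory; both function names are hypothetical.

```shell
# Succeed if the comma-separated mount option list $1 contains option $2.
has_mount_option() {
    case ",$1," in
        *",$2,"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Succeed if the filesystem holding a path is mounted with noatime.
# $1: path to inspect (default /var/log/journal)
journal_fs_uses_noatime() {
    opts=$(findmnt -n -o OPTIONS --target "${1:-/var/log/journal}") || return 2
    has_mount_option "$opts" noatime
}
```

The same helper can check for the journaling mode, for example has_mount_option "$opts" data=writeback. Wrapping the list in commas keeps relatime from being mistaken for noatime.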