Capture Linux kernel crash information

Problem:

The Linux kernel crashes repeatedly, and the crash information needs to be collected for further analysis.

Solution:

  1. Follow Analyze boot problems to disable the splash screen (and to set up a serial console):

     ucr set grub/bootsplash=nosplash grub/quiet=no grub/loglevel=7
    
  2. Follow Disable console blanking to disable console blanking (first check if the package console-tools is installed, or install it via univention-install console-tools):

     sed -i -re 's/^#?((BLANK|POWERDOWN)_TIME=).*/\10/;s/^#?(BLANK_DPMS=).*/\1off/' /etc/kbd/config /etc/console-tools/config
    
  3. Configure a central syslog server to collect syslog messages from the other hosts in your domain over the network. This is described in Setup central syslog server.

  4. Make sure SysRq is enabled, which allows capturing data even when the system is no longer reachable via network:

     ucr set grub/append="$(ucr get grub/append) sysrq_always_enabled=1"
    

    See: What can be done as a last resort before a hard reset if a Linux system appears to have stopped responding and may need to be restarted?

  5. Configure the Linux kernel to not reboot automatically after a panic:

     ucr set grub/append="$(ucr get grub/append) panic=0"
    
  6. Optionally: Configure the Linux kernel to handle all OOPSes as panics; otherwise the kernel keeps running and might crash later on after the inconsistent state has caused further problems:

     ucr set grub/append="$(ucr get grub/append) oops=panic"
    
  7. Xen hypervisor only: If the Xen hypervisor is used, a crash of the dom0 kernel will lead to a reboot of the hypervisor.
    This should be disabled:

     ucr set grub/xenhopt="$(ucr get grub/xenhopt) noreboot=yes"
    

    The number of rows of the text console can be increased by choosing a different text mode:

     ucr set grub/xenhopt="$(ucr get grub/xenhopt) vga=text-80x50"
    

    For Xen PV domUs, the “on_crash” behavior needs to be reconfigured to “coredump-restart” or “coredump-destroy” using “virsh edit” or similar; see Events configuration for a detailed description.

  8. Expert: Configure a crash-dump kernel to dump the state of the crashed kernel to another host. (For Xen, see above, as kexec does not work with Xen as described here.)
    This requires a second host (called ‘collector’ here) with enough disk space to receive and store the crash dump, which consists of the complete RAM (unencrypted, may contain credentials) of the crashing server. Depending on the RAM size and network speed, it can take several minutes until the crashed server is automatically rebooted.

    1. Download the latest attachment from Bug #25918 and copy it to all hosts.
    2. Run it once on the ‘collector’ with the parameter -c to set up the listening service on TCP port 666. Dumps are stored in the directory /var/lib/crash/ unless a different directory is given using the option -o.
    3. Run it on all other servers, passing the FQDN of the collector. Depending on the number of kernel modules needed to boot the server, it might be necessary to increase the size reserved for the crash kernel using the option -s.

    The crash dump captured this way should be sent to Univention on request for further analysis. Please include information about the Linux kernel version and architecture:

     dpkg-query -W "linux-image-$(uname -r)"
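
The kernel parameters added to grub/append in steps 4 to 6 only apply to a kernel that has been booted with them, so it can be useful to check the running system after the next reboot. The following is a minimal sketch of such a check, not part of the original procedure: the function name check_cmdline and its output format are made up for illustration, and the list of parameters has to be adapted to the steps that were actually applied.

```shell
#!/bin/sh
# Sketch: report which expected parameters appear on a kernel command line.
# check_cmdline is a hypothetical helper, not a UCS or kernel tool.

check_cmdline() {
    # $1: kernel command line; remaining arguments: parameters to look for
    cmdline=" $1 "
    shift
    missing=0
    for param in "$@"; do
        # Match whole, space-separated words only
        case "$cmdline" in
            *" $param "*) echo "ok: $param" ;;
            *)            echo "missing: $param"; missing=1 ;;
        esac
    done
    # Return 0 if all parameters were found, 1 otherwise
    return "$missing"
}

# Check the command line of the running kernel (Linux only)
if [ -r /proc/cmdline ]; then
    check_cmdline "$(cat /proc/cmdline)" \
        sysrq_always_enabled=1 panic=0 oops=panic || true
fi
```

A parameter reported as missing usually means that the system has not been rebooted since the change, or that the corresponding step was skipped (oops=panic is optional).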