UCS Backup Node system hangs after random hours of operation

Hello everyone,

I am running a UCS Backup Node on separate hardware from the Master Node.
The installation ran stably for about 3 years. About 6 months ago the system started to hang randomly after several days, sometimes weeks, of normal operation.

Symptoms of the hang:

  • The system still responds to ping
  • All services are unavailable
  • SSH login times out
  • Logging in directly at the machine's TTY lets me enter the username, but it then hangs before showing the password prompt.

Over the past weeks the frequency has increased from about once per month to practically once every 24 hours.

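Kernel log, syslog, etc. do not show any errors prior to the hang. These are the last syslog entries before the most recent hang, followed by the first entries after the reboot: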

Oct 11 00:25:59 magnesium systemd[1]: run-docker-runtime\x2drunc-moby-0e16607eacf1e65fd7021360b56a2c001771f7371afd314b976c32da31d49b1e-runc.mU5RXb.mount: Succeeded.
Oct 11 00:28:32 magnesium named[1713]: address not available resolving 'yr3_rPREoTIm3vJg4nrAFA339i.dbl.spamhaus.org/A/IN': 2a01:4f8:c17:aba4:d9:900b:3e:10f6#53
Oct 11 00:28:32 magnesium named[1713]: address not available resolving 'yr3_rPREoTIm3vJg4nrAFA339i.dbl.spamhaus.org/A/IN': 2a05:f480:2000:1246:9998:cf46:6cf8:1e7a#53
Oct 11 00:28:32 magnesium named[1713]: address not available resolving 'yr3_rPREoTIm3vJg4nrAFA339i.dbl.spamhaus.org/A/IN': 2a03:f80:70:213:183:54:56:e66b#53
Oct 11 00:28:32 magnesium named[1713]: address not available resolving 'yr3_rPREoTIm3vJg4nrAFA339i.dbl.spamhaus.org/A/IN': 2001:648:2000:340:68b:900d:39:b036#53
Oct 11 00:28:32 magnesium named[1713]: address not available resolving 'yr3_rPREoTIm3vJg4nrAFA339i.dbl.spamhaus.org/A/IN': 2a03:f80:3991:192:71:26:25:d0c6#53
Oct 11 00:28:32 magnesium named[1713]: address not available resolving 'yr3_rPREoTIm3vJg4nrAFA339i.dbl.spamhaus.org/A/IN': 2a03:b0c0:3:e0::283:a00e#53
Oct 11 00:28:32 magnesium named[1713]: address not available resolving 'yr3_rPREoTIm3vJg4nrAFA339i.dbl.spamhaus.org/A/IN': 2a00:12a8:8000::fff0:3#53
Oct 11 00:28:32 magnesium named[1713]: address not available resolving 'yr3_rPREoTIm3vJg4nrAFA339i.dbl.spamhaus.org/A/IN': 2a02:27ac::576#53
Oct 11 11:06:22 magnesium kernel: [    0.000000] Linux version 5.10.0-0.deb10.30-amd64 (debian-kernel@lists.debian.org) (gcc-8 (Debian 8.3.0-6) 8.3.0, GNU ld (GNU Binutils for Debian) 2.31.1) #1 SMP Debian 5.10.218-1~deb10u1 (2024-06-12)
Oct 11 11:06:22 magnesium kernel: [    0.000000] Command line: BOOT_IMAGE=/vmlinuz-5.10.0-0.deb10.30-amd64 root=/dev/mapper/vg_ucs-root ro quiet
Oct 11 11:06:22 magnesium kernel: [    0.000000] BIOS-provided physical RAM map:

The hang occurred at around 00:28 (the last log entry is from 00:28:32). At 11:06 that morning I manually restarted the machine by pulling the plug.
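
To find the same silent window in a longer syslog, something like the following sketch could be used. It assumes the classic syslog timestamp format shown in the excerpt above. Those timestamps carry no year, so the default of 2024 is an assumption based on the kernel build date in the boot line. The names `parse_ts` and `largest_gap` are made up for this example, and the sample lines are copied from the excerpt.

```python
from datetime import datetime

def parse_ts(line, year=2024):
    """Parse the leading 'Mon DD HH:MM:SS' syslog timestamp; return None if absent.

    The year is an assumption: classic syslog lines do not include one.
    """
    try:
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None

def largest_gap(lines, year=2024):
    """Return (gap, last_line_before, first_line_after) for the longest silence,
    or None if fewer than two lines have a parseable timestamp."""
    best = None
    prev_ts, prev_line = None, None
    for line in lines:
        ts = parse_ts(line, year)
        if ts is None:
            continue  # skip continuation lines or damaged timestamps
        if prev_ts is not None:
            gap = ts - prev_ts
            if best is None or gap > best[0]:
                best = (gap, prev_line, line)
        prev_ts, prev_line = ts, line
    return best

# A few lines taken from the excerpt above (not the complete log).
SAMPLE = [
    "Oct 11 00:28:32 magnesium named[1713]: address not available resolving 'yr3_rPREoTIm3vJg4nrAFA339i.dbl.spamhaus.org/A/IN': 2a00:12a8:8000::fff0:3#53",
    "Oct 11 00:28:32 magnesium named[1713]: address not available resolving 'yr3_rPREoTIm3vJg4nrAFA339i.dbl.spamhaus.org/A/IN': 2a02:27ac::576#53",
    "Oct 11 11:06:22 magnesium kernel: [    0.000000] Command line: BOOT_IMAGE=/vmlinuz-5.10.0-0.deb10.30-amd64 root=/dev/mapper/vg_ucs-root ro quiet",
    "Oct 11 11:06:22 magnesium kernel: [    0.000000] BIOS-provided physical RAM map:",
]

gap, before, after = largest_gap(SAMPLE)
print(gap)  # 10:37:50
```

On these lines the silent window is 10 hours 37 minutes 50 seconds, ending with the manual power cycle. The timestamp of the last line before the gap only gives a lower bound for when the hang started, since the machine may simply have had nothing to log in between.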

Over the last few weeks I replaced the RAM, the operating system SSD and finally the motherboard, without any change, which leads me back to the software level.

Any ideas where to look next?

Thanks for the help!
