Hi,
I have been evaluating UCS for a while, and in the past few weeks I ran into a strange issue with it.
Let me describe my setup:
Physical server
Proxmox 8.1.3 environment on ZFS, UCS on a KVM machine
The KVM machine parameters:
- 6GB RAM (balloon=0), CPU: 8 (host)
- BIOS: OVMF (UEFI); Machine: i440fx; SCSI controller: VirtIO SCSI single
- 7 disks (zfs vols), separated disks for different purposes, common setup: cache=writeback, discard=on, iothread=1, ssd=1, async_io=io_uring
Relevant regular task on Proxmox:
automatic guest snapshots every hour, using Proxmox's own snapshot tool (not just a plain ZFS snapshot)
UCS KVM guest fstab
The UCS 5.0-6 installation is nothing special; LVM is not used, and different functions live on different disks/partitions:
/ ext4 errors=remount-ro,user_xattr
/boot/efi vfat umask=0077
/home ext4 discard,noatime,user_xattr,usrquota
/var/flexshares ext4 discard,noatime,user_xattr
/var/lib/univention-ldap ext4 discard,noatime,user_xattr
/var/log ext4 discard,noatime,user_xattr
/var/univention-backup ext4 discard,noatime,user_xattr
16GB swap partition
- qemu-guest-agent was not installed initially
- when I installed qemu-guest-agent, the system crashed more severely, almost every time becoming completely unresponsive; the malfunction seemed to start around the fsfreeze command (used by Proxmox snapshots, which should take at most a few seconds)
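In case it helps anyone reproduce this: the freeze step can be triggered by hand from the Proxmox host, independent of the hourly snapshot job. A sketch, assuming the guest agent is installed and using a placeholder VMID of 100:

```shell
# Placeholder VMID 100: trigger the same QGA fsfreeze cycle that a
# Proxmox snapshot performs, checking the state before and after.
qm guest cmd 100 fsfreeze-status   # should report "thawed"
qm guest cmd 100 fsfreeze-freeze   # returns the number of frozen filesystems
qm guest cmd 100 fsfreeze-status   # should report "frozen"
qm guest cmd 100 fsfreeze-thaw
```

If the guest already wedges on a manual freeze/thaw cycle like this, the snapshot job itself can be ruled out as the trigger.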
Symptoms
- After several hours (20-48) the system becomes slow or unresponsive.
- I even saw load averages like 305/305/305 while CPU usage was only 0.47% (on Linux the load average also counts tasks in uninterruptible D state, so hung I/O tasks inflate it even on an idle CPU).
- Several hung processes: in one case systemd-journald, in another slapd, etc.; it varies.
- A common hung process: jbd2/sdX, on a different disk each time, seemingly at random; this could be the root cause of all the other issues. Examples from different days (sde holds the LDAP partition and sdc the log partition, which is why journald or slapd end up hung):
ucs-6743 kernel: [29967.458782] jbd2/sde1-8 D 0 514 2 0x80000000
....
ucs-6743 kernel: [153095.267685] jbd2/sdc1-8 D 0 488 2 0x80000000
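For reference, when the box is in this state the stuck tasks can be enumerated from an already-open SSH session; this is a generic sketch, nothing UCS-specific:

```shell
# List all tasks in uninterruptible sleep (state "D") - these are the
# processes blocked on I/O; the jbd2/sdX threads should show up here.
# wchan shows the kernel function each task is waiting in.
ps -eo pid,stat,wchan:32,comm --no-headers | awk '$2 ~ /^D/'
```

As root, `echo w > /proc/sysrq-trigger` additionally dumps the kernel stacks of all blocked tasks into dmesg, which shows exactly where jbd2 is waiting.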
Assumptions
- In some earlier attempts, after installing qemu-guest-agent, fsfreeze seemed to trigger the problem; at least that is what I found in the logs after resetting the UCS VM.
- Since reinstalling UCS without qemu-guest-agent the problem still exists, but sometimes I can still log in, or use an already open SSH connection, to check the server's internal state. Even then, accessing the filesystem is sometimes impossible because of the jbd2/sd* issue.
- So at this point I think the jbd2/sd* issue starts a chain reaction:
→ random processes end up hung,
→ which causes various hangs/malfunctions and an unresponsive system
Now I am starting to test the following changes (one at a time):
- UCS filesystem mount options: I removed discard from all the filesystems where I previously used it (see above).
- changing settings at the Proxmox level: async_io, discard, etc.
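The Proxmox-side changes can be made per disk with qm set; a sketch with a placeholder VMID of 100, disk scsi0, and volume name local-zfs:vm-100-disk-0 (adjust all three to your setup), changing one variable at a time:

```shell
# Placeholder VMID 100 / disk scsi0 / volume local-zfs:vm-100-disk-0.
# 1) switch from io_uring to native AIO, keeping everything else:
qm set 100 --scsi0 local-zfs:vm-100-disk-0,cache=writeback,discard=on,iothread=1,ssd=1,aio=native
# 2) additionally drop discard on the Proxmox side:
qm set 100 --scsi0 local-zfs:vm-100-disk-0,cache=writeback,iothread=1,ssd=1,aio=native
```

Inside the guest, after removing discard from /etc/fstab, `systemctl enable --now fstrim.timer` keeps periodic TRIM working without the inline discard path.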
My question: do you have any clue what could be going on? What did I miss?
Any hints are welcome.
Thanks,
István