When I installed qemu-guest-agent, the system crashed even harder: almost every time it became completely unresponsive, and the malfunction seemed to start around the fsfreeze command (used for Proxmox snapshots, which should take at most a few seconds).
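For reference, the same freeze/thaw cycle can be triggered by hand from the Proxmox host, which is roughly what a snapshot or backup does when the agent is enabled (VM ID 100 is just an example):

# ask the guest agent to freeze all mounted filesystems inside the guest
qm guest cmd 100 fsfreeze-freeze
# show how many filesystems are currently frozen
qm guest cmd 100 fsfreeze-status
# thaw again; if this hangs, the guest is in the same state as after a failed snapshot
qm guest cmd 100 fsfreeze-thaw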
Symptoms
After several hours (20-48) the system becomes slow or unresponsive.
I even found load averages like 305/305/305 while the CPU was only 0.47% used.
Several dead (hung) processes: in one case systemd-journald, in another case slapd, etc.; it varies.
Common dead processes: jbd2/sdX, with a different disk each time (seemingly random); this could be the root cause of all the other issues. Examples from different days (sde holds the LDAP data, sdc the log partition, which is why journald or slapd end up dead):
ucs-6743 kernel: [29967.458782] jbd2/sde1-8 D 0 514 2 0x80000000
....
ucs-6743 kernel: [153095.267685] jbd2/sdc1-8 D 0 488 2 0x80000000
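These entries come from the kernel's hung-task detector. A quick way to look for them, and for the processes currently stuck in uninterruptible D state, inside the guest while it is still reachable:

# kernel messages from the hung-task detector
dmesg -T | grep -E "blocked for more than|jbd2/sd"
# processes currently stuck in uninterruptible sleep (state D)
ps -eo pid,stat,comm,wchan | awk '$2 ~ /D/'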
Assumptions
In some previous attempts, when I had qemu-guest-agent installed, fsfreeze seemed to trigger the problem; at least that is what I found in the logs after resetting UCS.
Since I reinstalled UCS without qemu-guest-agent, the problem still exists, but sometimes I can log in, or I can use an already-open SSH connection to check the internal state of the server. Even then, accessing the filesystem is sometimes impossible due to the jbd2/sd* issue.
So, at this point I think the jbd2/sd* issue starts a chain reaction:
→ random processes end up dead,
→ which causes various hangs/malfunctions/an unresponsive system.
Now I am starting to test with the following changes (one at a time; see the sketch after this list):
UCS filesystem mount options: I removed discard from all the filesystems where I previously used it (see above).
changing something at the Proxmox level: async_io, discard, etc.
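Roughly what these two changes look like; the mount point, VM ID, storage and disk names below are placeholders, not my exact values:

# 1) inside the UCS guest: drop "discard" from the relevant /etc/fstab lines, e.g. change
#    UUID=xxxx  /var/log  ext4  defaults,discard  0 2
# to
#    UUID=xxxx  /var/log  ext4  defaults  0 2
# then remount (or reboot):
mount -o remount /var/log

# 2) on the Proxmox host: re-set the disk line with different options, e.g. another aio mode;
#    copy the existing line from "qm config 100" first and change only the option under test
qm set 100 --scsi1 local-zfs:vm-100-disk-1,aio=native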
My question: do you have any clue what could be going on? What did I miss?
Any hints are welcome.
Even if it doesn’t help in the search for the problem, I would like to give some feedback.
We operate several PVEs, from an old 6.4.15 to the current 8.x. UCS 4.x - 5.x runs on all of them, all with qemu-guest-agent installed. I have not noticed any problems here.
Possible differences between our setups:
q35 vs. i440fx (as I remember, q35 also died before; I will try to check)
Proxmox filesystem: ZFS (in my case) vs. anything else
Proxmox: snapshots vs. backups
number of disks and partitions used in UCS
probably LVM vs. no LVM (in my case) in UCS
mount options in UCS
It would be nice to know the following in your use cases:
proxmox volume manager used:
( ) zfs
( ) lvm
proxmox snapshots:
[ ] autosnap
[ ] pure zfs snapshot
[ ] no snapshot
other: ___________
Filesystem/vol manager used in UCS
[ ] lvm used
[ ] no lvm
[ ] ext4
[ ] xfs
UCS filesystem mount options
[ ] discard
[ ] noatime
[ ] relatime
other: ____
Otherwise I have no idea why my setups are dying painfully while a lot of other guests are running happily without any issue.
Additionally, when I installed the very first instance of UCS in dumb mode (one drive, automatic filesystem management, just click-click-click, etc.), it worked for months. Then I decided to install a production candidate and remove that messy test instance. That is when my horror story started (of course, I had not backed up that test VM, which I considered expendable).
Cool, thank you for your feedback!
I will try to narrow the gap between our configurations.
Side note: autosnap in Proxmox produces snapshots that appear in the web GUI, and besides the ZFS snapshot it handles the same things Proxmox's own snapshot does: config snapshot, fsfreeze via qemu-guest-agent, etc.
This could be a serious difference, as a native ZFS snapshot probably never triggers any fsfreeze in the guest, even if you use qemu-guest-agent.
At least, when I checked the logs, the issue seems to start around the time the snapshots happen (cron: 5 * * * *).
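To correlate the two, I grep the guest journal around a snapshot time; the timestamps below are only examples:

# kernel messages in a window around an hourly snapshot
journalctl -k --since "2024-01-09 17:00" --until "2024-01-09 17:20" | grep -iE "jbd2|blocked"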
Discard is also suspicious, but I have less proof for it than for autosnap.
proxmox-autosnap.py --autosnap --vmid all --label hourly --keep 23 --mute
In other words, this snapshot is not zfs snap -r rpool@hourly-2024-01-09-1705; it uses qm snapshot <vmid> <snapname> [OPTIONS].
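To make the difference explicit (VM ID, pool and snapshot names are just examples):

# pure ZFS snapshot: taken on the host, completely invisible to the guest, no fsfreeze
zfs snapshot -r rpool@hourly-2024-01-09-1705
# Proxmox-style snapshot (what proxmox-autosnap.py and the GUI use): also saves the VM config
# and, with the guest agent enabled, calls fsfreeze/thaw inside the guest
qm snapshot 100 hourly-2024-01-09-1705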
A pure (native) ZFS snapshot has nothing to do with KVM or LXC; it only cares about the underlying ZFS volume/filesystem. In that case the KVM/LXC guest does not know about the snapshot or any related actions; it is totally invisible to the guests → absolutely zero downtime.
In contrast, the QEMU/LXC Proxmox-style snapshot is a complex action, and you cannot avoid it when you make a backup with vzdump or PBS (probably once per day).
Autosnap works like the original Proxmox snapshot utility, using the qm/pct commands.
As I run autosnap hourly, I run into this risky situation 24 times a day, while the same number of risky situations takes about 24 days for a standard user who runs daily backups.
I also run UCS on a Proxmox 8.1.3 server with no issues. I set mine up with, I think, all the defaults for a VM, except I chose host for the CPU. I use LVM for the VM disk, no cache, discard off, iothread off, SSD emulation off, async_io=io_uring (default).
I also use the guest agent. I have automatic backups at night, which do use fsfreeze and thaw, but I only do snapshots as needed when I upgrade or make changes. The only time I've had problems with backups is when my backup drive was failing. I use a 2.5" laptop drive for backups because it fits in the server and I somehow end up with a bunch of them; they tend to fail every year or so.
At this moment, without the discard mount option inside UCS, the server has survived 25 snapshots (qm snapshot, hourly) and is working as expected. Yesterday it died in less than 13 hours.
I even issued several fstrim runs without a problem.
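For the record, without discard in fstab, trimming can still be done on demand (or via the standard util-linux systemd timer); this is roughly what I ran inside the guest:

# trim all mounted filesystems that support it, verbosely
fstrim -av
# optional: enable the stock weekly timer instead of mounting with discard
systemctl enable --now fstrim.timer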
Anyway, based on the Proxmox forum, it is possible the problem is caused by pve-qemu-kvm/iothread/virtio(-scsi) in some cases.
I will upgrade my system to get a fresh kernel/KVM, probably tonight, in the maintenance window.
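Nothing fancy about the upgrade itself, just the usual path on the host; pveversion then shows which QEMU/kernel versions ended up installed:

# on the Proxmox host
apt update && apt full-upgrade
# check the resulting package versions
pveversion -v | grep -E "pve-qemu-kvm|kernel"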
Just an update: after 56 hours of running, the server is still working as expected; I have not experienced any jbd2 error.
Reminder: the only thing I did was remove the discard mount option from fstab inside the guest.
Update: I upgraded my Proxmox, as suggested on the Proxmox forum.
That means some QEMU changes happened that are related to this kind of jbd2 lock issue.
Summary:
The test before this upgrade ran for more than 58 hours without any issue. Before removing the discard mount option from the guest, the guest usually died within such a period (I experienced 12-48 hours, usually less than 28 hours). Again, I do qm snapshot hourly, which means 58 qm snapshots happened before the upgrade.
If one does only one snapshot per day, during backup, that corresponds to almost 2 months of running.
I started to test (use) the server and will keep watching.
If anything happens, I will report back, just for the record.
Update: after the server died in less than 12 hours, I restarted the test with new changes: I removed iothread from all disks.
Virtio-scsi-single is still present.
Qemu-guest-agent is installed, just to push the limits and increase the risk of a crash.
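The iothread change is done on the Proxmox side, roughly like this (VM ID, storage and disk names are placeholders; it takes effect after the VM is fully stopped and started again):

# check the current controller and disk lines
qm config 100 | grep -E "^(scsihw|scsi[0-9])"
# re-set each disk line without iothread=1 (copy over the remaining options you still want)
qm set 100 --scsi0 local-zfs:vm-100-disk-0
# the controller stays virtio-scsi-single
qm set 100 --scsihw virtio-scsi-single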
It has been working perfectly for 2 days and 22+ hours now. Promising.
Update: uptime is 5 days, 19:03, working perfectly, as expected.
It is safe to say that without iothread it works well (reference: the recent KVM config shown in the previous post).
Summary:
When using UCS in a Proxmox environment, it is necessary to double-check the tuning of the KVM VM settings, as iothread + virtio-scsi-single can cause problems (lockups/hangs) related to QEMU internals.
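A quick way to spot this combination on an existing host (plain qm commands, nothing UCS-specific):

# flag every VM that combines virtio-scsi-single with iothread=1 on at least one disk
for id in $(qm list | awk 'NR>1 {print $1}'); do
  qm config "$id" | grep -q "scsihw: virtio-scsi-single" || continue
  qm config "$id" | grep -q "iothread=1" && echo "VM $id: virtio-scsi-single + iothread"
done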