iSCSI, Multipath, Device Mapper Problems

Hello everyone,

we are currently seeing a problem with the device mapper during backups at a customer's site.

A quick overview of the setup:

We have a central storage system which is attached to both servers via Open iSCSI.
On top of that we run multipath,
and on top of multipath we have layered LVM.

The virtual disks for all the VMs on the servers live in these LVs.
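(For reference, the stack can be inspected from the shell roughly like this; the VG name is the one from the logs below, the rest are generic commands:)

# iSCSI sessions to the storage
iscsiadm -m session
# multipath maps built on top of the iSCSI paths
multipath -ll
# LVM physical volumes on the multipath maps, and the LVs holding the VM disks
pvs
lvs vg_vms_ssd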

For backups we use our own script, which creates an LVM snapshot of each of those disks and then copies it away compressed with lzop.

Afterwards the LVM snapshot is removed again.
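Roughly, the script does something like this per VM disk (a simplified sketch, not the real script; the snapshot size, names and target path are only examples):

#!/bin/bash
# simplified sketch of the snapshot backup, not the production script
VG=vg_vms_ssd                    # volume group holding the VM disks
LV=lv_vm_s-vwin32                # logical volume to back up (example)
TARGET=/backup/${LV}.img.lzo     # destination file (example path)

# create a temporary snapshot of the VM disk
lvcreate --snapshot --size 50G --name BACKUP_${LV} /dev/${VG}/${LV}

# copy the frozen snapshot away, compressed with lzop
dd if=/dev/${VG}/BACKUP_${LV} bs=4M | lzop -c > ${TARGET}

# remove the snapshot again
lvremove -f /dev/${VG}/BACKUP_${LV}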

All of this works quite well so far.

However, at irregular intervals and on different LVs we run into the problem that the snapshot cannot be removed.

The syslog then contains the following entries:

Feb 22 02:28:04 kvm01 dmeventd[8663]: Failed to parse snapshot params: .
Feb 22 02:28:04 kvm01 kernel: [1600399.431930] Buffer I/O error on dev dm-28, logical block 26214384, async page read
Feb 22 02:28:04 kvm01 dmeventd[8663]: device-mapper: waitevent ioctl on  failed: No such device or address
Feb 22 02:28:04 kvm01 lvm[8663]: dm_task_run failed, errno = 6, No such device or address
Feb 22 02:28:04 kvm01 lvm[8663]: vg_vms_ssd-BACKUP_lv_vm_s--vwin32 disappeared, detaching
Feb 22 02:28:04 kvm01 lvm[8663]: No longer monitoring snapshot vg_vms_ssd-BACKUP_lv_vm_s--vwin32
Feb 22 02:28:04 kvm01 systemd-udevd[17530]: inotify_add_watch(6, /dev/dm-28, 10) failed: No such file or directory
Feb 22 02:28:04 kvm01 multipathd: vg_vms_ssd-BACKUP_lv_vm_s--vwin32: adding map
Feb 22 02:28:04 kvm01 multipathd: vg_vms_ssd-BACKUP_lv_vm_s--vwin32: devmap dm-28 added
Feb 22 02:28:04 kvm01 multipathd: dm-28: remove map (uevent)
Feb 22 02:28:04 kvm01 multipathd: dm-28: devmap not registered, can't remove
Feb 22 02:28:04 kvm01 multipathd: dm-28: remove map (uevent)
Feb 22 02:28:09 kvm01 multipathd: vg_vms_ssd-BACKUP_lv_vm_s--vwin32-cow: adding map
Feb 22 02:28:09 kvm01 multipathd: vg_vms_ssd-BACKUP_lv_vm_s--vwin32-cow: devmap dm-31 added
Feb 22 02:28:09 kvm01 multipathd: dm-31: remove map (uevent)
Feb 22 02:28:09 kvm01 multipathd: dm-31: devmap not registered, can't remove
Feb 22 02:28:09 kvm01 multipathd: dm-31: remove map (uevent)
Feb 22 02:28:15 kvm01 multipathd: dm-30: mapname not found for 254:30
Feb 22 02:28:15 kvm01 multipathd: uevent trigger error
Feb 22 02:28:15 kvm01 multipathd: dm-30: remove map (uevent)
Feb 22 02:28:15 kvm01 multipathd: dm-30: remove map (uevent)
Feb 22 02:28:15 kvm01 systemd-udevd[17652]: inotify_add_watch(6, /dev/dm-28, 10) failed: No such file or directory
Feb 22 02:28:15 kvm01 multipathd: vg_vms_ssd-BACKUP_lv_vm_s--vwin32: adding map
Feb 22 02:28:15 kvm01 multipathd: vg_vms_ssd-BACKUP_lv_vm_s--vwin32: devmap dm-28 added
Feb 22 02:28:15 kvm01 multipathd: dm-28: remove map (uevent)
Feb 22 02:28:15 kvm01 multipathd: dm-28: devmap not registered, can't remove
Feb 22 02:28:15 kvm01 multipathd: dm-28: remove map (uevent)

Does anyone have an idea what might be causing this?

Best regards,
René

Hi,

sorry, I cannot really help much. But your error messages look to me like there is an issue with multipath. For LVM it is an I/O error, so LVM usually cannot write to the underlying device. It appears that device has gone away, or is no longer presented by multipathd.

Have you verified it is not a performance issue (i.e. high load on storage)?

Otherwise I would try to ask the multipathd team…

/KNEBB

Hi Knebb,

hard to say … as mentioned before, this only occurs during the LVM snapshot backups.

So yes, there might be a higher I/O load than usual.

But the storage is connected via two direct 10 GBit fibre (iSCSI) links.
Load balancing is configured within multipath (active/active).
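A multipath configuration for that typically looks something like this (a generic sketch of the relevant options, not our exact config):

defaults {
       path_grouping_policy multibus       # put all paths into one group, I/O is load-balanced
       path_selector "round-robin 0"       # spread requests round-robin across both links
}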

The backup traffic runs over a normal GBit switch, so I don't think that causes an I/O problem on the storage side, at least not as far as I can tell.

During normal daytime operation we are not seeing any issues.

Best regards,
René

Hi,

are you sure your storage is fast enough? I have seen setups with 4x 10 GBit interfaces, but with slow SATA disks configured as RAID 5 and almost no cache. The bottleneck there was definitely not the network…

Again, it appears the underlying storage is possibly too slow (in terms of response time, not bandwidth). With an active LVM snapshot there are additional operations to perform for every write…

Sorry I cannot help more, but that is how it appears to me. Have you tried active/passive, just to rule out a multipath issue?
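For reference, active/passive would roughly mean setting the grouping policy to failover in multipath.conf (just a sketch of the relevant option, not a complete config):

defaults {
       path_grouping_policy failover       # keep one active path, use the second one only as standby
}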

/KNEBB

Hi,

we are using a Fujitsu Eternus DX100 S4 with an SSD RAID 5 (7 disks) and a SAS RAID 6 (10 disks @ 10k rpm).

Although this might not be the best RAID configuration, I don't think there should be a performance problem.

But I could try switching back to active/passive.

Is it possible to switch to active/passive without downtime?

Best regards,
René

and what do you mean here?

Best regards

Hi,

I recall that LVM has had some difficulties with underlying (non-physical) devices. I once had an issue with LVM running on a DRBD device formatted with ext3, something to do with barriers… You might find a thread about it on the LVM mailing list.

I suspect this might be something similar… but I cannot explain exactly what it is.

/KNEBB

Hi Knebb,

I think I found the problem:

I forgot to blacklist all devices in multipath and to create exceptions only for the storage LUNs, so multipath also tried to handle the LVM snapshots, which is wrong.

I will give you an update tomorrow on whether this fixed it :-)

blacklist {
       wwid .*
}
blacklist_exceptions {
       wwid    "3600000e00d2900000029201200000000"
       wwid    "3600000e00d2900000029201200010000"
}
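In case it helps anyone else: the WWIDs for the exceptions can be read from one of the iSCSI path devices, and the changed blacklist can then be activated at runtime, roughly like this (the device name is only an example, and the scsi_id path may differ per distribution):

# print the WWID of one of the storage's path devices
/lib/udev/scsi_id -g -u -d /dev/sdb

# make the running multipathd re-read /etc/multipath.conf
multipathd -k'reconfigure'

# afterwards only the storage LUNs should remain as multipath maps
multipath -ll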


Best Regards,
René
