iscsi initiator problems

From: Harry Schmalzbauer <freebsd_at_omnilan.de>
Date: Mon, 24 Jan 2022 08:29:59 UTC
Hi all,

I'm not sure if this list is the right place for iSCSI (initiator) problems.

But since I first need to get a clue about the following geom messages, 
I think the scsi experts probably know best about them:

g_dev_ioctl: offset=312526336 length=2048
last message repeated 4 times
g_dev_ioctl: offset=630456320 length=6707200
last message repeated 4 times
g_dev_ioctl: offset=9738045440 length=16777216
last message repeated 4 times
g_dev_ioctl: offset=9939372032 length=15710208
last message repeated 4 times
g_dev_ioctl: offset=312529408 length=7168
last message repeated 4 times
g_dev_ioctl: offset=9723707392 length=14338048
last message repeated 4 times
g_dev_ioctl: offset=406888448 length=4283904
:
g_dev_ioctl: offset=379645440 length=14749696
last message repeated 4 times
g_dev_ioctl: offset=27418112 length=16777216
:
g_dev_ioctl: offset=295853568 length=16675840
last message repeated 4 times
g_dev_ioctl: offset=16927744 length=4608
last message repeated 4 times
g_dev_ioctl: offset=10027428352 length=512
last message repeated 4 times
g_dev_ioctl: offset=27418112 length=16777216
:
g_dev_ioctl: offset=211967488 length=6787584
last message repeated 4 times
g_dev_ioctl: offset=406897152 length=7680
last message repeated 4 times
g_dev_ioctl: offset=9738045440 length=16777216
:
g_dev_ioctl: offset=10023258112 length=4170240
last message repeated 4 times
g_dev_ioctl: offset=630456320 length=6787584
last message repeated 4 times
g_dev_ioctl: offset=406904832 length=4267520
last message repeated 4 times
g_dev_ioctl: offset=406888448 length=8704
last message repeated 4 times
g_dev_ioctl: offset=312529408 length=2567168
last message repeated 4 times
g_dev_ioctl: offset=16927744 length=4608
last message repeated 4 times
g_dev_ioctl: offset=233229312 length=512
last message repeated 4 times
g_dev_ioctl: offset=630456320 length=6792192
last message repeated 4 times
g_dev_ioctl: offset=233230336 length=7680
last message repeated 4 times
g_dev_ioctl: offset=637248512 length=16777216

I skipped a notable number of messages of the following form:
kernel: g_dev_ioctl: offset=NOT4kaligned length=16777216 (occurring many
times more often than the excerpt above suggests)

The excerpt above is from messages of the iscsi INITIATOR (stable/13, 
amd64).
They occur when a bhyve(4) guest (win2k832) uses a virtio-blk provided
disk.

The other side is also FreeBSD stable/13 (amd64) and ctl/ctld(8) 
providing a ZVOL-dataset block backend.
In ctld(8), I define a blocksize of 4k (volblocksize is 64k) for the ctl 
target.

I've been using this kind of provider/_target_ setup for years _without any
problems_ with ESXi and/or the Windows (soft-)iSCSI-initiator.


The failure of the ctld-IP-iscsid->virtio-blk chain is completely
unclear to me,
and the only obvious information in the kernel error messages shown above
is that at least one of "offset" or "length" is _not_ 4k aligned.
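
To double-check that observation, here is a minimal Python sketch that
computes the remainder of "offset" and "length" modulo 4096 for a few lines
copied from the excerpt above. The helper name misalignment() and the
choice of sample lines are mine and purely illustrative; it only inspects
the logged numbers and says nothing about why the kernel prints them.

```python
import re

BLOCK = 4096  # blocksize configured in ctld(8) for the ctl target

# Four lines copied verbatim from the kernel log excerpt above.
LOG = """\
g_dev_ioctl: offset=312526336 length=2048
g_dev_ioctl: offset=9738045440 length=16777216
g_dev_ioctl: offset=10027428352 length=512
g_dev_ioctl: offset=637248512 length=16777216
"""

PATTERN = re.compile(r"g_dev_ioctl: offset=(\d+) length=(\d+)")


def misalignment(log, block=BLOCK):
    """Return (offset, length, offset % block, length % block) per message."""
    rows = []
    for match in PATTERN.finditer(log):
        offset, length = int(match.group(1)), int(match.group(2))
        rows.append((offset, length, offset % block, length % block))
    return rows


for offset, length, off_rem, len_rem in misalignment(LOG):
    print(f"offset={offset} (rem {off_rem}) length={length} (rem {len_rem})")
```

For these four sample lines, at least one of the two values in each pair
leaves a non-zero remainder.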

The guest OS is aware of 4k drives (passing virtio-blk with
.sectorsize=4096) and I can transfer some hundred gigabytes, but at some
point, the OS marks the disk as inaccessible.
Most likely due to the errors from above (kernel log of the bhyve(4) 
host acting as iscsi initiator).

I also never had such an issue when using exactly the same server/target
with the _Windows software iSCSI-initiator_, so I guess it's a problem
with the FreeBSD iscsi initiator.

Unfortunately, before this error occurred, I was hit by another problem:
https://lists.freebsd.org/pipermail/freebsd-scsi/2017-June/007385.html

iscsi-initiator transfers over ethernet with JumboFrames (MTU of 9000)
don't work at all.
Not because allocating jumbo mbufs failed, but with the same
symptoms as the ones discussed in that thread.

Reducing the MTU to 4096 by setting a host route to each other on both sides
solved the problem (at least to some degree).
When the messages above appeared, this host route was set to 1500 bytes.
I disabled JumboFrames on the NICs but didn't want to change it manually 
on the NICs since I currently have no maintenance window to check for 
unwanted side effects.
(if_lagg(4) has two i210 and LACP, providing if_vlan(4) children on both 
sides).

Since networking on all my machines using Intel NICs has been broken to some
degree since the latest August driver updates (stable/13), I can imagine
NIC drops being the culprit in this case too.  But I have no chance to
either rule that out or confirm it by testing (production env, and neither
time nor equipment at hand to replicate it)...

What do you think the error messages above are telling me?


Thanks,

-harry