ZFS stalled after some mirror disks were lost
Ben RUBSON
ben.rubson at gmail.com
Fri Oct 13 16:58:35 UTC 2017
On 12 Oct 2017 22:52, InterNetX - Juergen Gotteswinter wrote:
> Ben started a discussion about his setup a few months ago, where he
> described what he was going to do. And, at least my prophecy (and i am
> pretty sure there were others, too) was that it would end up in a pretty
> unreliable setup (gremlins and other things included!) which is far,
> far away from being helpful in terms of HA. A single-node setup, with a
> reliable hardware configuration and as few moving parts as possible,
> would be way more reliable and flawless.
First, thank you for your answer, Juergen, I do appreciate it, of course, as
well as the help you offer below. So thank you! :)
Yes, the discussion on this list, more than one year ago now, was named
"HAST + ZFS + NFS + CARP", but it quickly moved toward iSCSI when I
floated the idea :)
I must say, after one year of production, that I'm rather happy with this
setup.
It works flawlessly (apart from the issue currently under discussion, of
course), and I have switched from one node to the other several times,
successfully.
The main purpose of the previous discussion was to have a second chassis
able to host the pool in case of a failure of the first one.
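For reference, a node switchover is roughly the following; a simplified
sketch, assuming the pool name "home" from below and that the standby
node's /etc/iscsi.conf already lists the other chassis' targets (the
commands are illustrative, not my exact runbook):
# On the active node:
zpool export home      # cleanly release the pool
# On the standby node:
iscsictl -Aa           # attach all targets from /etc/iscsi.conf, if needed
zpool import home      # take over the pool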
> (...) if the underlying block device works as expected, an
> error should be returned to zfs, but no one knows what happens in this
> setup during a failure. maybe it's some switch issue, or a network driver
> bug which prevents this and stalls the pool.
The issue only happens when I disconnect iSCSI drives; it does not occur
spontaneously by itself.
So I would say the issue is on the FreeBSD side, not in the network
hardware :)
There are 2 distinct behaviours/issues (both sketched just below):
- 1: when I disconnect the iSCSI drives from the server running the pool
(iscsictl -Ra), some iSCSI drives remain on the system, leaving ZFS
stalled;
- 2: when I disconnect the iSCSI drives from the target side (shut the NIC
down / shut down ctld), the server running the pool sometimes panics
(traces in my previous mail, 06/10).
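Concretely, this is how I trigger the two scenarios (the interface name
mlxen0 below is illustrative):
# Scenario 1, on the initiator (the server running the pool):
iscsictl -Ra          # remove all iSCSI sessions; some devices linger, ZFS stalls
# Scenario 2, on the target side:
ifconfig mlxen0 down  # drop the SAN link
# or:
service ctld stop     # stop the target daemon; the initiator sometimes panics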
> @ben
>
> can you post your iscsi network configuration, including ctld.conf and so
> on? is your iscsi setup using multipath, lacp, or is it just single-pathed?
Sure. I use single-pathed iSCSI.
### Target side:
# /etc/ctl.conf (all targets are equally configured):
target iqn.............:hm1 {
	portal-group pg0 au0
	alias G1207FHRDP2SThm
	lun 0 {
		path /dev/gpt/G1207FHRDP2SThm
		serial G1207FHRDP2SThm
	}
}
So each target has its GPT label, to clearly and quickly identify it
(location, serial).
This is a best practice, which I find very useful, taken from the
storage/ZFS books by Michael W. Lucas & Allan Jude.
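For the record, such a GPT label is set on the target with something like
this (partition index and disk device are illustrative):
# The partition then shows up as /dev/gpt/G1207FHRDP2SThm:
gpart modify -i 1 -l G1207FHRDP2SThm da5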
### Initiator side:
# /etc/iscsi.conf (all targets are equally configured):
hm1 {
	InitiatorName = iqn.............
	TargetAddress = 192.168.2.2
	TargetName    = iqn.............:hm1
	HeaderDigest  = CRC32C
	DataDigest    = CRC32C
}
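The sessions defined there are then brought up the standard way:
service iscsid start  # the initiator daemon must be running
iscsictl -Aa          # add a session for every target in /etc/iscsi.conf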
Then each disk is geom-labeled (glabel label), so that the previous example
disk appears on the initiator side as:
/dev/label/G1207FHRDP2SThm
Still the same naming best practice, which allows me to identify a disk
wherever it is, without mistake.
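The labeling itself is a one-liner (the initiator-side device name da5 is
illustrative):
# The disk then appears as /dev/label/G1207FHRDP2SThm:
glabel label G1207FHRDP2SThm /dev/da5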
ZFS thus uses the /dev/label/ paths:
	NAME                       STATE     READ WRITE CKSUM
	home                       ONLINE       0     0     0
	  mirror-0                 ONLINE       0     0     0
	    label/G1203_serial_hm  ONLINE       0     0     0
	    label/G1204_serial_hm  ONLINE       0     0     0
	    label/G2203_serial_hm  ONLINE       0     0     0
	    label/G2204_serial_hm  ONLINE       0     0     0
	  mirror-1                 ONLINE       0     0     0
	    label/G1205_serial_hm  ONLINE       0     0     0
	    label/G1206_serial_hm  ONLINE       0     0     0
	    label/G2205_serial_hm  ONLINE       0     0     0
	    label/G2206_serial_hm  ONLINE       0     0     0
	cache
	  label/G2200_serial_ch    ONLINE       0     0     0
	  label/G2201_serial_ch    ONLINE       0     0     0
(G2* are local disks, G1* are iSCSI disks)
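For reference, such a pool would have been created along these lines
(reconstructed from the layout above, not my exact command):
zpool create home \
    mirror label/G1203_serial_hm label/G1204_serial_hm \
           label/G2203_serial_hm label/G2204_serial_hm \
    mirror label/G1205_serial_hm label/G1206_serial_hm \
           label/G2205_serial_hm label/G2206_serial_hm \
    cache  label/G2200_serial_ch label/G2201_serial_ch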
> overall...
>
> ben is stacking up way too many layers, which prevents root-cause diagnostics.
>
> let's see, i am trying to describe what i see here (please correct me if
> this setup is different from my thinking). i think, to debug this mess,
> it's absolutely necessary to see all involved components:
>
> - physical machine(s), exporting single raw disks via iscsi to
> "frontends" (please provide exact configuration, software versions, and
> built-in hardware -> especially nic models, drivers, firmware)
>
> - frontend boxes, importing the raw iscsi disks for a zpool (again,
> exact hardware configuration, network configuration, driver / firmware
> versions and so on)
Exact same hardware and software on both sides:
- FreeBSD 11.0-RELEASE-p12
- SuperMicro motherboard
- ECC RAM
- NIC: Mellanox ConnectX-3 40G, fw 2.36.5000
- HBA: LSI SAS 2008 (9211-8i), fw 20.00.07.00-IT
- SAS-only disks (no SATA)
> - switch infrastructure (brand, model, firmware version, line speed,
> link aggregation in use? if yes, lacp or whatever is in use here?)
>
> - single switch or stacked setup?
>
> has anyone already checked the switch logs / error counts?
No, as the issue seems to come from FreeBSD (it is easily reproducible with
the 2 scenarios I described above).
> another thing which came to my mind: has zfs ever been designed to
> be used on top of iscsi block devices? my thoughts so far were that zfs
> loves native disks, without any layer in between (no volume manager, no
> partitions, no nothing). most ha setups i have seen so far were using
> rock-solid cross-over-cabled sas jbods with on-demand activated paths in
> case of failure. there's not that much that can cause voodoo in such
> setups, compared to iscsi ha failover scenarios with tons of
> possible problematic components in between.
We analyzed this in the previous thread:
https://lists.freebsd.org/pipermail/freebsd-fs/2016-July/023503.html
https://lists.freebsd.org/pipermail/freebsd-fs/2016-July/023527.html
>>> Did you try to reproduce the problem without iSCSI?
>
> i bet the problem won't occur anymore on native disks. which should NOT
> mean that zfs can't be used on iscsi devices, i am pretty sure it will
> work fine... as long as:
>
> - the iscsi target behaves well, which includes that no strange
> bugs start partying on your san network
> (...)
Andriy, who took many debug traces from my system, managed to reproduce the
first issue locally, using a 3-way ZFS mirror with one local disk plus two
iSCSI disks.
It sounds like there is a deadlock issue on the iSCSI initiator side (of
course, Andriy, feel free to correct me if I'm wrong).
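In other words, the first issue apparently boils down to something as
simple as this (device names illustrative, da1 and da2 being iSCSI-backed):
# 3-way mirror: one local disk, two iSCSI disks
zpool create testpool mirror /dev/da0 /dev/da1 /dev/da2
# generate some I/O on the pool, then drop the iSCSI sessions:
iscsictl -Ra   # some devices linger and the pool stalls (issue 1)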
Regarding the second issue, I'm not able to reproduce it when I don't use
geom labels.
There may thus be an issue on the geom-label side (which could then also
affect fully local ZFS pools using geom labels).
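That is, roughly (illustrative devices and labels), only the labeled
variant triggers the panic when the target side goes away:
zpool create t1 mirror /dev/da1 /dev/da2        # no panic observed
zpool create t2 mirror label/d1 label/d2        # sometimes panics (issue 2)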
Thank you again,
Ben