Concern: ZFS Mirror issues (12.STABLE and firmware 19 .v. 20)
killing at multiplay.co.uk
Sat Apr 20 15:50:41 UTC 2019
Have you eliminated geli as a possible source?
I've just set up an old server which has an LSI 2008 running an old FW
(11.0), so I was going to have a go at reproducing this.
Apart from the disconnect steps below, is there anything else needed, e.g.
a read / write workload during the disconnect?
mps0: <Avago Technologies (LSI) SAS2008> port 0xe000-0xe0ff mem
0xfaf3c000-0xfaf3ffff,0xfaf40000-0xfaf7ffff irq 26 at device 0.0 on pci3
mps0: Firmware: 11.00.00.00, Driver: 21.02.00.00-fbsd
On 20/04/2019 15:39, Karl Denninger wrote:
> I can confirm that 20.00.07.00 does *not* stop this.
> The previous write/scrub on this device was on 20.00.07.00. It was
> swapped back in from the vault yesterday, resilvered without incident,
> but a scrub says....
> root@NewFS:/home/karl # zpool status backup
> pool: backup
> state: DEGRADED
> status: One or more devices has experienced an unrecoverable error. An
> attempt was made to correct the error. Applications are unaffected.
> action: Determine if the device needs to be replaced, and clear the errors
> using 'zpool clear' or replace the device with 'zpool replace'.
> see: http://illumos.org/msg/ZFS-8000-9P
> scan: scrub repaired 188K in 0 days 09:40:18 with 0 errors on Sat Apr
> 20 08:45:09 2019
> NAME                      STATE     READ WRITE CKSUM
> backup                    DEGRADED     0     0     0
>   mirror-0                DEGRADED     0     0     0
>     gpt/backup61.eli      ONLINE       0     0     0
>     gpt/backup62-1.eli    ONLINE       0     0    47
>     13282812295755460479  OFFLINE      0     0     0  was
> errors: No known data errors
> So this is firmware-invariant (at least between 19.00.00.00 and
> 20.00.07.00); the issue persists.
> Again, in my instance these devices are never removed "unsolicited", so
> there can't be (or at least shouldn't be able to be) unflushed data in
> the device or kernel cache. The procedure is and remains:
> zpool offline .....
> geli detach .....
> camcontrol standby ...
> Wait a few seconds for the spindle to spin down.
> Remove disk.
> Then of course on the other side after insertion and the kernel has
> reported "finding" the device:
> geli attach ...
> zpool online ....
> If this is a boogered TXG that's held in the metadata for the
> "offline"'d device (maybe "off by one"?) that's potentially bad in that
> if there is an unknown failure in the other mirror component the
> resilver will complete but data has been irrevocably destroyed.
> Granted, this is a very low probability scenario (the area where the bad
> checksums are has to be where the corruption hits, and it has to happen
> between the resilver and access to that data.) Those are long odds but
> nonetheless a window of "you're hosed" does appear to exist.
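
For anyone else trying to reproduce this, the quoted swap-out / swap-in
steps could be scripted roughly as below. This is only a sketch: the pool
and provider names come from the zpool output above, but the raw disk name
(da5), the 10 second spin-down wait, and the helper names run, swap_out and
swap_in are my assumptions, and geli attach will need whatever key options
match how the provider was initialised. By default it only prints the
commands (DRY_RUN=1).

```shell
#!/bin/sh
# Sketch of the offline/detach/standby and attach/online cycle from the
# quoted procedure. Assumptions, not from the thread: disk name da5,
# a 10 second spin-down wait, and geli attach needing no extra options.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# swap_out POOL PROVIDER DISK
# e.g. swap_out backup gpt/backup62-1 da5
swap_out() {
    run zpool offline "$1" "$2.eli"
    run geli detach "$2.eli"
    run camcontrol standby "$3"
    run sleep 10    # let the spindle spin down before removing the disk
}

# swap_in POOL PROVIDER
# Run this after the kernel has reported finding the re-inserted device.
swap_in() {
    run geli attach "$2"
    run zpool online "$1" "$2.eli"
}
```

Sourcing the file and running "swap_out backup gpt/backup62-1 da5" prints
the four commands in order; set DRY_RUN=0 to actually execute them.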