Concern: ZFS Mirror issues (12.STABLE and firmware 19 .v. 20)

Sat Apr 20 15:50:41 UTC 2019

Have you eliminated geli as a possible source?

I've just set up an old server which has an LSI 2008 running an old FW 
(11.0), so I was going to have a go at reproducing this.

Apart from the disconnect steps below, is there anything else needed, e.g. 
a read / write workload during the disconnect?

mps0: <Avago Technologies (LSI) SAS2008> port 0xe000-0xe0ff mem 0xfaf3c000-0xfaf3ffff,0xfaf40000-0xfaf7ffff irq 26 at device 0.0 on pci3
mps0: Firmware: 11.00.00.00, Driver: 21.02.00.00-fbsd
mps0: IOCCapabilities: 185c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,IR>

     Regards
     Steve

On 20/04/2019 15:39, Karl Denninger wrote:
> I can confirm that 20.00.07.00 does *not* stop this.
> The previous write/scrub on this device was on 20.00.07.00.  It was
> swapped back in from the vault yesterday, resilvered without incident,
> but a scrub says....
>
> root@NewFS:/home/karl # zpool status backup
>    pool: backup
>   state: DEGRADED
> status: One or more devices has experienced an unrecoverable error.  An
>          attempt was made to correct the error.  Applications are unaffected.
> action: Determine if the device needs to be replaced, and clear the errors
>          using 'zpool clear' or replace the device with 'zpool replace'.
>     see: http://illumos.org/msg/ZFS-8000-9P
>    scan: scrub repaired 188K in 0 days 09:40:18 with 0 errors on Sat Apr 20 08:45:09 2019
> config:
>
>          NAME                      STATE     READ WRITE CKSUM
>          backup                    DEGRADED     0     0     0
>            mirror-0                DEGRADED     0     0     0
>              gpt/backup61.eli      ONLINE       0     0     0
>              gpt/backup62-1.eli    ONLINE       0     0    47
>              13282812295755460479  OFFLINE      0     0     0  was /dev/gpt/backup62-2.eli
>
> errors: No known data errors
>
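
When repeating the swap cycle many times, it may help to pull the non-zero
CKSUM counters out of the status output automatically. The following is a
minimal sketch only: the function name cksum_errors is made up for
illustration, and it assumes the NAME/STATE/READ/WRITE/CKSUM column layout
shown in the output above, with each device on a single unwrapped line.

```shell
#!/bin/sh
# Print "<device> <cksum>" for every row of the config table whose CKSUM
# column is not "0". Reads `zpool status` text on stdin.
# Assumption: columns are NAME STATE READ WRITE CKSUM, as in the output above.
cksum_errors() {
    awk '
        # The header row marks the start of the config table.
        $1 == "NAME" && $5 == "CKSUM" { in_table = 1; next }
        # The table ends at the first blank line or at the "errors:" line.
        in_table && (NF == 0 || $1 == "errors:") { in_table = 0 }
        # Device rows carry at least five fields; report non-zero CKSUM.
        in_table && NF >= 5 && $5 != "0" { print $1, $5 }
    '
}

# Usage (pool name taken from the output above):
#   zpool status backup | cksum_errors
```

Fed the status output above, this should report only gpt/backup62-1.eli with
its count of 47.
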
> So this is firmware-invariant (at least between 19.00.00.00 and
> 20.00.07.00); the issue persists.
>
> Again, in my instance these devices are never removed "unsolicited" so
> there can't be (or at least shouldn't be) any unflushed data in the
> device or kernel cache.  The procedure is and remains:
>
> zpool offline .....
> geli detach .....
> camcontrol standby ...
>
> Wait a few seconds for the spindle to spin down.
>
> Remove disk.
>
> Then of course on the other side after insertion and the kernel has
> reported "finding" the device:
>
> geli attach ...
> zpool online ....
>
> Wait...
>
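
For anyone scripting a reproduction, the cycle quoted above can be sketched
roughly as below. This is a sketch under assumptions, not the exact commands
used: the arguments elided in the quote are left as parameters, the function
names, the GELI_ATTACH_ARGS and SPINDOWN_WAIT variables, and the 10 second
default are all made up for illustration, and it assumes the vdev name is the
geli provider name with the .eli suffix. By default it only prints the
commands (DRY_RUN=1).

```shell
#!/bin/sh
# Sketch of the offline/detach/standby and attach/online cycle.
# DRY_RUN=1 (default) prints each command instead of executing it.
DRY_RUN=${DRY_RUN:-1}
# Assumed stand-in for "wait a few seconds for the spindle to spin down".
SPINDOWN_WAIT=${SPINDOWN_WAIT:-10}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# remove_side POOL ELI_DEV DISK
#   ELI_DEV - vdev as named in zpool status (ends in .eli)
#   DISK    - underlying CAM device name (placeholder)
remove_side() {
    run zpool offline "$1" "$2" || return 1
    run geli detach "$2" || return 1
    run camcontrol standby "$3" || return 1
    # Only sleep when commands were really issued.
    [ "$DRY_RUN" = "1" ] || sleep "$SPINDOWN_WAIT"
}

# insert_side POOL ELI_DEV
#   geli attach takes the underlying provider, so the .eli suffix is stripped.
#   GELI_ATTACH_ARGS holds whatever key options apply (hypothetical variable).
insert_side() {
    # Word splitting of the options variable is intentional here.
    run geli attach ${GELI_ATTACH_ARGS:-} "${2%.eli}" || return 1
    run zpool online "$1" "$2" || return 1
}
```

With DRY_RUN=1, a call such as remove_side backup gpt/backup62-2.eli da3
(da3 being a placeholder device name) just echoes the three commands, which
makes it easy to check the sequence before running it for real.
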
> If this is a boogered TXG that's held in the metadata for the
> "offline"'d device (maybe "off by one"?), that's potentially bad: if
> there is an unknown failure in the other mirror component, the
> resilver will complete but data will have been irrevocably destroyed.
>
> Granted, this is a very low probability scenario (the area where the bad
> checksums are has to be where the corruption hits, and it has to happen
> between the resilver and access to that data.)  Those are long odds but
> nonetheless a window of "you're hosed" does appear to exist.
>