Concern: ZFS Mirror issues (12.STABLE and firmware 19 .v. 20)

Sat Apr 20 20:56:49 UTC 2019

Thanks for extra info, the next question would be have you eliminated 
that corruption exists before the disk is removed?

Would be interesting to add a zpool scrub to confirm this isn't the case 
before the disk removal is attempted.

     Regards
     Steve

On 20/04/2019 18:35, Karl Denninger wrote:
>
> On 4/20/2019 10:50, Steven Hartland wrote:
>> Have you eliminated geli as possible source?
> No; I could conceivably do so by re-creating another backup volume set 
> without geli-encrypting the drives, but I do not have an extra set of 
> drives of the capacity required laying around to do that. I would have 
> to do it with lower-capacity disks, which I can attempt if you think 
> it would help.  I *do* have open slots in the drive backplane to set 
> up a second "test" unit of this sort.  For reasons below it will take 
> at least a couple of weeks to get good data on whether the problem 
> exists without geli, however.
>>
>> I've just setup an old server which has a LSI 2008 running and old FW 
>> (11.0) so was going to have a go at reproducing this.
>>
>> Apart from the disconnect steps below is there anything else needed 
>> e.g. read / write workload during disconnect?
>
> Yes.  An attempt to recreate this on my sandbox machine using smaller 
> disks (WD RE-320s) and a decent amount of read/write activity (tens to 
> ~100 gigabytes) on a root mirror of three disks with one taken offline 
> did not succeed.  It *reliably* appears, however, on my backup volumes 
> with every drive swap. The sandbox machine is physically identical 
> other than the physical disks; both are Xeons with ECC RAM in them.
>
> The only operational difference is that the backup volume sets have a 
> *lot* of data written to them via zfs send|zfs recv over the 
> intervening period where with "ordinary" activity from I/O (which was 
> the case on my sandbox) the I/O pattern is materially different.  The 
> root pool on the sandbox where I tried to reproduce it synthetically 
> *is* using geli (in fact it boots native-encrypted.)
>
> The "ordinary" resilver on a disk swap typically covers ~2-3Tb and is 
> a ~6-8 hour process.
>
> The usual process for the backup pool looks like this:
>
> Have 2 of the 3 physical disks mounted; the third is in the bank vault.
>
> Over the space of a week, the backup script is run daily.  It first 
> imports the pool and then for each zfs filesystem it is backing up 
> (which is not all of them; I have a few volatile ones that I don't 
> care if I lose, such as object directories for builds and such, plus 
> some that are R/O data sets that are backed up separately) it does:
>
> If there is no "... at zfs-base": zfs snapshot -r ... at zfs-base; zfs send 
> -R ... at zfs-base | zfs receive -Fuvd $BACKUP
>
> else
>
> zfs rename -r ... at zfs-base ... at zfs-old
> zfs snapshot -r ... at zfs-base
>
> zfs send -RI ... at zfs-old ... at zfs-base |zfs recv -Fudv $BACKUP
>
> .... if ok then zfs destroy -vr ... at zfs-old otherwise print a 
> complaint and stop.
>
> When all are complete it then does a "zpool export backup" to detach 
> the pool in order to reduce the risk of "stupid root user" (me) accidents.
>
> In short I send an incremental of the changes since the last backup, 
> which in many cases includes a bunch of automatic snapshots that are 
> taken on frequent basis out of the cron. Typically there are a week's 
> worth of these that accumulate between swaps of the disk to the vault, 
> and the offline'd disk remains that way for a week.  I also wait for 
> the zpool destroy on each of the targets to drain before continuing, 
> as not doing so back in the 9 and 10.x days was a good way to 
> stimulate an instant panic on re-import the next day due to kernel 
> stack page exhaustion if the previous operation destroyed hundreds of 
> gigabytes of snapshots (which does routinely happen as part of the 
> backed up data is Macrium images from PCs, so when a new month comes 
> around the PC's backup routine removes a huge amount of old data from 
> the filesystem.)
>
> Trying to simulate the checksum errors in a few hours' time thus far 
> has failed.  But every time I swap the disks on a weekly basis I get a 
> handful of checksum errors on the scrub. If I export and re-import the 
> backup mirror after that the counters are zeroed -- the checksum error 
> count does *not* remain across an export/import cycle although the 
> "scrub repaired" line remains.
>
> For example after the scrub completed this morning I exported the pool 
> (the script expects the pool exported before it begins) and ran the 
> backup.  When it was complete:
>
> root at NewFS:~/backup-zfs # zpool status backup
>   pool: backup
>  state: DEGRADED
> status: One or more devices has been taken offline by the administrator.
>         Sufficient replicas exist for the pool to continue functioning 
> in a
>         degraded state.
> action: Online the device using 'zpool online' or replace the device with
>         'zpool replace'.
>   scan: scrub repaired 188K in 0 days 09:40:18 with 0 errors on Sat 
> Apr 20 08:45:09 2019
> config:
>
>         NAME                      STATE     READ WRITE CKSUM
>         backup                    DEGRADED     0 0     0
>           mirror-0                DEGRADED     0 0     0
>             gpt/backup61.eli      ONLINE       0 0     0
>             gpt/backup62-1.eli    ONLINE       0 0     0
>             13282812295755460479  OFFLINE      0 0     0  was 
> /dev/gpt/backup62-2.eli
>
> errors: No known data errors
>
> It knows it fixed the checksums but the error count is zero -- I did 
> NOT "zpool clear".
>
> This may have been present in 11.2; I didn't run that long enough in 
> this environment to know.  It definitely was *not* present in 11.1 and 
> before; the same data structure and script for backups has been in use 
> for a very long time without any changes and this first appeared when 
> I upgraded from 11.1 to 12.0 on this specific machine, with the exact 
> same physical disks being used for over a year (they're currently 6Tb 
> units; the last change out for those was ~1.5 years ago when I went 
> from 4Tb to 6Tb volumes.)  I have both HGST-NAS and He-Enterprise 
> disks in the rotation and both show identical behavior so it doesn't 
> appear to be related to a firmware problem in one disk .vs. the other 
> (e.g. firmware that fails to flush the on-drive cache before going to 
> standby even though it was told to.)
>
>>
>> mps0: <Avago Technologies (LSI) SAS2008> port 0xe000-0xe0ff mem 
>> 0xfaf3c000-0xfaf3ffff,0xfaf40000-0xfaf7ffff irq 26 at device 0.0 on pci3
>> mps0: Firmware: 11.00.00.00, Driver: 21.02.00.00-fbsd
>> mps0: IOCCapabilities: 
>> 185c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,IR>
>>
>>     Regards
>>     Steve
>>
>> On 20/04/2019 15:39, Karl Denninger wrote:
>>> I can confirm that 20.00.07.00 does *not* stop this.
>>> The previous write/scrub on this device was on 20.00.07.00. It was
>>> swapped back in from the vault yesterday, resilvered without incident,
>>> but a scrub says....
>>>
>>> root at NewFS:/home/karl # zpool status backup
>>>    pool: backup
>>>   state: DEGRADED
>>> status: One or more devices has experienced an unrecoverable error.  An
>>>          attempt was made to correct the error.  Applications are 
>>> unaffected.
>>> action: Determine if the device needs to be replaced, and clear the 
>>> errors
>>>          using 'zpool clear' or replace the device with 'zpool 
>>> replace'.
>>>     see: http://illumos.org/msg/ZFS-8000-9P
>>>    scan: scrub repaired 188K in 0 days 09:40:18 with 0 errors on Sat 
>>> Apr
>>> 20 08:45:09 2019
>>> config:
>>>
>>>          NAME                      STATE     READ WRITE CKSUM
>>>          backup                    DEGRADED     0     0     0
>>>            mirror-0                DEGRADED     0     0     0
>>>              gpt/backup61.eli      ONLINE       0     0     0
>>>              gpt/backup62-1.eli    ONLINE       0     0    47
>>>              13282812295755460479  OFFLINE      0     0     0 was
>>> /dev/gpt/backup62-2.eli
>>>
>>> errors: No known data errors
>>>
>>> So this is firmware-invariant (at least between 19.00.00.00 and
>>> 20.00.07.00); the issue persists.
>>>
>>> Again, in my instance these devices are never removed "unsolicited" so
>>> there can't be (or at least shouldn't be able to) unflushed data in the
>>> device or kernel cache.  The procedure is and remains:
>>>
>>> zpool offline .....
>>> geli detach .....
>>> camcontrol standby ...
>>>
>>> Wait a few seconds for the spindle to spin down.
>>>
>>> Remove disk.
>>>
>>> Then of course on the other side after insertion and the kernel has
>>> reported "finding" the device:
>>>
>>> geli attach ...
>>> zpool online ....
>>>
>>> Wait...
>>>
>>> If this is a boogered TXG that's held in the metadata for the
>>> "offline"'d device (maybe "off by one"?) that's potentially bad in that
>>> if there is an unknown failure in the other mirror component the
>>> resilver will complete but data has been irrevocably destroyed.
>>>
>>> Granted, this is a very low probability scenario (the area where the 
>>> bad
>>> checksums are has to be where the corruption hits, and it has to happen
>>> between the resilver and access to that data.)  Those are long odds but
>>> nonetheless a window of "you're hosed" does appear to exist.
>>>
>>
> -- 
> Karl Denninger
> karl at denninger.net <mailto:karl at denninger.net>
> /The Market Ticker/
> /[S/MIME encrypted email preferred]/