Concern: ZFS Mirror issues (12.STABLE and firmware 19 vs. 20)

Karl Denninger karl at denninger.net
Sat Apr 20 14:39:29 UTC 2019


On 4/13/2019 06:00, Karl Denninger wrote:
> On 4/11/2019 13:57, Karl Denninger wrote:
>> On 4/11/2019 13:52, Zaphod Beeblebrox wrote:
>>> On Wed, Apr 10, 2019 at 10:41 AM Karl Denninger <karl at denninger.net> wrote:
>>>
>>>
>>>> In this specific case the adapter in question is...
>>>>
>>>> mps0: <Avago Technologies (LSI) SAS2116> port 0xc000-0xc0ff mem
>>>> 0xfbb3c000-0xfbb3ffff,0xfbb40000-0xfbb7ffff irq 30 at device 0.0 on pci3
>>>> mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd
>>>> mps0: IOCCapabilities:
>>>> 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
>>>>
>>>> Which is indeed a "dumb" HBA (in IT mode), and Zaphod says he connects
>>>> his drives via dumb on-MoBo direct SATA connections.
>>>>
>>> Maybe I'm in good company.  My current setup has 8 of the disks connected
>>> to:
>>>
>>> mps0: <Avago Technologies (LSI) SAS2308> port 0xb000-0xb0ff mem
>>> 0xfe240000-0xfe24ffff,0xfe200000-0xfe23ffff irq 32 at device 0.0 on pci6
>>> mps0: Firmware: 19.00.00.00, Driver: 21.02.00.00-fbsd
>>> mps0: IOCCapabilities:
>>> 5a85c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,MSIXIndex,HostDisc>
>>>
>>> ... just with a cable that breaks out each of the 2 connectors into 4
>>> SATA-style connectors, and the other 8 disks (plus boot disks and SSD
>>> cache/log) connected to ports on...
>>>
>>> - ahci0: <ASMedia ASM1062 AHCI SATA controller> port
>>> 0xd050-0xd057,0xd040-0xd043,0xd030-0xd037,0xd020-0xd023,0xd000-0xd01f mem
>>> 0xfe900000-0xfe9001ff irq 44 at device 0.0 on pci2
>>> - ahci2: <Marvell 88SE9230 AHCI SATA controller> port
>>> 0xa050-0xa057,0xa040-0xa043,0xa030-0xa037,0xa020-0xa023,0xa000-0xa01f mem
>>> 0xfe610000-0xfe6107ff irq 40 at device 0.0 on pci7
>>> - ahci3: <AMD SB7x0/SB8x0/SB9x0 AHCI SATA controller> port
>>> 0xf040-0xf047,0xf030-0xf033,0xf020-0xf027,0xf010-0xf013,0xf000-0xf00f mem
>>> 0xfea07000-0xfea073ff irq 19 at device 17.0 on pci0
>>>
>>> ... each drive connected to a single port.
>>>
>>> I can actually reproduce this at will.  Because I have 16 drives, when one
>>> fails, I need to find it.  I pull the SATA cable for a drive and determine
>>> if it's the drive in question; if not, I reconnect it, "ONLINE" it and wait
>>> for the resilver to stop... usually only a minute or two.
>>>
>>> ... if I do this 4 to 6-odd times to find a drive (I can tell, in general,
>>> whether a drive is on the SAS controller or the SATA controllers... so I'm
>>> only ever looking among 8) ... then I "REPLACE" the problem drive.
>>> More often than not, a scrub will then find a few problems.  In fact, it
>>> appears that the most recent scrub is an example:
>>>
>>> [1:7:306]dgilbert at vr:~> zpool status
>>>   pool: vr1
>>>  state: ONLINE
>>>   scan: scrub repaired 32K in 47h16m with 0 errors on Mon Apr  1 23:12:03 2019
>>> config:
>>>
>>>         NAME            STATE     READ WRITE CKSUM
>>>         vr1             ONLINE       0     0     0
>>>           raidz2-0      ONLINE       0     0     0
>>>             gpt/v1-d0   ONLINE       0     0     0
>>>             gpt/v1-d1   ONLINE       0     0     0
>>>             gpt/v1-d2   ONLINE       0     0     0
>>>             gpt/v1-d3   ONLINE       0     0     0
>>>             gpt/v1-d4   ONLINE       0     0     0
>>>             gpt/v1-d5   ONLINE       0     0     0
>>>             gpt/v1-d6   ONLINE       0     0     0
>>>             gpt/v1-d7   ONLINE       0     0     0
>>>           raidz2-2      ONLINE       0     0     0
>>>             gpt/v1-e0c  ONLINE       0     0     0
>>>             gpt/v1-e1b  ONLINE       0     0     0
>>>             gpt/v1-e2b  ONLINE       0     0     0
>>>             gpt/v1-e3b  ONLINE       0     0     0
>>>             gpt/v1-e4b  ONLINE       0     0     0
>>>             gpt/v1-e5a  ONLINE       0     0     0
>>>             gpt/v1-e6a  ONLINE       0     0     0
>>>             gpt/v1-e7c  ONLINE       0     0     0
>>>         logs
>>>           gpt/vr1log    ONLINE       0     0     0
>>>         cache
>>>           gpt/vr1cache  ONLINE       0     0     0
>>>
>>> errors: No known data errors
>>>
>>> ... it doesn't say it now, but there were 5 CKSUM errors on one of the
>>> drives that I had trial-removed (and not on the one replaced).
>> That is EXACTLY what I'm seeing; the "OFFLINE'd" drive is the one that,
>> after a scrub, comes up with the checksum errors.  It does *not* flag
>> any errors during the resilver and the drives *not* taken offline do not
>> (ever) show checksum errors either.
>>
>> Interestingly enough you have 19.00.00.00 firmware on your card as well
>> -- which is what was on mine.
>>
>> I have flashed my card forward to 20.00.07.00 -- we'll see if it still
>> does it when I do the next swap of the backup set.
> Verrrrrryyyyy interesting.
>
> This drive was last written/read under 19.00.00.00.  Yesterday I swapped
> it back in.  Note that right now I am running:
>
> mps0: <Avago Technologies (LSI) SAS2116> port 0xc000-0xc0ff mem
> 0xfbb3c000-0xfbb3ffff,0xfbb40000-0xfbb7ffff irq 30 at device 0.0 on pci3
> mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd
> mps0: IOCCapabilities:
> 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
>
> And, after the scrub completed overnight....
>
> [karl at NewFS ~]$ zpool status backup
>   pool: backup
>  state: DEGRADED
> status: One or more devices has experienced an unrecoverable error.  An
>         attempt was made to correct the error.  Applications are unaffected.
> action: Determine if the device needs to be replaced, and clear the errors
>         using 'zpool clear' or replace the device with 'zpool replace'.
>    see: http://illumos.org/msg/ZFS-8000-9P
>   scan: scrub repaired 4K in 0 days 06:30:55 with 0 errors on Sat Apr 13 01:42:04 2019
> config:
>
>         NAME                     STATE     READ WRITE CKSUM
>         backup                   DEGRADED     0     0     0
>           mirror-0               DEGRADED     0     0     0
>             gpt/backup61.eli     ONLINE       0     0     0
>             2650799076683778414  OFFLINE      0     0     0  was /dev/gpt/backup62-1.eli
>             gpt/backup62-2.eli   ONLINE       0     0     1
>
> errors: No known data errors
>
> The OTHER interesting data point is that the resilver *also* posted one
> checksum error, which I cleared before doing the scrub -- both on the
> 62-2 device.
>
> That would be one block in both cases.  The expected pattern under
> 19.00.00.00 was several (maybe a half-dozen) checksum errors during the
> scrub but *zero* during the resilver.
>
> The unit which was put *into* the vault and is now offline was written
> and scrubbed under 20.00.07.00.  The behavior change certainly implies
> there are some differences.  Again, none of these OFFLINE-state
> situations are uncontrolled -- in each case the drive is taken offline
> intentionally, the geli provider is detached, and then "camcontrol
> standby" is executed against the unit before it is yanked, so in theory
> at least there should be no way for an unflushed but write-cached block
> to be lost or damaged.
>
> I smell a rat but it may well be in the 19.00.00.00 firmware on the card...

I can confirm that 20.00.07.00 does *not* stop this.

The previous write/scrub on this device was done under 20.00.07.00.  It
was swapped back in from the vault yesterday and resilvered without
incident, but a scrub says....

root at NewFS:/home/karl # zpool status backup
  pool: backup
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 188K in 0 days 09:40:18 with 0 errors on Sat Apr 20 08:45:09 2019
config:

        NAME                      STATE     READ WRITE CKSUM
        backup                    DEGRADED     0     0     0
          mirror-0                DEGRADED     0     0     0
            gpt/backup61.eli      ONLINE       0     0     0
            gpt/backup62-1.eli    ONLINE       0     0    47
            13282812295755460479  OFFLINE      0     0     0  was /dev/gpt/backup62-2.eli

errors: No known data errors

So this is firmware-invariant (at least between 19.00.00.00 and
20.00.07.00); the issue persists.
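
For reference, the check after each swap is nothing more exotic than
clearing the error counters and re-scrubbing -- roughly the following,
using the pool name from the status output above:

zpool clear backup
zpool scrub backup
zpool status backup   # the CKSUM column is where the errors show up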

Again, in my instance these devices are never removed "unsolicited", so
there can't be (or at least shouldn't be) unflushed data in the device
or kernel cache.  The procedure is and remains as follows (a more
concrete sketch appears after the steps):

zpool offline .....
geli detach .....
camcontrol standby ...

Wait a few seconds for the spindle to spin down.

Remove disk.

Then, on the other side, after insertion and once the kernel has
reported "finding" the device:

geli attach ...
zpool online ....

Wait...
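
Spelled out a bit more concretely -- the device node and GPT label below
are examples only, and the geli keyfile/passphrase arguments are omitted
-- the whole cycle looks roughly like this:

zpool offline backup gpt/backup62-2.eli
geli detach gpt/backup62-2.eli
camcontrol standby da4        # example device node; let it spin down, then pull

# ...swap disks; wait for the kernel to report the new unit...

geli attach gpt/backup62-2    # re-creates /dev/gpt/backup62-2.eli
zpool online backup gpt/backup62-2.eli
zpool status backup           # watch the resilver run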

If this is a boogered TXG held in the metadata for the "offline"'d
device (maybe "off by one"?), that's potentially bad: if there is an
unknown failure in the other mirror component, the resilver will
complete but data will have been irrevocably destroyed.

Granted, this is a very low-probability scenario (the corruption has to
hit the very area where the bad checksums are, and it has to happen
between the resilver and the next access to that data).  Those are long
odds, but a window of "you're hosed" nonetheless does appear to exist.
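
One way to poke at that TXG theory -- just a diagnostic idea, not
something I've tried yet -- would be to dump the vdev labels on both
mirror halves right before the pull and again after re-attach, and
compare the txg values recorded there:

zdb -l /dev/gpt/backup61.eli
zdb -l /dev/gpt/backup62-2.eli   # compare the "txg:" fields across labels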

-- 
Karl Denninger
karl at denninger.net
/The Market Ticker/
/[S/MIME encrypted email preferred]/