lsi
    Aurelien "beorn" ROUGEMONT 
    beorn at binaries.fr
       
    Fri Mar 22 09:06:22 UTC 2019
    
    
  
Hi list,
I have been using FreeBSD at home and in production for years, and today
I stumbled upon a question I could not answer.
Context
-----------------------------------------
I'm building a backup server on a machine with this HBA:
3:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 2208 [Thunderbolt] (rev 05)
    Subsystem: LSI Logic / Symbios Logic MegaRAID SAS 9271-8i
    Flags: bus master, fast devsel, latency 0, IRQ 34
    I/O ports at e000
    Memory at fb160000 (64-bit, non-prefetchable)
    Memory at fb100000 (64-bit, non-prefetchable)
    Expansion ROM at fb140000 [disabled]
    Capabilities: [50] Power Management version 3
    Capabilities: [68] Express Endpoint, MSI 00
    Capabilities: [d0] Vital Product Data
    Capabilities: [a8] MSI: Enable- Count=1/1 Maskable- 64bit+
    Capabilities: [c0] MSI-X: Enable+ Count=16 Masked-
    Capabilities: [100] Advanced Error Reporting
    Capabilities: [1e0] Secondary PCI Express <?>
    Capabilities: [1c0] Power Budgeting <?>
    Capabilities: [190] Dynamic Power Allocation <?>
    Capabilities: [148] Alternative Routing-ID Interpretation (ARI)
After I pushed the server's I/O to its limits, the server had a very nasty
crash.
This happens very seldom: in roughly 10 years, across the petabytes of
storage servers I have kept running, it has always been hardware or
driver/firmware related.
    Shortening read at 4292967280 from 16 to 15
    ZFS: i/o error - all block copies unavailable
    ZFS: can't read object set for dataset 52
    ZFS: can't open root filesystem
    gptzfsboot: failed to mount default pool zroot
After simply reinstalling the bootloaders (for nothing) and checking the
partition tables, I went digging deep into the FreeBSD codebase. I found
that it was a ZFS problem.
The nasty crash was indeed due to ZFS data corruption, hence the
checksum errors while scrubbing the zpool from a rescue network boot image:
      pool: zroot                                                                                                                                                                                                       
     state: ONLINE                                                                     
    status: One or more devices has experienced an unrecoverable error.  An            
            attempt was made to correct the error.  Applications are unaffected.       
    action: Determine if the device needs to be replaced, and clear the errors         
            using 'zpool clear' or replace the device with 'zpool replace'.            
       see: http://illumos.org/msg/ZFS-8000-9P                                         
      scan: scrub in progress since Fri Mar 15 15:15:25 2019                           
            49.6G scanned out of 1.65T at 109M/s, 4h15m to go                          
            677M repaired, 2.94% done                                                  
    config:                                                                            
            NAME              STATE     READ WRITE CKSUM                               
            zroot             ONLINE       0     0     0                               
              raidz2-0        ONLINE       0     0     0                               
                mfisyspd0p3   ONLINE       0     0 5.44K  (repairing)                  
                mfisyspd1p3   ONLINE       0     0 4.76K  (repairing)                  
                mfisyspd10p3  ONLINE       0     0 4.35K  (repairing)                  
                mfisyspd11p3  ONLINE       0     0 5.17K  (repairing)                  
                mfisyspd2p3   ONLINE       0     0 4.76K  (repairing)                  
                mfisyspd3p3   ONLINE       0     0 4.24K  (repairing)                  
                mfisyspd4p3   ONLINE       0     0 4.75K  (repairing)                  
                mfisyspd5p3   ONLINE       0     0 5.20K  (repairing)                  
                mfisyspd6p3   ONLINE       0     0 4.51K  (repairing)                  
                mfisyspd7p3   ONLINE       0     0 4.65K  (repairing)                  
                mfisyspd8p3   ONLINE       0     0 4.70K  (repairing)                  
                mfisyspd9p3   ONLINE       0     0 3.81K  (repairing)  
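Every member of the raidz2 vdev reports checksum errors in the output
above. To pull just the affected devices out of a long status listing, a
small filter along these lines can help. This is only a sketch: the
function name cksum_errors is made up for illustration, and it assumes
the column layout shown above (NAME STATE READ WRITE CKSUM).

```shell
# Hypothetical helper: read "zpool status" text on stdin and print
# "<device> <cksum>" for every vdev line whose CKSUM column is not 0.
cksum_errors() {
    awk 'NF >= 5 && ($2 == "ONLINE" || $2 == "DEGRADED" || $2 == "FAULTED") {
             # Field 5 is the CKSUM column in the layout shown above.
             if ($5 != "0") print $1, $5
         }'
}
```

Usage would be something like `zpool status zroot | cksum_errors`.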
At this point the server was still unable to reboot. I had to force
a re-copy of the data with a dumb:
    mv /boot{,.dist}
    cp -pr /boot{.dist,}
Which turned out to be fine.
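The same re-copy can be written without brace expansion, which is handy
in shells that lack it (FreeBSD's /bin/sh, for example). A minimal
sketch follows. The helper name recopy_dir is hypothetical, and it uses
cp -pR where the commands above use -pr.

```shell
# Hypothetical helper: move a directory aside and copy it back, so that
# every file in it is written out again as new blocks.
recopy_dir() {
    dir=$1
    # Keep the original as <dir>.dist, like the mv command above.
    mv "$dir" "$dir.dist" || return 1
    # Copy it back under the old name; -p preserves modes and timestamps.
    cp -pR "$dir.dist" "$dir"
}
```

Called as `recopy_dir /boot`, it leaves the old tree in /boot.dist until
you remove it by hand.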
Going further, I finally killed the zpool for good. It took me some time,
and I stumbled upon the mfi(4) and mrsas(4) man pages and code.
     The mfi driver supports the following hardware:
     o   LSI MegaRAID SAS 1078
     o   LSI MegaRAID SAS 8408E
     o   LSI MegaRAID SAS 8480E
     o   LSI MegaRAID SAS 9240
     o   LSI MegaRAID SAS 9260
     o   Dell PERC5
     o   Dell PERC6
     o   IBM ServeRAID M1015 SAS/SATA
     o   IBM ServeRAID M1115 SAS/SATA
     o   IBM ServeRAID M5015 SAS/SATA
     o   IBM ServeRAID M5110 SAS/SATA
     o   IBM ServeRAID-MR10i
     o   Intel RAID Controller SRCSAS18E
     o   Intel RAID Controller SROMBSAS18E
     The mrsas driver supports the following hardware:
     [ Thunderbolt 6Gb/s MR controller ]
     o   LSI MegaRAID SAS 9265
     o   LSI MegaRAID SAS 9266
     o   LSI MegaRAID SAS 9267
     o   LSI MegaRAID SAS 9270
     o   LSI MegaRAID SAS 9271
     o   LSI MegaRAID SAS 9272
     o   LSI MegaRAID SAS 9285
     o   LSI MegaRAID SAS 9286
     o   DELL PERC H810
     o   DELL PERC H710/P
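There was a detection priority problem: the pool members above are
mfisyspd* devices, so mfi(4) had attached to the 9271 even though that
card is listed under mrsas(4). The tunable that gives mrsas(4) preference: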
    hw.mfi.mrsas_enable=1
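For anyone who wants to try this, here is a sketch of where the setting
would go. As far as I recall, mrsas(4) describes putting it in
/boot/device.hints. Treat the file location and the quoting as
assumptions and check the man page for your release.

```
# /boot/device.hints (assumed location, see mrsas(4))
# Let mrsas(4) attach to the Thunderbolt controllers that mfi(4)
# would otherwise claim.
hw.mfi.mrsas_enable="1"
```

After a reboot the disks should no longer show up as mfisyspd* but as
CAM devices, so anything that refers to the old device names (fstab
entries, for instance) probably needs a second look.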
    
    