RELENG_8 / mpt / zpool Errors
Tim Gustafson
tjg at soe.ucsc.edu
Tue Sep 6 23:23:09 UTC 2011
Hi all,
I'm running RELENG_8:
----------
root@bsd-03: uname -a
FreeBSD bsd-03 8.2-STABLE FreeBSD 8.2-STABLE #0: Mon Aug 22 14:58:58 PDT 2011 root@bsd-03:/usr/obj/usr/src/sys/GENERIC amd64
----------
We've got an MPT controller installed with 32 drives attached:
----------
root@bsd-03: dmesg | grep mpt
mpt0: <LSILogic SAS/SATA Adapter> port 0xec00-0xecff mem 0xef3fc000-0xef3fffff,0xef3e0000-0xef3effff irq 32 at device 0.0 on pci3
mpt0: [ITHREAD]
mpt0: MPI Version=1.5.19.0
ses0 at mpt0 bus 0 scbus1 target 32 lun 0
ses1 at mpt0 bus 0 scbus1 target 33 lun 0
da5 at mpt0 bus 0 scbus1 target 0 lun 0
.....SNIP.....
da36 at mpt0 bus 0 scbus1 target 31 lun 0
----------
We have a zpool on those drives, configured as one large ZFS file system:
----------
root@bsd-03: zpool status
  pool: jails
 state: ONLINE
  scan: resilvered 5.51M in 0h12m with 0 errors on Tue Sep 6 15:10:23 2011
config:

        NAME        STATE     READ WRITE CKSUM
        jails       ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            da5     ONLINE       0     0     0
            da6     ONLINE       0     0     0
            da7     ONLINE       0     0     0
            da8     ONLINE       0     0     0
            da9     ONLINE       0     0     0
            da10    ONLINE       0     0     0
            da11    ONLINE       0     0     0
            da12    ONLINE       0     0     0
          raidz1-1  ONLINE       0     0     0
            da13    ONLINE       0     0     0
            da14    ONLINE       0     0     0
            da15    ONLINE       0     0     0
            da16    ONLINE       0     0     0
            da17    ONLINE       0     0     0
            da18    ONLINE       0     0     0
            da19    ONLINE       0     0     0
            da20    ONLINE       0     0     0
          raidz1-2  ONLINE       0     0     0
            da21    ONLINE       0     0     0
            da22    ONLINE       0     0     0
            da23    ONLINE       0     0     0
            da24    ONLINE       0     0     0
            da25    ONLINE       0     0     0
            da26    ONLINE       0     0     0
            da27    ONLINE       0     0     0
            da28    ONLINE       0     0     0
          raidz1-3  ONLINE       0     0     0
            da29    ONLINE       0     0     0
            da30    ONLINE       0     0     0
            da31    ONLINE       0     0     0
            da32    ONLINE       0     0     0
            da33    ONLINE       0     0     0
            da34    ONLINE       0     0     0
            da35    ONLINE       0     0     0
            da36    ONLINE       0     0     0

errors: No known data errors
----------
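With 32 member disks, scanning "zpool status" output by eye is easy to get wrong. The following is a minimal Python sketch, not part of ZFS or FreeBSD, that lists rows whose state is not ONLINE or whose READ/WRITE/CKSUM counters are non-zero. The names flag_devices and KNOWN_STATES are made up for illustration, and the set of states is an assumption about what the STATE column may contain.

```python
# Sketch only: flag_devices and KNOWN_STATES are illustrative names,
# not part of ZFS. The list of states is an assumption.
KNOWN_STATES = {"ONLINE", "DEGRADED", "FAULTED", "OFFLINE", "UNAVAIL", "REMOVED"}


def flag_devices(status_text):
    """Return (name, state, read, write, cksum) for rows that look unhealthy."""
    flagged = []
    for line in status_text.splitlines():
        fields = line.split()
        # Device rows look like: NAME STATE READ WRITE CKSUM [optional note].
        # Header and "pool:", "state:", "scan:" lines fail this check.
        if len(fields) < 5 or fields[1] not in KNOWN_STATES:
            continue
        name, state, read, write, cksum = fields[:5]
        # Counters are compared as strings because zpool may abbreviate
        # large values, so int() would not always work.
        if state != "ONLINE" or (read, write, cksum) != ("0", "0", "0"):
            flagged.append((name, state, read, write, cksum))
    return flagged


# Excerpt of the output quoted above (all counters are zero here).
SAMPLE = """\
NAME STATE READ WRITE CKSUM
jails ONLINE 0 0 0
raidz1-0 ONLINE 0 0 0
da5 ONLINE 0 0 0
da6 ONLINE 0 0 0
"""

if __name__ == "__main__":
    for row in flag_devices(SAMPLE):
        print(*row)
```

Feeding it the saved output of "zpool status" right after an incident, before any scrub or clear, would show which raidz1 groups were affected.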
We're seeing some occasional oddness. About every two weeks, the controller seems to temporarily lose connectivity with the drives, and the zpool goes a bit bonkers and reports a dozen or so corrupted files. A "zpool scrub" then goes through and reports that everything has been fixed, and everything seems OK again (I have not yet 100% confirmed that there is no file corruption, but I'm giving ZFS's checksumming logic the benefit of the doubt here).
When we have problems, they tend to be accompanied by the following in dmesg:
----------
(da20:mpt0:0:15:0): READ(10). CDB: 28 0 90 b0 6b dd 0 0 9 0
(da20:mpt0:0:15:0): CAM status: SCSI Status Error
(da20:mpt0:0:15:0): SCSI status: Check Condition
(da20:mpt0:0:15:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da17:mpt0:0:12:0): READ(10). CDB: 28 0 90 b0 6c e 0 0 2 0
(da17:mpt0:0:12:0): CAM status: SCSI Status Error
(da17:mpt0:0:12:0): SCSI status: Check Condition
(da17:mpt0:0:12:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
mpt0: request 0xffffff800080b520:10990 timed out for ccb 0xffffff013227b000 (req->ccb 0xffffff013227b000)
mpt0: attempting to abort req 0xffffff800080b520:10990 function 0
mpt0: mpt_wait_req(1) timed out
mpt0: mpt_recover_commands: abort timed-out. Resetting controller
mpt0: mpt_cam_event: 0x0
mpt0: mpt_cam_event: 0x0
mpt0: completing timedout/aborted req 0xffffff800080b520:10990
mpt0: mpt_cam_event: 0x1b
mpt0: mpt_cam_event: 0x1b
mpt0: SAS discovery error: Port: 0x00 Status: 0x00004002
mpt0: SAS discovery error: Port: 0x00 Status: 0x00000010
mpt0: request 0xffffff8000811310:54341 timed out for ccb 0xffffff000897a000 (req->ccb 0xffffff000897a000)
mpt0: attempting to abort req 0xffffff8000811310:54341 function 0
mpt0: mpt_wait_req(1) timed out
mpt0: mpt_recover_commands: abort timed-out. Resetting controller
mpt0: mpt_cam_event: 0x0
mpt0: completing timedout/aborted req 0xffffff8000811310:54341
mpt0: mpt_cam_event: 0x1b
mpt0: mpt_cam_event: 0x1b
----------
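To see at a glance how messages like these are distributed, here is a minimal Python sketch that tallies CAM errors per disk and timeouts and resets per controller. It is not part of any FreeBSD tool: the function name summarize and the three regular expressions are assumptions derived only from the log lines quoted above, so other driver versions may word their messages differently.

```python
import re
from collections import Counter

# Patterns are assumptions based solely on the messages quoted above.
# Example: (da20:mpt0:0:15:0): CAM status: SCSI Status Error
CAM_STATUS = re.compile(r"^\((da\d+):(mpt\d+):\d+:\d+:\d+\): CAM status: (.+)$")
# Example: mpt0: request 0x...:10990 timed out for ccb 0x...
MPT_TIMEOUT = re.compile(r"^(mpt\d+): request \S+ timed out")
# Example: mpt0: mpt_recover_commands: abort timed-out. Resetting controller
MPT_RESET = re.compile(
    r"^(mpt\d+): mpt_recover_commands: abort timed-out\. Resetting controller"
)


def summarize(log_text):
    """Count CAM errors per disk, and timeouts and resets per controller."""
    per_disk, timeouts, resets = Counter(), Counter(), Counter()
    for raw in log_text.splitlines():
        line = raw.strip()
        m = CAM_STATUS.match(line)
        if m:
            per_disk[m.group(1)] += 1
            continue
        m = MPT_TIMEOUT.match(line)
        if m:
            timeouts[m.group(1)] += 1
            continue
        m = MPT_RESET.match(line)
        if m:
            resets[m.group(1)] += 1
    return per_disk, timeouts, resets


# A few lines copied from the dmesg excerpt above.
SAMPLE = """\
(da20:mpt0:0:15:0): CAM status: SCSI Status Error
(da17:mpt0:0:12:0): CAM status: SCSI Status Error
mpt0: request 0xffffff800080b520:10990 timed out for ccb 0xffffff013227b000 (req->ccb 0xffffff013227b000)
mpt0: mpt_recover_commands: abort timed-out. Resetting controller
"""

if __name__ == "__main__":
    disks, timeouts, resets = summarize(SAMPLE)
    print("CAM errors per disk:", dict(disks))
    print("request timeouts:", dict(timeouts))
    print("controller resets:", dict(resets))
```

Run against the full output of dmesg, the per-disk counts may help when judging whether the trouble follows one disk or the controller as a whole.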
So, is this an OS/driver issue? Is it a bad controller? Bad cables? Bad disks?
As always, any help is greatly appreciated. Thanks!
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Tim Gustafson tjg at soe.ucsc.edu
Baskin School of Engineering 831-459-5354
UC Santa Cruz Baskin Engineering 317B
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-