DELL SAS5/E Controller bug
John Baldwin
jhb at freebsd.org
Thu Jan 21 13:12:54 UTC 2010
On Thursday 21 January 2010 2:21:48 am Stephane LAPIE wrote:
> John Baldwin wrote:
> > On Wednesday 20 January 2010 10:09:43 am Stephane LAPIE wrote:
> >> John Baldwin wrote:
> >>> On Wednesday 20 January 2010 4:30:52 am Stephane LAPIE wrote:
> >>>> Hello list,
> >>>>
> >>>> Basically I'm experiencing the same problem as described here :
> >>>> https://forums.freebsd.org/showthread.php?t=9407 (linking for reference)
> >>>>
> >>>> Drive disconnections are not recognized instantly, and instead I get
> >>>> the following dmesg entries :
> >>>> mpt0: mpt_cam_event: 0x16
> >>>> mpt0: mpt_cam_event: 0x16
> >>>>
> >>>> (Sometimes I also get "mpt0: mpt_cam_event: 0x12" events)
> >>>>
> >>>> This is really crippling, as it literally paralyzes the ZFS pool until
> >>>> the controller finally comes to its senses (...or until a disk gets
> >>>> replugged in, which provokes a flush of all the buffered failed SCSI
> >>>> requests).
> >>>>
> >>>> Hardware is recognized as :
> >>>> mpt0@pci0:6:8:0: class=0x010000 card=0x1f041028 chip=0x00541000 rev=0x01 hdr=0x00
> >>>> vendor = 'LSI Logic (Was: Symbios Logic, NCR)'
> >>>> device = 'SAS 3000 series, 8-port with 1068 -StorPort'
> >>>> class = mass storage
> >>>> subclass = SCSI
> >>>>
> >>>> Did anyone else experience this, or find a proper work-around ?
> >>> Invoke 'camcontrol rescan' after removing a drive. mptutil(8) does the
> >>> equivalent when adding and removing volumes to make up for the driver not
> >>> automatically rescanning.
> >> I already tried reset/rescan via camcontrol, but after removing a drive,
> >> the process freezes (process status "D", Ctrl+T in terminal shows it's
> >> in a "cbwait" state, it can't be bg'ed). I did not wait for a hardware
> >> timeout, I tried replugging the drive, which released the ZFS and
> >> camcontrol locks.
> >>
> >>
> >> Also, I tried poking around with mptutil and could obtain the following
> >> information, if it can be of any help :
> >>
> >> freebsd-r610# mptutil -u 0 show adapter
> >> mpt0 Adapter:
> >> Board Name: SAS5e
> >> Board Assembly:
> >> Chip Name: C1068
> >> Chip Revision: UNUSED
> >> RAID Levels: none
> >> mptutil: Reading config page header failed: Invalid configuration page
> >>
> >> (The above error message should be normal since this is not a RAID
> >> controller, though a bit jarring)
> >
> > This patch should fix that:
> >
> > Index: mpt_show.c
> > ===================================================================
> > --- mpt_show.c (revision 202640)
> > +++ mpt_show.c (working copy)
> > @@ -78,6 +78,7 @@
> > CONFIG_PAGE_MANUFACTURING_0 *man0;
> > CONFIG_PAGE_IOC_2 *ioc2;
> > CONFIG_PAGE_IOC_6 *ioc6;
> > + U16 IOCStatus;
> > int fd, comma;
> >
> > if (ac != 1) {
> > @@ -108,7 +109,7 @@
> >
> > free(man0);
> >
> > - ioc2 = mpt_read_ioc_page(fd, 2, NULL);
> > + ioc2 = mpt_read_ioc_page(fd, 2, &IOCStatus);
> > if (ioc2 != NULL) {
> > printf(" RAID Levels:");
> > comma = 0;
> > @@ -151,9 +152,10 @@
> > printf(" none");
> > printf("\n");
> > free(ioc2);
> > - }
> > + } else if (IOCStatus != MPI_IOCSTATUS_CONFIG_INVALID_PAGE)
> > + warnx("mpt_read_ioc_page(2): %s", mpt_ioc_status(IOCStatus));
> >
> > - ioc6 = mpt_read_ioc_page(fd, 6, NULL);
> > + ioc6 = mpt_read_ioc_page(fd, 6, &IOCStatus);
> > if (ioc6 != NULL) {
> > display_stripe_map(" RAID0 Stripes",
> > ioc6->SupportedStripeSizeMapIS);
> > @@ -172,7 +174,8 @@
> > printf("-%u", ioc6->MaxDrivesIME);
> > printf("\n");
> > free(ioc6);
> > - }
> > + } else if (IOCStatus != MPI_IOCSTATUS_CONFIG_INVALID_PAGE)
> > + warnx("mpt_read_ioc_page(2): %s", mpt_ioc_status(IOCStatus));
> >
> > /* TODO: Add an ioctl to fetch IOC_FACTS and print firmware version. */
> >
> >
> >> However, the following is a bit disturbing :
> >>
> >> freebsd-r610# mptutil -u 0 show drives
> >> mpt0 Physical Drives:
> >> da0 ( 932G) ONLINE <SEAGATE ST31000640SS MS04> SAS bus 0 id 0
> >> da1 ( 932G) ONLINE <SEAGATE ST31000640SS MS04> SAS bus 0 id 1
> >> da2 ( 932G) ONLINE <SEAGATE ST31000640SS MS04> SAS bus 0 id 2
> >> da3 ( 932G) ONLINE <SEAGATE ST31000640SS MS04> SAS bus 0 id 3
> >> da4 ( 932G) ONLINE <SEAGATE ST31000640SS MS04> SAS bus 0 id 4
> >> da5 ( 932G) ONLINE <SEAGATE ST31000640SS MS04> SAS bus 0 id 5
> >> da6 ( 932G) ONLINE <SEAGATE ST31000640SS MS05> SAS bus 0 id 6
> >> da7 ( 932G) ONLINE <SEAGATE ST31000640SS MS05> SAS bus 0 id 7
> >> da8 ( 932G) ONLINE <SEAGATE ST31000640SS MS05> SAS bus 0 id 8
> >> da9 ( 932G) ONLINE <SEAGATE ST31000640SS MS05> SAS bus 0 id 9
> >> da10 ( 932G) ONLINE <SEAGATE ST31000640SS MS05> SAS bus 0 id 10
> >> da11 ( 932G) ONLINE <SEAGATE ST31000640SS MS05> SAS bus 0 id 11
> >> da12 ( 932G) ONLINE <SEAGATE ST31000640SS MS05> SAS bus 0 id 12
> >> da13 ( 932G) ONLINE <SEAGATE ST31000640SS MS05> SAS bus 0 id 13
> >> da14 ( 932G) ONLINE <SEAGATE ST31000640SS MS05> SAS bus 0 id 14
> >> da15 ( 136G) ONLINE <Dell VIRTUAL DISK 1028> SAS bus 0 id 0
> >>
> >> The above listing seems weird, as da15 should belong to mpt1.
> >
> > > Agreed. I specifically ask that CAM only return results for devices on bus 0
> > > of mptX. When I debugged this before, I used gdb and set a breakpoint in
> > > mpt_fetch_disks() so I could examine the structures that CAM returned. This
> > > is the code that identifies mptX vs mpt<any>:
> >
> > /* Match mptX bus 0. */
> > ccb.cdm.patterns[0].type = DEV_MATCH_BUS;
> > b = &ccb.cdm.patterns[0].pattern.bus_pattern;
> > snprintf(b->dev_name, sizeof(b->dev_name), "mpt");
> > b->unit_number = mpt_unit;
> > b->bus_id = 0;
> > b->flags = BUS_MATCH_NAME | BUS_MATCH_UNIT | BUS_MATCH_BUS_ID;
> >
> > 'mpt_unit' is a global variable that is set to the value of the 'u'
> > parameter.
> >
> >> freebsd-r610# mptutil -u 1 show drives
> >> mptutil: mpt_fetch_disks got wrong CAM matches
> >> mpt1 Physical Drives:
> >> 0 ( 137G) ONLINE <FUJITSU MBE2147RC D701> SAS bus 0 id 1
> >> 1 ( 137G) ONLINE <FUJITSU MBE2147RC D701> SAS bus 0 id 9
> >
> > > Similarly I would use gdb to examine the reply from CAM here to see why
> > > it got 'wrong CAM matches'. The code expects the first match to match
> > > the bus and the next N matches should be 'daX' devices.
> >
>
> I just applied your patch to mptutil source, which now returns :
>
> freebsd-r610# mptutil show adapter
> mpt0 Adapter:
> Board Name: SAS5e
> Board Assembly:
> Chip Name: C1068
> Chip Revision: UNUSED
> RAID Levels: none
> mptutil: mpt_read_ioc_page(2): Invalid configuration page

Gah, that should be the case that I ignore. Can you replace the second
warnx() call I added with this:

	warnx("mpt_read_ioc_page(6): %s (%x)", mpt_ioc_status(IOCStatus),
	    IOCStatus);

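
For reference, the check the patch is trying to make can be written out on its
own. This is only a sketch, not the mptutil code: should_warn() is a made-up
helper, the constant values are what I believe the MPI headers define (please
verify them against mpi.h), and masking off the high flag bit before comparing
is an assumption, not a confirmed diagnosis of what you are seeing.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed values; verify against the MPI headers before relying on them. */
#define SKETCH_IOCSTATUS_MASK                 0x7FFF
#define SKETCH_IOCSTATUS_CONFIG_INVALID_PAGE  0x0022

/*
 * Hypothetical helper: decide whether a failed config page read deserves a
 * warning.  A missing page is expected on a non-RAID controller, so stay
 * quiet for that one status and warn for everything else.
 */
static bool
should_warn(uint16_t ioc_status)
{
	/* Assumption: strip flag bits so only the status code is compared. */
	uint16_t code = ioc_status & SKETCH_IOCSTATUS_MASK;

	return (code != SKETCH_IOCSTATUS_CONFIG_INVALID_PAGE);
}
```

The extra (%x) in the warnx() above is there so we can see the raw value and
tell whether a comparison like this one would have matched.
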
> I will give a try on the gdb thing once I get a chance of installing the
> source tree on this test machine.
>
>
> Also, I pasted the dmesg trace of trying to remove da0 and da6 and
> trying to have the system register the removal via a "camcontrol rescan 0" :
>
> -> Unplugging "da0" and "da6" :
> mpt0: mpt_cam_event: 0x16
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x16
> mpt0: mpt_cam_event: 0x16
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x16
>
> -> Then running "camcontrol rescan 0" (which leaves "cbwait" state and
> finishes at 187s real time)
> mpt0: request 0xffffff80005bcea0:5936 timed out for ccb 0xffffff00032d4000 (req->ccb 0xffffff00032d4000)
> mpt0: attempting to abort req 0xffffff80005bcea0:5936 function 0
> mpt0: mpt_wait_req(1) timed out
> mpt0: mpt_recover_commands: abort timed-out. Resetting controller
> mpt0: mpt_cam_event: 0x0
> mpt0: completing timedout/aborted req 0xffffff80005bcea0:5936
> mpt0: mpt_cam_event: 0x16
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x16
> (da0:mpt0:0:0:0): lost device
> (da0:mpt0:0:0:0): Synchronize cache failed, status == 0x4a, scsi status == 0x0
> (da0:mpt0:0:0:0): removing device entry
> (da6:mpt0:0:6:0): lost device
> (da6:mpt0:0:6:0): Synchronize cache failed, status == 0x4a, scsi status == 0x0
> (da6:mpt0:0:6:0): removing device entry
>
> -> Then replugging the drive "da0" :
> mpt0: mpt_cam_event: 0x16
> mpt0: mpt_cam_event: 0x12
> mpt0: mpt_cam_event: 0x16

I know that the rescan after removing a device is a bit messy (lots of
messages before daX actually goes away), but I don't recall it taking such a
long time.

> Is there any documentation or hint as to what those mpt_cam_event codes are?
> I could whip up a quick patch to at least change the display so one
> could figure out what these are.
>
> It feels like the 0x12 and 0x16 events have to be handled to invalidate the
> device that has been unplugged, so the next request won't time out but
> fail directly.

The documentation is not public. The 0x12 and 0x16 messages are events that
I have seen. You can try talking to scottl@ as he has access to the docs.
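
If you do want to make the display friendlier in the meantime, a small lookup
table is enough. A rough sketch follows. The names in it are my reading of the
event definitions in the MPI headers that ship with the driver (under
sys/dev/mpt/mpilib) and are assumptions you should check there, and
mpt_event_name() is a hypothetical helper, not an existing function in the
driver.

```c
#include <stddef.h>
#include <stdint.h>

/* One entry per event code that showed up in this thread. */
struct event_name {
	uint32_t	 code;
	const char	*name;
};

/* Assumed names; verify each one against the MPI headers. */
static const struct event_name event_names[] = {
	{ 0x00, "NONE" },
	{ 0x12, "SAS_PHY_LINK_STATUS" },
	{ 0x16, "SAS_DISCOVERY" },
};

/* Hypothetical helper: map an event code to a printable name. */
static const char *
mpt_event_name(uint32_t code)
{
	size_t i;

	for (i = 0; i < sizeof(event_names) / sizeof(event_names[0]); i++) {
		if (event_names[i].code == code)
			return (event_names[i].name);
	}
	return ("UNKNOWN");
}
```

The driver's message could then print both the name and the raw code, so
nothing is lost for the events that are not in the table.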
--
John Baldwin