DELL SAS5/E Controller bug

Stephane LAPIE stephane.lapie at darkbsd.org
Thu Jan 21 07:22:00 UTC 2010


John Baldwin wrote:
> On Wednesday 20 January 2010 10:09:43 am Stephane LAPIE wrote:
>> John Baldwin wrote:
>>> On Wednesday 20 January 2010 4:30:52 am Stephane LAPIE wrote:
>>>> Hello list,
>>>>
>>>> Basically I'm experiencing the same problem as described here :
>>>> https://forums.freebsd.org/showthread.php?t=9407 (linking for reference)
>>>>
>>>> Drive disconnections are not recognized immediately; instead I get
>>>> the following dmesg entries :
>>>> mpt0: mpt_cam_event: 0x16
>>>> mpt0: mpt_cam_event: 0x16
>>>>
>>>> (Sometimes I also get "mpt0: mpt_cam_event: 0x12" events)
>>>>
>>>> This is really crippling, as it literally paralyzes the ZFS pool until
>>>> the controller finally comes to its senses (...or until a disk gets
>>>> plugged back in, which flushes all the buffered failed SCSI
>>>> requests).
>>>>
>>>> Hardware is recognized as :
>>>> mpt0 at pci0:6:8:0:	class=0x010000 card=0x1f041028 chip=0x00541000 rev=0x01 hdr=0x00
>>>>     vendor = 'LSI Logic (Was: Symbios Logic, NCR)'
>>>>     device = 'SAS 3000 series, 8-port with 1068 -StorPort'
>>>>     class = mass storage
>>>>     subclass = SCSI
>>>>
>>>> Did anyone else experience this, or find a proper work-around ?
>>> Invoke 'camcontrol rescan' after removing a drive.  mptutil(8) does the 
>>> equivalent when adding and removing volumes to make up for the driver not 
>>> automatically rescanning.
>> I already tried reset/rescan via camcontrol, but after removing a drive, 
>> the process freezes (process status "D", Ctrl+T in terminal shows it's 
>> in a "cbwait" state, it can't be bg'ed). I did not wait for a hardware 
>> timeout, I tried replugging the drive, which released the ZFS and 
>> camcontrol locks.
>>
>>
>> Also, I tried poking around with mptutil and could obtain the following 
>> information, if it can be of any help :
>>
>> freebsd-r610# mptutil -u 0 show adapter
>> mpt0 Adapter:
>>         Board Name: SAS5e
>>     Board Assembly:
>>          Chip Name: C1068
>>      Chip Revision: UNUSED
>>        RAID Levels: none
>> mptutil: Reading config page header failed: Invalid configuration page
>>
>> (The above error message should be normal since this is not a RAID 
>> controller, though a bit jarring)
> 
> This patch should fix that:
> 
> Index: mpt_show.c
> ===================================================================
> --- mpt_show.c	(revision 202640)
> +++ mpt_show.c	(working copy)
> @@ -78,6 +78,7 @@
>  	CONFIG_PAGE_MANUFACTURING_0 *man0;
>  	CONFIG_PAGE_IOC_2 *ioc2;
>  	CONFIG_PAGE_IOC_6 *ioc6;
> +	U16 IOCStatus;
>  	int fd, comma;
>  
>  	if (ac != 1) {
> @@ -108,7 +109,7 @@
>  
>  	free(man0);
>  
> -	ioc2 = mpt_read_ioc_page(fd, 2, NULL);
> +	ioc2 = mpt_read_ioc_page(fd, 2, &IOCStatus);
>  	if (ioc2 != NULL) {
>  		printf("      RAID Levels:");
>  		comma = 0;
> @@ -151,9 +152,10 @@
>  			printf(" none");
>  		printf("\n");
>  		free(ioc2);
> -	}
> +	} else if (IOCStatus != MPI_IOCSTATUS_CONFIG_INVALID_PAGE)
> +		warnx("mpt_read_ioc_page(2): %s", mpt_ioc_status(IOCStatus));
>  
> -	ioc6 = mpt_read_ioc_page(fd, 6, NULL);
> +	ioc6 = mpt_read_ioc_page(fd, 6, &IOCStatus);
>  	if (ioc6 != NULL) {
>  		display_stripe_map("    RAID0 Stripes",
>  		    ioc6->SupportedStripeSizeMapIS);
> @@ -172,7 +174,8 @@
>  			printf("-%u", ioc6->MaxDrivesIME);
>  		printf("\n");
>  		free(ioc6);
> -	}
> +	} else if (IOCStatus != MPI_IOCSTATUS_CONFIG_INVALID_PAGE)
> +		warnx("mpt_read_ioc_page(2): %s", mpt_ioc_status(IOCStatus));
>  
>  	/* TODO: Add an ioctl to fetch IOC_FACTS and print firmware version. */
>  
> 
>> However, the following is a bit disturbing :
>>
>> freebsd-r610# mptutil -u 0 show drives
>> mpt0 Physical Drives:
>>   da0 (  932G) ONLINE <SEAGATE ST31000640SS MS04> SAS bus 0 id 0
>>   da1 (  932G) ONLINE <SEAGATE ST31000640SS MS04> SAS bus 0 id 1
>>   da2 (  932G) ONLINE <SEAGATE ST31000640SS MS04> SAS bus 0 id 2
>>   da3 (  932G) ONLINE <SEAGATE ST31000640SS MS04> SAS bus 0 id 3
>>   da4 (  932G) ONLINE <SEAGATE ST31000640SS MS04> SAS bus 0 id 4
>>   da5 (  932G) ONLINE <SEAGATE ST31000640SS MS04> SAS bus 0 id 5
>>   da6 (  932G) ONLINE <SEAGATE ST31000640SS MS05> SAS bus 0 id 6
>>   da7 (  932G) ONLINE <SEAGATE ST31000640SS MS05> SAS bus 0 id 7
>>   da8 (  932G) ONLINE <SEAGATE ST31000640SS MS05> SAS bus 0 id 8
>>   da9 (  932G) ONLINE <SEAGATE ST31000640SS MS05> SAS bus 0 id 9
>> da10 (  932G) ONLINE <SEAGATE ST31000640SS MS05> SAS bus 0 id 10
>> da11 (  932G) ONLINE <SEAGATE ST31000640SS MS05> SAS bus 0 id 11
>> da12 (  932G) ONLINE <SEAGATE ST31000640SS MS05> SAS bus 0 id 12
>> da13 (  932G) ONLINE <SEAGATE ST31000640SS MS05> SAS bus 0 id 13
>> da14 (  932G) ONLINE <SEAGATE ST31000640SS MS05> SAS bus 0 id 14
>> da15 (  136G) ONLINE <Dell VIRTUAL DISK 1028> SAS bus 0 id 0
>>
>> The above listing seems weird, as da15 should belong to mpt1.
> 
> Agreed.  I specifically ask CAM to return results only for devices on bus 0
> of mptX.  When I debugged this before, I used gdb and set a breakpoint in
> mpt_fetch_disks() so I could examine the structures that CAM returned.  This
> is the code that identifies mptX vs mpt<any>:
> 
> 		/* Match mptX bus 0. */
> 		ccb.cdm.patterns[0].type = DEV_MATCH_BUS;
> 		b = &ccb.cdm.patterns[0].pattern.bus_pattern;
> 		snprintf(b->dev_name, sizeof(b->dev_name), "mpt");
> 		b->unit_number = mpt_unit;
> 		b->bus_id = 0;
> 		b->flags = BUS_MATCH_NAME | BUS_MATCH_UNIT | BUS_MATCH_BUS_ID;
> 
> 'mpt_unit' is a global variable that is set to the value of the 'u'
> parameter.
> 
>> freebsd-r610# mptutil -u 1 show drives
>> mptutil: mpt_fetch_disks got wrong CAM matches
>> mpt1 Physical Drives:
>>     0 (  137G) ONLINE <FUJITSU MBE2147RC D701> SAS bus 0 id 1
>>     1 (  137G) ONLINE <FUJITSU MBE2147RC D701> SAS bus 0 id 9
> 
> Similarly, I would use gdb to examine the reply from CAM here to see why
> it got 'wrong CAM matches'.  The code expects the first match to match
> the bus and the next N matches should be 'daX' devices.
> 
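To make the expectation described above concrete, here is a minimal
sketch in C. The "struct match" type and count_da_matches() are
hypothetical, simplified stand-ins; they are not the real CAM match result
structures and not the actual mptutil code. The sketch only illustrates
the rule "the first entry must be the bus, and the entries of interest
after it are 'da' peripherals".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for the CAM match result types. */
enum match_type { MATCH_BUS, MATCH_PERIPH };

struct match {
	enum match_type type;
	char name[16];	/* driver name, e.g. "mpt", "da" or "pass" */
	int unit;	/* unit number of the bus or peripheral */
};

/*
 * Return the number of "da" peripherals that follow the leading bus
 * entry, or -1 when the reply is empty or does not start with a bus
 * match (the "wrong CAM matches" situation).
 */
static int
count_da_matches(const struct match *m, size_t n)
{
	size_t i;
	int count = 0;

	assert(m != NULL || n == 0);
	if (n == 0 || m[0].type != MATCH_BUS)
		return (-1);
	for (i = 1; i < n; i++) {
		if (m[i].type == MATCH_PERIPH && strcmp(m[i].name, "da") == 0)
			count++;
	}
	return (count);
}
```

Dumping the real reply in gdb and comparing it against this rule should
show which entry violates the expectation on mpt1.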

I just applied your patch to the mptutil source, which now prints :

freebsd-r610# mptutil show adapter
mpt0 Adapter:
       Board Name: SAS5e
   Board Assembly:
	Chip Name: C1068
    Chip Revision: UNUSED
      RAID Levels: none
mptutil: mpt_read_ioc_page(2): Invalid configuration page

I will try the gdb approach once I get a chance to install the
source tree on this test machine.


Also, here is the dmesg trace from removing da0 and da6 and then trying
to have the system register the removal via "camcontrol rescan 0" :

-> Unplugging "da0" and "da6" :
mpt0: mpt_cam_event: 0x16
mpt0: mpt_cam_event: 0x12
mpt0: mpt_cam_event: 0x16
mpt0: mpt_cam_event: 0x16
mpt0: mpt_cam_event: 0x12
mpt0: mpt_cam_event: 0x16

-> Then running "camcontrol rescan 0" (which sits in the "cbwait" state and
finishes after 187s of real time) :
mpt0: request 0xffffff80005bcea0:5936 timed out for ccb 0xffffff00032d4000 (req->ccb 0xffffff00032d4000)
mpt0: attempting to abort req 0xffffff80005bcea0:5936 function 0
mpt0: mpt_wait_req(1) timed out
mpt0: mpt_recover_commands: abort timed-out. Resetting controller
mpt0: mpt_cam_event: 0x0
mpt0: completing timedout/aborted req 0xffffff80005bcea0:5936
mpt0: mpt_cam_event: 0x16
mpt0: mpt_cam_event: 0x12
mpt0: mpt_cam_event: 0x12
mpt0: mpt_cam_event: 0x12
mpt0: mpt_cam_event: 0x12
mpt0: mpt_cam_event: 0x12
mpt0: mpt_cam_event: 0x12
mpt0: mpt_cam_event: 0x12
mpt0: mpt_cam_event: 0x12
mpt0: mpt_cam_event: 0x12
mpt0: mpt_cam_event: 0x12
mpt0: mpt_cam_event: 0x12
mpt0: mpt_cam_event: 0x12
mpt0: mpt_cam_event: 0x12
mpt0: mpt_cam_event: 0x12
mpt0: mpt_cam_event: 0x12
mpt0: mpt_cam_event: 0x12
mpt0: mpt_cam_event: 0x12
mpt0: mpt_cam_event: 0x12
mpt0: mpt_cam_event: 0x12
mpt0: mpt_cam_event: 0x12
mpt0: mpt_cam_event: 0x12
mpt0: mpt_cam_event: 0x12
mpt0: mpt_cam_event: 0x12
mpt0: mpt_cam_event: 0x12
mpt0: mpt_cam_event: 0x12
mpt0: mpt_cam_event: 0x12
mpt0: mpt_cam_event: 0x12
mpt0: mpt_cam_event: 0x12
mpt0: mpt_cam_event: 0x12
mpt0: mpt_cam_event: 0x12
mpt0: mpt_cam_event: 0x16
(da0:mpt0:0:0:0): lost device
(da0:mpt0:0:0:0): Synchronize cache failed, status == 0x4a, scsi status == 0x0
(da0:mpt0:0:0:0): removing device entry
(da6:mpt0:0:6:0): lost device
(da6:mpt0:0:6:0): Synchronize cache failed, status == 0x4a, scsi status == 0x0
(da6:mpt0:0:6:0): removing device entry

-> Then replugging the drive "da0" :
mpt0: mpt_cam_event: 0x16
mpt0: mpt_cam_event: 0x12
mpt0: mpt_cam_event: 0x16

-> Then running "camcontrol rescan 0" (which responds within a few
seconds, between 7 and 10s) :
da0 at mpt0 bus 0 target 6 lun 0
da0: <SEAGATE ST31000640SS MS05> Fixed Direct Access SCSI-5 device
da0: 300.000MB/s transfers
da0: Command Queueing enabled
da0: 953869MB (1953525168 512 byte sectors: 255H 63S/T 121601C)

-> Then replugging the drive "da6" :
mpt0: mpt_cam_event: 0x16
mpt0: mpt_cam_event: 0x12
mpt0: mpt_cam_event: 0x16

-> Then running "camcontrol rescan 0" (which responds within a few
seconds, between 7 and 10s) :
da6 at mpt0 bus 0 target 6 lun 0
da6: <SEAGATE ST31000640SS MS05> Fixed Direct Access SCSI-5 device
da6: 300.000MB/s transfers
da6: Command Queueing enabled
da6: 953869MB (1953525168 512 byte sectors: 255H 63S/T 121601C)

Is there any documentation or hint as to what those mpt_cam_event codes
mean? I could whip up a quick patch to at least change the display so that
one could tell what they are.
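As a rough idea of what such a display patch could look like, here is a
small standalone sketch in C. The event names and values in the table are
an assumption: they are what I believe the LSI MPI header
(sys/dev/mpt/mpilib/mpi_ioc.h, the MPI_EVENT_* defines) uses, and they
should be checked against that header before relying on them. The helper
names mpi_event_name() and format_event() are made up for this example
and do not exist in the driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct event_name {
	uint32_t code;
	const char *name;
};

/* Assumed values; verify against the MPI_EVENT_* defines in mpi_ioc.h. */
static const struct event_name mpi_event_names[] = {
	{ 0x00, "NONE" },
	{ 0x0F, "SAS_DEVICE_STATUS_CHANGE" },
	{ 0x12, "SAS_PHY_LINK_STATUS" },
	{ 0x16, "SAS_DISCOVERY" },
};

/* Map an event code to a readable name, or "UNKNOWN" if not listed. */
static const char *
mpi_event_name(uint32_t code)
{
	size_t i;

	for (i = 0; i < sizeof(mpi_event_names) / sizeof(mpi_event_names[0]);
	    i++) {
		if (mpi_event_names[i].code == code)
			return (mpi_event_names[i].name);
	}
	return ("UNKNOWN");
}

/* Build a log line such as "mpt0: mpt_cam_event: 0x16 (SAS_DISCOVERY)". */
static int
format_event(char *buf, size_t len, const char *dev, uint32_t code)
{
	assert(buf != NULL && len > 0 && dev != NULL);
	return (snprintf(buf, len, "%s: mpt_cam_event: 0x%x (%s)", dev,
	    (unsigned int)code, mpi_event_name(code)));
}
```

If the assumed values are right, the two events seen in the traces above
would be PHY link status changes (0x12) and SAS discovery notifications
(0x16).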

It feels like the 0x12 and 0x16 events have to be handled so that the
unplugged device is invalidated, and the next request fails immediately
instead of timing out.

This is actually not as crippling as I initially thought for use with
ZFS, but waiting 190s (3x60s timeouts + normal execution of camcontrol?)
seems excessive.
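For what it is worth, the numbers are consistent with that guess. The
sketch below only restates the arithmetic; the 60s timeout and the count
of three timeouts are my assumption from above, not something read out of
the driver, and expected_rescan_seconds() is a made-up helper.

```c
#include <assert.h>

/* Assumed, not taken from the driver: three timeouts of 60s each. */
enum { TIMEOUT_S = 60, TIMEOUT_COUNT = 3 };

/* Total wait = assumed timeouts + the duration of a normal rescan. */
static int
expected_rescan_seconds(int normal_rescan_s)
{
	assert(normal_rescan_s >= 0);
	return (TIMEOUT_COUNT * TIMEOUT_S + normal_rescan_s);
}
```

With the 7 to 10s a normal rescan took, this gives 187 to 190s, which
brackets the 187s measured above.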


On a separate note (and this is not a real problem), my test case of
unplugging two drives from a 2xraidz1 pool made me notice a little
quirk: upon replugging a drive, the system reuses the first free daX
slot, so the drive can end up with a different device name. I guess that
could be worked around with either of these two methods:
- Binding bus:target:lun to a fixed daX number
- Exporting and re-importing the pool (zpool export/import) upon disk
failure (though this is a bit constraining)
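For the first method, my understanding is that CAM devices can be wired
down through /boot/device.hints, in the form described in cam(4). I have
not verified that this behaves as expected with mpt and SAS targets, so
treat the following as an untested sketch. It is a small C generator for
the hint lines; format_da_hint() and print_hints() are made-up helper
names, and it assumes that scbus0 is the bus attached to mpt0 and that
daN should map to target N, LUN 0.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Format the hint lines that wire da<unit> to <target>, LUN 0, on
 * scbus<bus>. Returns what snprintf(3) returns.
 */
static int
format_da_hint(char *buf, size_t len, int unit, int bus, int target)
{
	assert(buf != NULL && len > 0);
	return (snprintf(buf, len,
	    "hint.da.%d.at=\"scbus%d\"\n"
	    "hint.da.%d.target=\"%d\"\n"
	    "hint.da.%d.unit=\"0\"\n",
	    unit, bus, unit, target, unit));
}

/* Print the scbus0-to-mpt0 binding, then one stanza per drive. */
static int
print_hints(FILE *out, int ndrives)
{
	char buf[128];
	int i;

	assert(out != NULL && ndrives >= 0);
	fprintf(out, "hint.scbus.0.at=\"mpt0\"\n");
	for (i = 0; i < ndrives; i++) {
		format_da_hint(buf, sizeof(buf), i, 0, i);
		fputs(buf, out);
	}
	return (ndrives);
}
```

Calling print_hints(stdout, 15) would emit stanzas for da0 through da14,
matching the 15 drives listed on mpt0 earlier.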

Thanks again for your time,
-- 
Stephane LAPIE, EPITA SRS, Promo 2005
"Even when they have digital readouts, I can't understand them."
--MegaTokyo

More information about the freebsd-hardware mailing list