Firewire disk/tape access stopped working after recent CAM commit

Kenneth D. Merry ken at freebsd.org
Mon Jan 23 18:16:07 UTC 2012


On Sun, Jan 22, 2012 at 20:52:38 -0600, Richard Todd wrote:
> Hi.  I tried upgrading my amd64 10-CURRENT box to the most recent -CURRENT code
> and found that the new kernel couldn't find my two disks and tape drive that
> are on a Firewire bus.  All the USB and AHCI-attached hardware still showed
> up okay, it's just the Firewire stuff that failed to show up properly on boot.
> Spent today doing binary search to find the responsible commit and it looks
> to be this one: 
> 
>   r230000 | ken | 2012-01-11 18:41:48 -0600 (Wed, 11 Jan 2012) | 72 lines
> 
>   Fix a race condition in CAM peripheral free handling, locking
>   in the CAM XPT bus traversal code, and a number of other periph level
>   issues.
> 
> Not sure what in this commit triggers the problem, or why it just hits 
> Firewire and not the rest of the system.   I've built kernels both right
> before and right after the r230000 commit, with CAM debugging turned on real
> high on the firewire bus in question, bus 0 (hardwired to that number in
> device.hints, if that matters)
> 
>  options CAMDEBUG
>  options CAM_DEBUG_BUS=0
>  options CAM_DEBUG_TARGET=-1
>  options CAM_DEBUG_LUN=-1
>  options CAM_DEBUG_FLAGS=CAM_DEBUG_INFO|CAM_DEBUG_TRACE|CAM_DEBUG_CDB
> 
> and got dmesgs of both the "bad" (r230000) and "good" (pre-r230000) kernels,
> which I've put online at http://ln.servalan.com/rmtodd/bug1/dmesg.bad and
> http://ln.servalan.com/rmtodd/bug1/dmesg.good, respectively.  They're a bit
> lengthy, what with all that debug info.  Grepping out the info for one of
> the targets (disk 0, sbp0:0:0:0) and just looking at the lines for that one,
> we see that the "good" kernel does a lot more with that target, starting
> with the "(noperiph:sbp0:0:0:0): xpt_compile_path" bit, that the "bad"
> kernel doesn't do, as seen in the diff below. 
> 
> Not sure what's going on here, but if anyone has suggestions on more things
> I can test/debug code I can add to track this down further, let me know.

Thanks for testing this out, and for sending all of the debugging output!

If you can, please try the attached patch and see if it has any impact on
the problem.  That commit has a bug: we shouldn't invalidate all LUNs on a
target when we get a status of CAM_DEV_NOT_THERE, only the device(s)
specified by the path in the original CCB.

It may be that we need to do a more thorough audit of how various SIM
drivers are using the CAM_DEV_NOT_THERE status.
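
As a rough illustration of the LUN choice the attached patch makes, here is
a minimal standalone sketch.  It is not the kernel code: the sketch_* names,
the enum, and the wildcard value are hypothetical stand-ins so that it
compiles outside the kernel, and the helper function itself does not exist
in CAM.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's CAM types and constants. */
typedef uint32_t sketch_lun_id_t;
#define SKETCH_LUN_WILDCARD ((sketch_lun_id_t)~0u)

enum sketch_status {
	SKETCH_SEL_TIMEOUT,	/* stand-in for a selection timeout */
	SKETCH_DEV_NOT_THERE	/* stand-in for CAM_DEV_NOT_THERE */
};

/*
 * Pick the LUN used to build the path that gets invalidated.
 * A "device not there" status only affects the LUN named in the
 * original CCB's path; any other status handled here (a selection
 * timeout) is taken to mean every LUN on the target is gone.
 */
static sketch_lun_id_t
sketch_lun_to_invalidate(enum sketch_status status, sketch_lun_id_t ccb_lun)
{
	if (status == SKETCH_DEV_NOT_THERE)
		return (ccb_lun);
	return (SKETCH_LUN_WILDCARD);
}
```

With this, a "not there" status reported for LUN 3 yields 3, while a
selection timeout on the same path yields the wildcard.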

Thanks,

Ken
-- 
Kenneth Merry
ken at FreeBSD.ORG
-------------- next part --------------
==== //depot/users/kenm/FreeBSD-test2/sys/cam/cam_periph.c#7 - /usr/home/kenm/perforce4/kenm/FreeBSD-test2/sys/cam/cam_periph.c ====
*** /tmp/tmp.87992.13	Mon Jan 23 11:11:36 2012
--- /usr/home/kenm/perforce4/kenm/FreeBSD-test2/sys/cam/cam_periph.c	Mon Jan 23 10:53:13 2012
***************
*** 1864,1876 ****
  	case CAM_DEV_NOT_THERE:
  	{
  		struct cam_path *newpath;
  
  		error = ENXIO;
  		/* Should we do more if we can't create the path?? */
  		if (xpt_create_path(&newpath, periph,
  				    xpt_path_path_id(ccb->ccb_h.path),
  				    xpt_path_target_id(ccb->ccb_h.path),
! 				    CAM_LUN_WILDCARD) != CAM_REQ_CMP) 
  			break;
  
  		/*
--- 1864,1889 ----
  	case CAM_DEV_NOT_THERE:
  	{
  		struct cam_path *newpath;
+ 		lun_id_t lun_id;
  
  		error = ENXIO;
+ 
+ 		/*
+ 		 * For a selection timeout, we consider all of the LUNs on
+ 		 * the target to be gone.  If the status is CAM_DEV_NOT_THERE,
+ 		 * then we only get rid of the device(s) specified by the
+ 		 * path in the original CCB.
+ 		 */
+ 		if (status == CAM_DEV_NOT_THERE)
+ 			lun_id = xpt_path_lun_id(ccb->ccb_h.path);
+ 		else
+ 			lun_id = CAM_LUN_WILDCARD;
+ 
  		/* Should we do more if we can't create the path?? */
  		if (xpt_create_path(&newpath, periph,
  				    xpt_path_path_id(ccb->ccb_h.path),
  				    xpt_path_target_id(ccb->ccb_h.path),
! 				    lun_id) != CAM_REQ_CMP) 
  			break;
  
  		/*

