Repeated msgs & kernel panic w/ r246437 (Revamp the CAM enclosure services driver)

Kenneth D. Merry ken at freebsd.org
Tue Apr 23 14:18:48 UTC 2013


On Tue, Apr 23, 2013 at 11:09:42 +0300, Alexander Motin wrote:
> On 22.04.2013 06:00, John wrote:
> >Hi Folks,
> >
> >    After updating one of our servers to the latest stable image,
> >commit r246437 appears to be causing it to panic.
> >
> >The commit:
> >
> >http://svnweb.freebsd.org/base?view=revision&revision=246437
> >
> >What one of our servers looks like:
> >
> >http://people.freebsd.org/~jwd/zfsnfsserver.jpg
> >
> >The last known working commit:
> >
> >http://people.freebsd.org/~jwd/r246437/dmesg.r246431.clean.txt
> >
> >With commit r246437:
> >
> >http://people.freebsd.org/~jwd/r246437/dmesg.r246437.log.txt
> >
> >Note, most of the dmesg output is related to the ses devices. It
> >repeats itself multiple times before the panic.
> >
> >ses39: ses0,pass20: Element descriptor: '            '
> >ses39: ses0,pass20: SAS Expander: 24 Physses39:  phy 0: connector 255 other 255
> >ses39:  phy 1: connector 255 other 255
> >ses39:  phy 2: connector 255 other 255
> >ses39:  phy 3: connector 255 other 255
> >ses39:  phy 4: connector 255 other 255
> >ses39:  phy 5: connector 255 other 255
> >ses39:  phy 6: connector 255 other 255
> >
> >etc, etc...
> 
> That is not my part of the code, but I think these are just overly 
> verbose debug messages that should be hidden.

Yes, it is probably too verbose, especially on such a large system.

> >After just a few minutes, the system panics. A pair of images
> >of the screen (sorry, no serial console at this time):
> >
> >Panic: http://people.freebsd.org/~jwd/r246437/20130419_160143.jpg
> >
> >bt: http://people.freebsd.org/~jwd/r246437/20130419_110158.jpg
> 
> Although you mention the "latest stable image", I believe your kernel 
> is not the latest 9-STABLE. Your backtrace reminds me of locking 
> problems that should already be fixed from several sides. For example, 
> on present 9-STABLE ses_path_iter_devid_callback() doesn't call 
> xpt_create_path(), but calls xpt_create_path_unlocked() instead. If you 
> can reproduce the issue with the latest 9-STABLE, please provide the 
> relevant information.

I agree.  I added the xpt_create_path_unlocked() call to fix a
panic with a stack trace just like the one above.  The problem looks
specific to running exactly r246437, without the later fixes.
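
For readers unfamiliar with the locked/unlocked naming, here is a minimal
user-space sketch of the pattern.  It is not the CAM code: the struct and
function names below are hypothetical, and it assumes (as the naming and the
fix above suggest) that the plain variant expects the caller to already hold
the relevant lock, while the "_unlocked" variant takes the lock itself.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/*
 * Hypothetical stand-in for a per-bus lock.  The "held" flag is tracked by
 * hand (single-threaded demo only) so the locked entry point can check it,
 * much as a kernel would assert lock ownership.
 */
struct bus_lock {
	pthread_mutex_t	mtx;
	bool		held;
};

static struct bus_lock demo_bus = { PTHREAD_MUTEX_INITIALIZER, false };

static void
bus_lock_acquire(struct bus_lock *b)
{
	pthread_mutex_lock(&b->mtx);
	b->held = true;
}

static void
bus_lock_release(struct bus_lock *b)
{
	b->held = false;
	pthread_mutex_unlock(&b->mtx);
}

/*
 * "Locked" entry point: the caller must already hold the bus lock.
 * Returns 0 on success, -1 if the lock is not held (where a kernel
 * built with assertions would typically panic instead).
 */
static int
create_path_locked(struct bus_lock *b, int *path_out)
{
	if (!b->held)
		return (-1);
	*path_out = 1;		/* pretend a path object was created */
	return (0);
}

/*
 * "_unlocked" wrapper: safe to call without the lock, because it
 * acquires and releases the lock around the locked entry point.
 */
static int
create_path_unlocked(struct bus_lock *b, int *path_out)
{
	int rc;

	bus_lock_acquire(b);
	rc = create_path_locked(b, path_out);
	bus_lock_release(b);
	return (rc);
}
```

In this model, a callback that runs without the bus lock held would fail in
create_path_locked() but succeed through create_path_unlocked(), which is the
shape of the change described above.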

> >We are currently running a test to see whether the fact that all our
> >shelves are dual-attached (allowing us to use geom multipath) is
> >related, i.e., we have disabled the 2nd HBA, cutting the total
> >number of da & ses devices in half and thus not executing the
> >code in the commit that tracks duplicate ses devices.
> >
> >Note, if we disable both HBA devices and boot the system up it
> >does not panic or print out the repeated messages, but of course
> >we have no disks :-)
> >
> >I am unclear on the "connector 255 other 255" messages and have not
> >taken the time to look into them yet.
> >
> >I would appreciate any insights folks can provide.
> >
> >Many Thanks,
> >John
> >
> >ps: We've had to seriously increase the console buffer size to
> >capture the complete dmesg output...
> >
> >options   MSGBUF_SIZE=(32768*32)
> >
> >Can we delay starting the kernel daemon until after the system
> >is up and /var/log/messages is available?  Just a thought...
> 
> The goal of this code was to create persistent location-dependent names 
> for devices. It may be better to have them earlier.

Yes, I agree.

Ken
-- 
Kenneth Merry
ken at FreeBSD.ORG

