svn commit: r218277 - in stable/7/sys: kern sys

Mon Apr 18 16:56:30 UTC 2011

On Monday, April 18, 2011 7:36:57 am Andre Albsmeier wrote:
> On Fri, 15-Apr-2011 at 18:35:05 +0200, John Baldwin wrote:
> > On Friday, April 15, 2011 9:25:25 am Andre Albsmeier wrote:
> > > On Fri, 04-Feb-2011 at 14:44:59 +0000, John Baldwin wrote:
> > > > Author: jhb
> > > > Date: Fri Feb  4 14:44:59 2011
> > > > New Revision: 218277
> > > > URL: http://svn.freebsd.org/changeset/base/218277
> > > > 
> > > > Log:
> > > >   MFC 217075:
> > > >   Retire PCONFIG and leave the priority of thread0 alone when waiting for
> > > >   interrupt config hooks to execute.
> > > >   
> > > >   To preserve the KBI, I did not renumber priorities but simply removed
> > > >   PCONFIG.
> > > > 
> > > > Modified:
> > > >   stable/7/sys/kern/subr_autoconf.c
> > > >   stable/7/sys/sys/priority.h
> > > > Directory Properties:
> > > >   stable/7/sys/   (props changed)
> > > >   stable/7/sys/cddl/contrib/opensolaris/   (props changed)
> > > >   stable/7/sys/contrib/dev/acpica/   (props changed)
> > > >   stable/7/sys/contrib/pf/   (props changed)
> > > > 
> > > > Modified: stable/7/sys/kern/subr_autoconf.c
> > > > ==============================================================================
> > > > --- stable/7/sys/kern/subr_autoconf.c	Fri Feb  4 14:44:42 2011	(r218276)
> > > > +++ stable/7/sys/kern/subr_autoconf.c	Fri Feb  4 14:44:59 2011	(r218277)
> > > > @@ -108,7 +108,7 @@ run_interrupt_driven_config_hooks(dummy)
> > > >  	warned = 0;
> > > >  	while (!TAILQ_EMPTY(&intr_config_hook_list)) {
> > > >  		if (msleep(&intr_config_hook_list, &intr_config_hook_lock,
> > > > -		    PCONFIG, "conifhk", WARNING_INTERVAL_SECS * hz) ==
> > > > +		    0, "conifhk", WARNING_INTERVAL_SECS * hz) ==
> > > >  		    EWOULDBLOCK) {
> > > >  			mtx_unlock(&intr_config_hook_lock);
> > > >  			warned++;
> > > 
> > > 
> > > This broke several of my machines in a somewhat strange way:
> > > 
> > > After upgrading them (17) to a recent 7-STABLE (as of 2011-04-12)
> > > I noticed that some (4) of them didn't start. All 4 didn't find
> > > their boot device anymore. What they all got in common is:
> > > 
> > > - an Adaptec 2940 Ultra SCSI adapter
> > > - two SCSI harddisks (da0 and da1) of various brands
> > > - one SCSI CDROM drive (cd0)
> > > 
> > > To be exact, none of the three devices (da0, da1, cd0) were
> > > detected at all. Other machines with a similar configuration
> > > (2940 and da0/da1) but _without_ the CDROM drive didn't have
> > > any problems. So I simply removed the CDROM drives on the 4
> > > machines in question and they all booted again.
> > > 
> > > Today I decided to dig into this and after reverting(*) the
> > > above change, they worked with the CDROM again. I have cross-
> > > checked it 3 times. No idea what's happening here...
> > > 
> > > 	-Andre
> > > 
> > > (*) To be honest, I use this patch so I had to modify only one file:
> > > 
> > > --- sys/kern/subr_autoconf.c.ORI	2011-02-05 13:14:11.000000000 +0100
> > > +++ sys/kern/subr_autoconf.c	2011-04-15 14:34:31.000000000 +0200
> > > @@ -108,7 +108,7 @@
> > >  	warned = 0;
> > >  	while (!TAILQ_EMPTY(&intr_config_hook_list)) {
> > >  		if (msleep(&intr_config_hook_list, &intr_config_hook_lock,
> > > -		    0, "conifhk", WARNING_INTERVAL_SECS * hz) ==
> > > +		    PRI_MIN_KERN + 32, "conifhk", WARNING_INTERVAL_SECS * hz) ==
> > >  		    EWOULDBLOCK) {
> > >  			mtx_unlock(&intr_config_hook_lock);
> > >  			warned++;
> > 
> > Do you get any warnings about CAM timeouts, etc. when these probe?  A verbose 
> > dmesg might be nice to look at if possible.
> 
> OK, I have set up a machine for testing. In my other mail
> I was wrong saying that the pass devices appear when using
> the problematic kernel...
> 
> Here are the dmesgs:
> 
> - dmesg_bad is the original kernel as of Friday
> - dmesg_ok is the patched kernel (see above) as of Friday
> - dmesg.diff is the diff between both
> 
> If you want me to try something just tell me...

Hmmm, what if you make SCSI_DELAY larger?  Also, can you let it fail the
mount and drop into ddb and then get 'ps' output?

I think the CAM boot probe is broken a bit.  xpt_rescan_done() always calls
xpt_release_boot(), but we don't hold the boot for each bus added while
buses_config_done is 0, so it seems CAM only waits for at least one bus to
rescan before it lets the boot continue?  This seems wrong (i.e. one would
think it would let all the buses added before this point scan before
continuing).

However, in your dmesg, it starts to print out an announcement for a pass
device before it starts mounting root, so it seems that xpt is finishing too
early somehow.

-- 
John Baldwin