DANGER WILL ROBINSON! SERIOUS problem with current 5.4-PRERELEASE - FURTHER UPDATE

Karl Denninger karl at denninger.net
Tue Apr 5 18:25:17 PDT 2005


On Thu, Mar 31, 2005 at 11:06:08AM -0600, Karl Denninger wrote:
> On Thu, Mar 31, 2005 at 12:02:20PM -0500, Matthew N. Dodd wrote:
> > On Wed, 30 Mar 2005, Karl Denninger wrote:
> > > Removing the FIRST delta, which is:
> > >
> > > 218a219,221
> > >       if (!dumping)
> > >           callout_reset(&request->callout, request->timeout * hz,
> > >                         (timeout_t*)ata_timeout, request);
> > >
> > > appears to get rid of the crashes while not harming data integrity OR the
> > > requeueing.
> > 
> > I'd be interested to know if the attached patch does anything.
> > 
> > -- 
> > 10 40 80 C0 00 FF FF FF FF C0 00 00 00 00 10 AA AA 03 00 00 00 08 00
> > Index: ata-queue.c
> > ===================================================================
> > RCS file: /home/ncvs/src/sys/dev/ata/ata-queue.c,v
> > retrieving revision 1.32.2.6
> > diff -u -u -r1.32.2.6 ata-queue.c
> > --- ata-queue.c	23 Mar 2005 04:50:26 -0000	1.32.2.6
> > +++ ata-queue.c	31 Mar 2005 17:00:46 -0000
> > @@ -217,8 +217,7 @@
> >      }
> >      else {
> >  	if (!dumping)
> > -	    callout_reset(&request->callout, request->timeout * hz,
> > -			  (timeout_t*)ata_timeout, request);
> > +            callout_drain(&request->callout);
> >  	if (request->bio && !(request->flags & ATA_R_TIMEOUT)) {
> >  	    ATA_DEBUG_RQ(request, "finish bio_taskqueue");
> >  	    bio_taskqueue(request->bio, (bio_task_t *)ata_completed, request);
> > 
> 
> It'll be a few hours before I will know on the production machine - the RAID
> array has to rebuild before I can trigger the problem, and we're scheduled
> for some power work here in an hour or so - which I suspect will get in the
> way.
> 
> What do you expect the patch to do, given that removing the delta appears to
> fix the instability problem?

This patch appears to be "safe".

I now have about 2 hours of run time on the production machine since the
rebuild (which had to complete first) with the added "callout_drain" in place.
In that time it has taken two DMA WRITE retries, and I have not yet seen any
evidence of destabilization.

This is good evidence but not proof - before I took out the original
callout_reset line, the FIRST write retry would immediately make the system
unstable.
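
For anyone following along, here is a toy userland model of the difference
between the two versions of the finish path. It is only a sketch under stated
assumptions: it is NOT the kernel callout(9) code or the ATA driver, every
toy_* name is made up for illustration, and the idea that a timeout left
pending against an already-finished request is what destabilizes the system
is my reading, not something confirmed in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only. All toy_* names are hypothetical, not FreeBSD APIs. */
struct toy_callout {
    bool armed;       /* a timeout is pending */
    int  ticks_left;  /* ticks until it fires */
};

struct toy_request {
    struct toy_callout callout;
    bool completed;      /* request has already been finished */
    int  late_timeouts;  /* timeouts that fired after completion */
};

/* Model of "re-arm the timeout": schedule it to fire in `ticks` ticks. */
static void toy_callout_reset(struct toy_callout *c, int ticks)
{
    assert(c != NULL);
    c->armed = true;
    c->ticks_left = ticks;
}

/* Model of "drain the timeout": nothing is left pending afterwards. */
static void toy_callout_drain(struct toy_callout *c)
{
    assert(c != NULL);
    c->armed = false;
    c->ticks_left = 0;
}

/* Advance the clock by one tick and fire the timeout if it expired. */
static void toy_tick(struct toy_request *r)
{
    if (!r->callout.armed)
        return;
    if (--r->callout.ticks_left > 0)
        return;
    r->callout.armed = false;
    if (r->completed)
        r->late_timeouts++;  /* handler ran against a finished request */
}

/* Finish path: either re-arm (old behaviour) or drain (patched). */
static void toy_finish(struct toy_request *r, bool rearm, int timeout_ticks)
{
    if (rearm)
        toy_callout_reset(&r->callout, timeout_ticks);
    else
        toy_callout_drain(&r->callout);
    r->completed = true;
}

/* Returns how many timeouts fired against an already-completed request. */
static int toy_run(bool rearm)
{
    struct toy_request r = {0};

    toy_callout_reset(&r.callout, 5);  /* armed when the request is queued */
    toy_tick(&r);                      /* request completes before expiry */
    toy_finish(&r, rearm, 5);
    for (int i = 0; i < 10; i++)       /* let the clock keep running */
        toy_tick(&r);
    return r.late_timeouts;
}
```

In this model, toy_run(true) leaves one timeout to fire after the request is
done, while toy_run(false) leaves none. That is all it shows; it says nothing
about locking or what the real ata_timeout handler does.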

-- 
Karl Denninger (karl at denninger.net) Internet Consultant & Kids Rights Activist
http://www.denninger.net	My home on the net - links to everything I do!
http://scubaforum.org		Your UNCENSORED place to talk about DIVING!
http://www.spamcuda.net		SPAM FREE mailboxes - FREE FOR A LIMITED TIME!
http://genesis3.blogspot.com	Musings Of A Sentient Mind



