DANGER WILL ROBINSON! SERIOUS problem with current 5.4-PRERELEASE

Karl Denninger karl at denninger.net
Tue Mar 29 21:08:32 PST 2005


On Tue, Mar 29, 2005 at 11:40:48PM -0500, Matthew N. Dodd wrote:
> On Tue, 29 Mar 2005, Karl Denninger wrote:
> >  1.42: When resubmitting a timed out request, reset donecount.
> >  1.41: Reset timeout when we are back from interrupt.
> >  1.40: Correct logical error, result was that retries wasn't always made but
> >        failure reported instead.
> >  1.39: Do not retry on requests that have lost their device during reinit.
> >
> > This change is EXTREMELY DANGEROUS.
> >
> > This change needs to be backed out immediately until it can be determined
> > why a requeued request destabilizes the system.
> 
> The changes in question are very small.  Could you attempt to isolate 
> which one is the cause?
> 
> Thanks.

Pretty sure it's the requeue (i.e. 1.40 and 1.42); I attempted to put this
patch in the system back before it was MFC'd (when it originally showed up in
-HEAD) and it failed in exactly the same way.  The first time it created a
LOT of head-scratching ("how come my serial board has suddenly gone deaf?!")
and it wasn't until it got to where the console wouldn't respond that the
light went on and I said "oh, so THAT's what that patch really does!" :->

That got backed out FAST :-)

I believe the previous version of that file in -STABLE was 1.38, which has
the 'errors don't actually get retried' problem that results in immediate
detaches.  I updated because I noted the commit and figured that the problem
from my last attempt at including this had either been fixed, or that I had
missed some dependency in my earlier attempt.

I have an open PR on the underlying problem (SATA drives on a number of
common configurations returning false errors and detaching when part of a
geom mirror) which I've marked as "serious".  It's at 
http://www.freebsd.org/cgi/query-pr.cgi?pr=77643

There is a comment attached to the PR from another user who has duplicated 
the underlying problem.

Note that back on 3/2/05 I attempted to apply the 1.42 version of this file
to -STABLE and got the same failure, and added that fact to the PR.  I also
reported it here.  It appears that both reports were either missed or ignored 
and this change was committed to -RELENG_5.

I'm not sure if I can cobble up a test machine with the right configuration
of hardware to go through each of the above changes in turn to see if I can
isolate which of the three it is, but I'll give it a shot over the next
couple of days.  I'm 1 SATA disk short of what I need to do this in my 
sandbox.

If I do not trigger the requeue, all appears to be fine.

This is one that IMHO has to either be found and fixed or backed out for the
impending -RELEASE.

-- 
Karl Denninger (karl at denninger.net) Internet Consultant & Kids Rights Activist
http://www.denninger.net	My home on the net - links to everything I do!
http://scubaforum.org		Your UNCENSORED place to talk about DIVING!
http://www.spamcuda.net		SPAM FREE mailboxes - FREE FOR A LIMITED TIME!
http://genesis3.blogspot.com	Musings Of A Sentient Mind




More information about the freebsd-stable mailing list