DANGER WILL ROBINSON! SERIOUS problem with current
karl at denninger.net
Tue Mar 29 21:08:32 PST 2005
On Tue, Mar 29, 2005 at 11:40:48PM -0500, Matthew N. Dodd wrote:
> On Tue, 29 Mar 2005, Karl Denninger wrote:
> > 1.42: When resubmitting a timed out request, reset donecount.
> > 1.41: Reset timeout when we are back from interrupt.
> > 1.40: Correct logical error, result was that retries wasn't always made but
> > failure reported instead.
> > 1.39: Do not retry on requests that have lost their device during reinit.
> > This change is EXTREMELY DANGEROUS.
> > This change needs to be backed out immediately until it can be determined
> > why a requeued request destabilizes the system.
> The changes in question are very small. Could you attempt to isolate
> which one is the cause?
Pretty sure it's the requeue (e.g. 1.40 and 1.42); I attempted to put this
patch in the system back before it was MFC'd (when it originally showed up in
-HEAD) and it failed in exactly the same way. The first time it created a
LOT of head-scratching ("how come my serial board has suddenly gone deaf?!")
and it wasn't until it got to where the console wouldn't respond that the
light went on and I said "oh, so THAT's what that patch really does!" :->
That got backed out FAST :-)
I believe the previous version of that file in -STABLE was 1.38 - that has
the 'errors don't actually get retried' problem that results in immediate
detaches - the reason for the update was that I noted the commit and
figured that the problem from my last attempt with including this had
either been fixed or I had missed some dependency in my earlier attempt.
I have an open PR on the underlying problem (SATA drives on a number of
common configurations returning false errors and detaching when part of a
geom mirror) which I've marked as "serious". It's at
There is a comment attached to the PR from another user who has duplicated
the underlying problem.
Note that back on 3/2/05 I attempted to apply the 1.42 version of this file
to -STABLE and got the same failure, and added that fact to the PR. I also
reported it here. It appears that both reports were either missed or ignored
and this change was committed to -RELENG_5.
I'm not sure if I can cobble up a test machine with the right configuration
of hardware to go through each of the above changes in turn to see if I can
isolate which of the three it is, but I'll give it a shot over the next
couple of days. I'm 1 SATA disk short of what I need to do this in my
If I do not trigger the requeue all appears to be fine.
This is one that IMHO has to either be found and fixed or backed out for the
Karl Denninger (karl at denninger.net) Internet Consultant & Kids Rights Activist
http://www.denninger.net My home on the net - links to everything I do!
http://scubaforum.org Your UNCENSORED place to talk about DIVING!
http://www.spamcuda.net SPAM FREE mailboxes - FREE FOR A LIMITED TIME!
http://genesis3.blogspot.com Musings Of A Sentient Mind