ARRRRGH! Guys, who's breaking -STABLE's GMIRROR code?!

Karl Denninger karl at denninger.net
Sat Sep 9 15:43:19 PDT 2006


On Sat, Sep 09, 2006 at 04:04:40PM -0300, Marc G. Fournier wrote:
> On Sat, 9 Sep 2006, Karl Denninger wrote:
> 
> >Yeah, -STABLE is what you should run if you want stable code, right?
> >
> >C'mon guys.  This sort of thing belies a total lack of concern when 
> >changes are MFC'd into production branches of the code.  This kind of 
> >thing is expected if you're running -CURRENT, but not -STABLE.
> >
> >How long would it have taken to actually test the change and detect this 
> >once it was put in?  All of 30 seconds?
> 
> In this case, I don't know ... but I *do* know that I do hit a fair 
> number of "bugs" that a simple 30 second test won't uncover ... a 
> production box *can* and *will* tend to hit bugs that a test box won't, 
> just because of the randomness of what is running on it ... trust me, I've 
> had my share of headaches over the years, but it doesn't (and won't) deter 
> me from running -STABLE, for the simple fact that if I don't, there is a 
> good chance that those bugs that I do get "lucky" enough to hit won't get 
> hit by anyone else and *someone* had to get it ;)

Well sure, if it's one of those "corner cases" I understand.  This is the price
of not doing FULL regression testing, and expecting that from a free project
is unreasonable.  Hell, you don't get that from Micro$oft, why would anyone
think you'd get it here?

But in this situation it's not a corner case.  I've got a (different) open issue 
on 6.x where it appears that SELECT on serial lines is badly screwed; this may
be specific to the ROCKETPORT cards and it may not - not real sure yet.  I
reported that one recently too, and it's giving me a 5-alarm migraine at the
moment trying to find a workaround that actually functions.  I can't find
anything in the commit logs that would lead me to believe that the ttyio 
code has changed in a way that should have caused this, and the driver 
hasn't been updated either.  That's a head-scratcher for a whole host
of reasons with the first one being that I don't have the first clue 
where to look for the source of trouble (to use a pun.)

It's not as simple as "serial I/O doesn't work at all"; it appears to be
specific to using VMIN, non-blocking I/O and select() to handle multiple
sources of input coming into a single thread.  Now how often do people do
this?  I dunno..... but what I do know is that the common "single thread"
application works fine on the same port....
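
For anyone who wants to try reproducing this, the pattern in question looks
roughly like the sketch below.  It is written in Python for brevity and is
not taken from the actual application; the helper names (set_nonblocking,
set_raw_vmin, poll_sources) are made up for illustration, and it assumes a
POSIX system with the standard termios, fcntl and select modules.

```python
import fcntl
import os
import select
import termios


def set_nonblocking(fd):
    """Add O_NONBLOCK to the descriptor's file status flags."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFL)
    fcntl.fcntl(fd, fcntl.F_SETFL, flags | os.O_NONBLOCK)


def set_raw_vmin(fd, vmin=1, vtime=0):
    """Put a tty into non-canonical mode with the given VMIN/VTIME.

    Only meaningful for terminal devices such as serial ports;
    tcgetattr() fails on pipes and regular files.
    """
    attrs = termios.tcgetattr(fd)
    attrs[3] &= ~(termios.ICANON | termios.ECHO)  # index 3 is lflag
    attrs[6][termios.VMIN] = vmin                 # index 6 is the cc array
    attrs[6][termios.VTIME] = vtime
    termios.tcsetattr(fd, termios.TCSANOW, attrs)


def poll_sources(fds, timeout):
    """Wait once in select() and read from every descriptor it reports.

    Returns {fd: bytes} for each descriptor that actually produced data.
    """
    readable, _, _ = select.select(fds, [], [], timeout)
    results = {}
    for fd in readable:
        try:
            data = os.read(fd, 4096)
        except BlockingIOError:
            # select() reported the fd readable but read() would block;
            # a select()/VMIN disagreement would surface here.
            continue
        if data:
            results[fd] = data
    return results
```

In a single thread, each serial descriptor would go through set_raw_vmin()
and set_nonblocking() once after it is opened, and poll_sources() would then
be called in a loop with all of them.  Device paths are left out on purpose
since they depend on the card and driver.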

This is different.  We're talking about the very basic functionality of 
the gmirror system - to be able to rebuild a disk that is out of sync.

In this case my "notice" of the problem came in the form of a production
machine that went down overnight - apparently during an
attempt to back itself up using that functionality.  It went down HARD
and corrupted the root partition directory structure badly enough to prevent
fsck from being able to rebuild it on an automated restart attempt.  What was
worse, the bug caused the system to block in I/O permanently: when it came
back up from the crash it of course tried to resync the out-of-date
providers, making the reboot hang!  So what I had was a production machine 
that couldn't be brought back up without significant "wizardry" at the 
physical console, and frankly, what it LOOKED LIKE at first blush was a
<double> disk failure - one of those "that's not supposed to happen" things.

I was very close to putting the day-old backup disk online - I'm darn glad I
didn't, because the bug would have likely trashed THAT one too, and then I'd
be both a day back on the data AND have an unstable system!

Not good, especially when the commit log on the last delta to the gmirror code
was basically "removed uses of the F-word in comments; we're nice people".

Uh, obviously not.

The obvious question is how does the protocol for committing changes to
-STABLE work if the committer isn't required to first test the basic function
set of the module he/she modifies, on -STABLE, before those changes are MFC'd 
back into the -STABLE tree?

I see that the (actual) code changes were backed out (apparently yesterday)
and I've rebuilt the kernel with those, which has put the immediate fire out,
but this is one of those instances where the usual "check and balance" process
that is <ADVERTISED> as being present in -STABLE failed badly, and it failed
simply due to a lack of checking at all!

-- 
Karl Denninger (karl at denninger.net) Internet Consultant & Kids Rights Activist
http://www.denninger.net	My home on the net - links to everything I do!
http://scubaforum.org		Your UNCENSORED place to talk about DIVING!
http://genesis3.blogspot.com	Musings Of A Sentient Mind



