make -j as a stress test (was: Re: Quality of FreeBSD) [WARNING - 6.0-BETA1 still hosed!]

Karl Denninger karl at denninger.net
Sat Jul 23 04:16:47 GMT 2005


On Fri, Jul 22, 2005 at 07:53:00PM -0700, Danny Howard wrote:
> On Fri, Jul 22, 2005 at 02:53:57PM -0500, Karl Denninger wrote:
> [...]
> > Note carefully from this that there is NO ERROR INDICATION AS TO WHY THE
> > DISK DETACHED!
> > 
> > At least with the 5.x problems you'd SEE an error before it went BOOM.
> > 
> > This time around, nope - just death.
> > 
> > What's worse, the complaints continue even through a shutdown ...
> 
> While I agree with Karl that introducing instability is a very bad
> thing, I guess we now have an answer to Karl's vexation yesterday:
> [ http://lists.freebsd.org/pipermail/freebsd-stable/2005-July/017210.html ]
> 
>      "What I don't understand Robert is why Soren's code is "too
>      sensitive" to commit, but the explosive reduction in stability
>      that the changes made between 4.x and 5.3 caused weren't
>      enough to back THAT out until it could be fixed."
> 
> The answer would seem to be that when someone actually does test the
> untested code, it is even worse than the code we are already upset with.
> :)
> 
> Love,
> -danny

Point taken.

Can we get a <COMMITMENT> from the development team that 6.x will <NOT> go 
out the door until this problem is identified and FIXED (e.g. the PR I
submitted against this early in the year is closed)?

The problem is trivially easy to reproduce, as I've pointed out.  My 
hardware is hardly anything special - it's a Dell PowerEdge 400SC, a 
rather pedestrian 2.4GHz P4/HT machine with 512MB of RAM and nothing 
special in terms of boards in it.  Indeed, on the sandbox machine the 
ONLY cards in the machine are the Adaptec SATA card and a video board!

The ICH SATA onboard adapter works fine.  No problems, even if you beat
the snot out of the disks.  Ditto for the onboard PATA channels.

ANY PCI SII-chipset SATA card (nothing fancy here, no onboard RAID,
just a disk adapter) that I've tried thus far - Bustek or Adaptec - causes 
trouble in an absolutely reproducible fashion when put under heavy load.  

If both channels are in use the trouble is immediate and dramatic, although 
you CAN provoke errors even with only one of the two channels in operation
if you can get the I/O load up high enough.

Gmirror is great for provoking this as it queues traffic to both channels
in a nicely balanced and heavily-utilized fashion, although I'm willing to
bet that Gmirror itself is not involved as the actual cause of the
problem, since I had trouble once DURING install (before I had put a
gmirror'ed config on the disks.)  

Note that a MIX of reads and writes appears to be required - a REBUILD of
the disks by Gmirror (which is all writes to those two disks) succeeds.
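
A minimal sketch of a mixed read/write load, for illustration only. It is
not the reproduction described in this message (that was gmirror plus a
buildworld). The function name mixed_io_stress and its parameters are
hypothetical, and it assumes the directory passed in lives on the disks
under test. Small files are likely to be read back from the buffer cache,
so file_size would have to be large relative to RAM before the reads
actually reach the disks.

```python
import os
import tempfile
import threading


def mixed_io_stress(directory, workers=4, file_size=1 << 20, rounds=4):
    """Interleave writes and read-backs from several threads.

    Hypothetical helper: returns (bytes_written, bytes_read, errors),
    where errors is a list of (path, message) tuples.
    """
    lock = threading.Lock()
    totals = {"written": 0, "read": 0}
    errors = []

    def worker(index):
        path = os.path.join(directory, "stress-%d.dat" % index)
        payload = bytes([index % 256]) * file_size
        try:
            for _ in range(rounds):
                # Write phase: push the payload out and force it to disk.
                with open(path, "wb") as f:
                    f.write(payload)
                    f.flush()
                    os.fsync(f.fileno())
                # Read phase: read it back and compare.
                with open(path, "rb") as f:
                    data = f.read()
                if data != payload:
                    raise OSError("read-back mismatch on " + path)
                with lock:
                    totals["written"] += len(payload)
                    totals["read"] += len(data)
        except OSError as exc:
            # Record the failure instead of dying silently.
            with lock:
                errors.append((path, str(exc)))
        finally:
            try:
                os.remove(path)
            except OSError:
                pass

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return totals["written"], totals["read"], errors


if __name__ == "__main__":
    # A temporary directory stands in for a mount point on the test disks.
    with tempfile.TemporaryDirectory() as scratch:
        written, read, errors = mixed_io_stress(scratch)
        print("written:", written, "read:", read, "errors:", len(errors))
```

Each worker alternates a synced write with a read of the same file, so
several of them together keep both kinds of request in flight at once,
which is the pattern a pure rebuild lacks.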

As soon as you have all three subdisks in the array, however, a 
"make buildworld" produces fireworks.

If necessary (or useful) I can give one or more developers a way to log 
into the sandbox machine here via ssh.  I do not have a way to get a
serial console on the box, however, so if it's blown up in an unrecoverable
fashion remotely someone would have to call or IM me to push the big red 
button.

If that's NOT necessary (or desired), then I want to move those two disks 
back to the production machine as they are how my offsite/offline backups
are done - I've no problem with leaving them on the sandbox IF the problem
is being actively worked though.

-- 
Karl Denninger (karl at denninger.net) Internet Consultant & Kids Rights Activist
http://www.denninger.net	My home on the net - links to everything I do!
http://scubaforum.org		Your UNCENSORED place to talk about DIVING!
http://homecuda.com		Emerald Coast: Buy / sell homes, cars, boats!
http://genesis3.blogspot.com	Musings Of A Sentient Mind



