make -j as a stress test (was: Re: Quality of FreeBSD) [WARNING - 6.0-BETA1 still hosed!]
karl at denninger.net
Sat Jul 23 04:16:47 GMT 2005
On Fri, Jul 22, 2005 at 07:53:00PM -0700, Danny Howard wrote:
> On Fri, Jul 22, 2005 at 02:53:57PM -0500, Karl Denninger wrote:
> > Note carefully from this that there is NO ERROR INDICATION AS TO WHY THE
> > DISK DETACHED!
> > At least with the 5.x problems you'd SEE an error before it went BOOM.
> > This time around, nope - just death.
> > What's worse, the complaints continue even through a shutdown ...
> While I agree with Karl that introducing instability is a very bad
> thing, I guess we now have an answer to Karl's vexation yesterday:
> [ http://lists.freebsd.org/pipermail/freebsd-stable/2005-July/017210.html ]
> "What I don't understand Robert is why Soren's code is "too
> sensitive" to commit, but the explosive reduction in stability
> that the changes made between 4.x and 5.3 caused weren't
> enough to back THAT out until it could be fixed."
> The answer would seem to be that when someone actually does test the
> untested code, it is even worse than the code we are already upset with.
Can we get a <COMMITMENT> from the development team that 6.x will <NOT> go
out the door until this problem is identified and FIXED (e.g. the PR I
submitted against this early in the year is closed)?
The problem is trivially easy to reproduce, as I've pointed out. My
hardware is hardly anything special - it's a Dell PowerEdge 400SC, a
rather pedestrian 2.4GHz P4/HT machine with 512MB of RAM and nothing
special in terms of boards in it. Indeed, on the sandbox machine the
ONLY cards in the machine are the Adaptec SATA card and a video board!
The ICH SATA onboard adapter works fine. No problems, even if you beat
the snot out of the disks. Ditto for the onboard PATA channels.
ANY PCI SII-chipset SATA card (nothing fancy here, no onboard RAID,
just a disk adapter) that I've tried thus far - Bustek or Adaptec - causes
trouble in an absolutely reproducible fashion when put under heavy load.
If both channels are in use the trouble is immediate and dramatic, although
you CAN provoke errors even with only one of the two channels in operation
if you can get the I/O load up high enough.
Gmirror is great for provoking this as it queues traffic to both channels
in a nicely balanced and heavily-utilized fashion, although I'm willing to
bet that Gmirror itself is not involved as the actual cause of the
problem, since I had trouble once DURING install (before I had put a
gmirror'ed config on the disks.)
Note that a MIX of reads and writes appears to be required - a REBUILD of
the disks by Gmirror (which is all writes to those two disks) succeeds.
As soon as you have all three subdisks in the array, however, a
"make buildworld" produces fireworks.
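For anyone without a source tree handy, a smaller mixed read/write load can
be sketched as below. This is an illustration only, not the test described
above: the function names, block size, and file size are assumptions, and
because reads will often be served from the buffer cache it is a much gentler
load than a real buildworld on a gmirror'ed array.

```python
import os
import random
import threading
import time


def hammer(path, seconds, file_size=1 << 20, block=64 * 1024, seed=0):
    """Issue interleaved random reads and writes against one file.

    Returns a dict of counters; any OSError is counted, not raised.
    """
    rng = random.Random(seed)
    stats = {"reads": 0, "writes": 0, "errors": 0}
    payload = bytes(block)
    # Pre-size the file so that reads have something to hit.
    with open(path, "wb") as f:
        f.truncate(file_size)
    fd = os.open(path, os.O_RDWR)
    deadline = time.monotonic() + seconds
    try:
        while time.monotonic() < deadline:
            offset = rng.randrange(0, file_size - block + 1)
            try:
                os.lseek(fd, offset, os.SEEK_SET)
                if rng.random() < 0.5:
                    os.read(fd, block)
                    stats["reads"] += 1
                else:
                    os.write(fd, payload)
                    os.fsync(fd)  # ask the OS to push the write to the device
                    stats["writes"] += 1
            except OSError:
                stats["errors"] += 1
    finally:
        os.close(fd)
    return stats


def mixed_load(paths, seconds):
    """Run one hammer() thread per path so all targets are busy at once."""
    results = {}

    def worker(index, path):
        results[path] = hammer(path, seconds, seed=index)

    threads = [
        threading.Thread(target=worker, args=(i, p))
        for i, p in enumerate(paths)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Pointing it at one scratch file per disk (hypothetical paths), e.g.
mixed_load(["/mnt/disk0/stress.bin", "/mnt/disk1/stress.bin"], 60), keeps
both targets busy with reads and writes at the same time.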
If necessary (or useful) I can give one or more developers a way to log
into the sandbox machine here via ssh. I do not have a way to get a
serial console on the box, however, so if it's blown up in an unrecoverable
fashion remotely someone would have to call or IM me to push the big red
button.
If that's NOT necessary (or desired), then I want to move those two disks
back to the production machine as they are how my offsite/offline backups
are done - I've no problem with leaving them on the sandbox IF the problem
is being actively worked on, though.
Karl Denninger (karl at denninger.net) Internet Consultant & Kids Rights Activist
http://www.denninger.net My home on the net - links to everything I do!
http://scubaforum.org Your UNCENSORED place to talk about DIVING!
http://homecuda.com Emerald Coast: Buy / sell homes, cars, boats!
http://genesis3.blogspot.com Musings Of A Sentient Mind