Quality of FreeBSD

Thu Jul 21 19:26:34 GMT 2005

Ok, Robert, but then here's the question....

How come the ATA code which was very stable in 4.x was screwed with in a
production release, breaking it, with no path backwards to the working
code?

This is a perfectly valid thing to do in -HEAD, where it's "heh, you know
this might go BOOM on you!"  I've been told that before when reporting
problems with -HEAD, and while I might not have liked hearing it, it's a
valid point of view.

But the same thing in a production release is an entirely different
matter, especially when it impacts MAINSTREAM hardware (the SII chipset 
is EXTREMELY common among SATA implementations, being on basically ALL
PCI plug-in boards, with Hitachi and Maxtor being hardly "uncommon" disks!)

I originally thought perhaps this was a Maxtor problem, given my past
history with them playing a bit "fast and loose" with the rules.  However,
when I replicated the problem on my Hitachi Deskstar drives that theory 
went out the window.

I understand your dissertation below, and agree with it.  However, this is
a case where code was tampered with in ways that broke things for a LOT of
people, myself included, on a PRODUCTION release, and was let loose with
inadequate testing.

It is NOT a situation where obscure, little-used hardware becomes
obsolete and thus ignored - eventually falling into ruin.  This is a
situation where current, in-service hardware on literally millions of
machines becomes suddenly unstable to unusable entirely with FreeBSD.

I understand and expect that if I run -HEAD, I'm asking for it.  I used to
do this on a fairly regular basis ANYWAY, since there were features I
NEEDED in certain environments, and while I did bitch from time to time,
and worked to find solutions when I could, in general this was an "ok"
path for me, with my own personal resources dedicated to testing and
evaluation on the specific hardware which I needed to use.

This is different.  The ATA problems are neither rare nor difficult to
reproduce.  Indeed, on the PR I opened, I can take any of the SATA drives
I have (from two different manufacturers - Hitachi and Maxtor), put them
on ANY adapter using the most common (SII) chipset (Adaptec's and Bustek's
both tested) and get the same results - DMA errors when under any
significant load.  

It is trivially easy to reproduce the problem.

I came up with a patch to prevent the disconnects on a mirrored drive 
(but not the errors themselves) which then led to requests that I test 
a bunch of related patches - a request I begrudgingly complied with.  

Why begrudging?  Because the patch contemplated didn't address the problem
- it papered over it.  Now the errors still come, but they don't detach
the disk.  They <DO> severely impact performance though, and for
non-mirrored configurations the results might be data loss instead of a
complaint.  Since data corruption in these circumstances is very difficult
to detect until it has become catastrophic, I'm not about to attempt to 
provoke it on a production machine (which is likely the only way I could
identify WITH CERTAINTY that corruption has taken place.)

So what's going on here, Robert?  The PR I filed is still open, and it was
filed on 2/17!  Last activity is from April 4th.  I first noted the issue
on 1/31 and, seeing no sign of any real resolution in the codebase going
forward, I filed the PR on 2/17 after exhausting my own internal testing
and remedy process.

It is now the middle of July, the ticket is still open, and there is no
path out of this box that I can see.

I understand that there is concern that while ATA-GenX might fix this, it
might also break other things, and thus there is reluctance to MFC it back
into 5.x.  

That's a valid concern, but IMHO it misses the larger point.  

The question unaddressed is why the STABLE code in 4.x was abandoned before 
it was known that the replacement was <AT LEAST> as good as that which it
replaced!

This isn't a "gnat" - it was submitted as "serious", and I meant that 
when I submitted it.  The only reason I didn't consider it "critical" and
"high" priority is that it doesn't hit EVERY configuration - but if it
hits yours, your system is severely impacted.

As things stand right now I'm not even sure WHAT codeset I can CVSUP and
test to have a decent shot at getting a FULLY working ATA/gmirror 
implementation.

-- 
Karl Denninger (karl at denninger.net) Internet Consultant & Kids Rights Activist
http://www.denninger.net	My home on the net - links to everything I do!
http://scubaforum.org		Your UNCENSORED place to talk about DIVING!
http://homecuda.com		Emerald Coast: Buy / sell homes, cars, boats!
http://genesis3.blogspot.com	Musings Of A Sentient Mind

On Thu, Jul 21, 2005 at 08:00:40PM +0100, Robert Watson wrote:
> 
> On Thu, 21 Jul 2005, Alexey Yakimovich wrote:
> 
> >First of all thank you very much all for your replies.
> >I just want to add some comments based on previous mails.
> >
> >- I completely agree with MikeM - any kind of complex software could be 
> >tested with properly prepared test cases, especially if they are going 
> >to be reused in the next release;
> 
> The trick is balancing the investment of time in different areas, and 
> motivating people to do the things that aren't enjoyable, don't receive 
> much appreciation, etc. Testing is both difficult and time-consuming.  It 
> works best when people are willing to dedicate all or most of their time 
> to the task, since it requires the building of frameworks, the regular 
> application of those tests, etc.  People who step forward to work 
> consistently on testing and bug reporting, like Peter Holm, do the 
> project an invaluable service.  And people like Marc Olzheim who take the 
> time to evaluate the system thoroughly, work through the bug report and 
> fix cycle, and have the patience to deal with situations where there 
> aren't enough hours in the day to fix a problem make it all worthwhile.  
> It's easy to say that more testing should be done, but testing requires 
> as much expertise in the internals of a piece of software as writing it, 
> and far more time.
> 
> >- if those problems happened to the 5 branch, they would probably happen 
> >again for 6 or 7, so why do I have to switch to 6 right now? Is it 
> >because 5 will never be fixed? Does the word "production" mean something 
> >to the FreeBSD project now?
> 
> As has been discussed extensively in this thread and other threads, the 
> FreeBSD development model typically addresses change at the tree HEAD, 
> where the changes are tested and evaluated, and then they are 
> back-ported. Some changes are low-risk, and are backported quickly (minor 
> locking fixes, error handling, etc).  Others are higher risk, and are 
> backported only when they are felt to have received sufficient testing 
> (driver re-writes, structural changes).  Other changes are considered too 
> large to ever be backported, as you might as well move the users 
> forward as it will be less work and come to much the same thing (major 
> architectural changes, such as SMPng, new hardware platforms, new kernel 
> subsystems). I can't promise that every fix in HEAD (7.x) or the upcoming 
> 6-STABLE branch will make it to 5-STABLE, because many of the changes 
> there won't be appropriate for a backport, or would take so much work to 
> backport that the time is better spent on other tasks.  However, the hope 
> is to bring as many changes as is sensible back.
> 
> As we've already discussed, there are several important improvements 
> germinating in 6.x, and many of them will be things that can and will be 
> backported.  If you look at the network stack differences between 5.x and 
> 6.x, you'll find very few, because I and others have worked to 
> aggressively merge fixes, usually on a time lag of between one week and 
> one month.  I know this is also true in other areas of the system.  If 
> you're aware of changes that fix something in 6.x or 7.x that haven't 
> been backported, and it's been over a month, please contact the developer 
> to ask about a backport.
> 
> >- I remember that some time ago you could stay on current all the time 
> >without worrying that your box had crashed and not auto-rebooted;
> 
> Certainly.  I also remember long periods of time where you didn't want to 
> be running current unless you were a VM kernel hacker, such as leading up 
> to the 3.x release cycle, or just after the introduction of background 
> fsck in 5.x.  The 6.x/7.x HEAD branches have been quite on the stable 
> side compared to the 3.x and 5.x development cycle, and my hope is they 
> will remain that way.
> 
> >- cheap hardware was always in use by FreeBSD, as far as I remember, or 
> >has something changed recently, especially in the US, with people buying 
> >only expensive hardware? Probably it is no longer important to support 
> >cheap hardware because more important FreeBSD clients like Yahoo or 
> >Apple use real hardware, not the stupid kind like ATA, and they have 
> >these "aggressive" project schedules. Believe me I know what an 
> >"aggressive" project schedule means, with a long, long list of new 
> >features. It is important for companies like Yahoo only and I know why: 
> >it's easier to sell a useless product with lots of new features than a 
> >stable product with few. For a regular guy it is better to have a stable 
> >system running all the time and doing real work (development or 
> >providing some service) than to be rebooting the box because of some new 
> >fancy feature. It's getting close to Windows right now.
> 
> All software development involves the balancing of risks and benefits. 
> That's one of the reasons why the FreeBSD Project offers several 
> development branches, which allow users to balance new features and long 
> running "stale" source code.  Notice that we'll be supporting the 4.x 
> branch for several years to come.  Of course, if you run 4.x, you won't 
> be getting many new features, but it's a quite valid option.  And 
> likewise, you won't be able to run properly on the newest hardware, 
> because running on new hardware requires significant architectural 
> changes, such as the introduction of ACPI, rewrites of device driver 
> frameworks, new file systems, and so on.
> 
> >- IBM, Yahoo, Intel, Apple ..., those guys are smart, having millions of 
> >unpaid open source developers working for them. The problem is that some 
> >day those projects will have their own "aggressive" project schedules, 
> >then will disappear or change to .com. So make sure you are still doing 
> >what you like to do and you are having fun with it.
> 
> I think you'll find many FreeBSD developers enjoy working on FreeBSD best 
> when they receive constructive feedback on the work they do, consisting 
> of thanks when it works, and helpful bug reports when it doesn't.  Some 
> FreeBSD developers live to write new features; others live to get things 
> working "just right", answer questions on mailing lists, or give talks at 
> conferences.  If the balance doesn't seem right, that means there's room 
> for new developers who want to work on the areas that don't get enough 
> attention.  :-)
> 
> Robert N M Watson
> >
> >Thanks,
> >Alexey
> >
> >>-----Original Message-----
> >>From: Robert Watson [mailto:rwatson at FreeBSD.org]
> >>Sent: Thursday, July 21, 2005 5:21 AM
> >>To: Marc Olzheim
> >>Cc: Alexey Yakimovich; freebsd-stable at FreeBSD.org
> >>Subject: Re: Quality of FreeBSD
> >>
> >>
> >>On Thu, 21 Jul 2005, Marc Olzheim wrote:
> >>
> >>>Indeed. That's why my company started taking FreeBSD 5.3 in use for
> >>>production servers when it was out. Since then numerous bugs were fixed,
> >>>some of which reported by us. Now that we're X bug fixes later in time
> >>>and started to get a good feeling about the number of open problems, it
> >>>is extremely annoying to hear the "This will (probably) not be fixed in
> >>>5.x" statements. That conflicts with 'gradually get resolved'. What do
> >>>you recommend larger consumers to do ? Keep using FreeBSD 4 and start
> >>>testing FreeBSD 6.x, dropping 5.x altogether ?
> >>>
> >>>I know FreeBSD 5 was a strange exception in the release scheduling and
> >>>that a lot has been learned from it for the future and I'm certainly not
> >>>unthankful for all the work that's done, but I'd like a clear answer on
> >>>what to do now in regard to taking FreeBSD 5 into 'real' production...
> >>
> >>Marc,
> >>
> >>I should start out by saying I appreciate your clear and concise bug
> >>reports, and the list of your company's show-stopper 5.x bugs has made the
> >>rounds among FreeBSD developers.  I'm happy that at least one of the
> >>issues on the list was fixed by me. :-)  As you probably saw yesterday,
> >>I've started bugging Poul-Henning to look at the pty problem you're
> >>experiencing, and will get that on our 6.0 release show-stopper list.  I
> >>haven't yet had a chance to reproduce it locally, but it sounds like that
> >>should be straightforward.
> >>
> >>FreeBSD 5 has been an exception -- "normally", in as much as major
> >>releases have a "normal", the set of new features is a lot less aggressive,
> >>and it has been our goal with 6.x to restore the expectation of a more
> >>rapid release cycle with a less aggressive feature set.  This should reduce
> >>the number of problems by virtue of reducing the level of change.  It
> >>should also make it easier for users to pick what version to run on, as
> >>the amount of adaptation they have to do to slide forward a version will
> >>be greatly reduced.  I.e., right now it's relatively easy to move back and
> >>forward between 5.x and 6.x.
> >>
> >>With respect to 5.x vs 6.x upgrades: I've seen companies take two
> >>different strategies.  Most of them have been at least experimenting with
> >>deploying 5.x, and are very interested in its feature set.  Support for
> >>large file systems, 64-bit support on newer AMD and Intel hardware,
> >>improved PAM support, etc.  Some of my customers are specifically
> >>interested in the support for mandatory access control, but that's
> >>obviously a less common feature request :-).  The biggest determining
> >>factor for companies today comes from their own product schedule, since
> >>most big consumers of FreeBSD treat it as a component in a "product" they
> >>deliver for others.
> >>
> >>For example, my understanding is that Yahoo is now deploying 6.0 betas
> >>across their server environment with great success, but was actually
> >>unable to seriously deploy 5.x because their goal was to support full
> >>32-bit compatibility on 64-bit amd/intel hardware, which has only recently
> >>reached the level of maturity they require.  In fact, you'll notice if you
> >>follow FreeBSD commit logs that much of that support has come from Yahoo!.
> >>Since 6.x is maturing in pretty good synch with their deployment timeline
> >>for 5.x, they are actually deploying 6.x.  Of course, Yahoo! has a team of
> >>in-house OS developers who adapt FreeBSD for their needs, and is quite
> >>capable of debugging a kernel or two if they run into problems.
> >>
> >>The ATA driver issue is a sticky one for many users -- we hope to get the
> >>6.x ATA code back into 5.x in the next 5.x release.  However, hard-earned
> >>experience tells us that ATA driver code is notoriously difficult to get
> >>right across the broad range of available hardware.  Soren has been
> >>lobbying to get it merged to 5.x, but given the level of testing performed
> >>so far, we can't yet justify the merge.  My hope is that with 6.0 out the
> >>door and a lot of testing of that code, we can get it merged back to 5.x
> >>before 5.5.  Many other fixes have gone into 5.x, correcting many of the
> >>most significant issues.  If you compare 5.4 with 5.3, you'll find that in
> >>most cases, it's both faster and more stable.
> >>
> >>The tty issue is a sticky one also.  The tty code in 6.x has been
> >>substantially rewritten to better support the SMPng environment.  Because
> >>the tty code "plugs in" to a number of device drivers, T1 adapter drivers,
> >>etc, changing the tty interfaces is a fairly big event, and will affect
> >>third party vendors like Cronyx.  This code has also not yet seen as wide
> >>deployment as I'd like, so it's also something that really isn't
> >>appropriate for an MFC immediately.  However, once it has seen significant
> >>6.0 deployment, it may well be.  A question then will be whether it's
> >>better to simply say "you're better off making the jump to 6.x, which is
> >>minor" than backporting, and it's something we can't really answer until
> >>we're comfortable that it's seen sufficient deployment.  My hope is that
> >>we can identify a workaround for 5.x that will avoid the code upheaval a
> >>full backport would require.  It's not as ideal as having the "right" fix,
> >>but it would stop the panics.  I need to ping phk and some of the other
> >>tty-centric folk to look at this some more.
> >>
> >>In terms of advice:
> >>
> >>If you have a "product" due out more than 3 months from now, I think 6.x
> >>is the obvious way to go: you want to be ahead of the curve so that you
> >>can have the foundation for your product in sync with the FreeBSD
> >>production release cycle, and avoid jumping major releases early in the
> >>product life cycle.  6.x has significant performance and stability
> >>improvements -- performance especially in the area of file system
> >>performance on SMP, preemption, network stack, and memory management, and
> >>stability especially in the area of tty support.  By "product", I mean a
> >>range of things: the OS foundation of an embedded product such as a
> >>firewall or storage appliance, or deployment of an internal product, such
> >>as a virtual server product at an ISP.
> >>
> >>On the other hand, if you're deploying today, I think that unless you're
> >>prepared to deal with the 6.0 bug fix cycle (both the BETA/RC cycle, and
> >>the inevitable post-release fixes for a .0 release), 5.4+patches or
> >>5-STABLE is the right place to sit.  At least two of the critical bugs on
> >>your list were fixed in 5-STABLE after the release of 5.4, so for some,
> >>5-STABLE is the best place to be.  We've opted not to do a patch/errata
> >>update for 5.4 for the socket error you were receiving on the basis that
> >>it doesn't affect a wide audience and doesn't correct a "Critical" failure
> >>-- i.e., a crash or the like, unlike some of the NFS server fixes, for
> >>which we did do an errata fix.
> >>
> >>From the perspective of the FreeBSD developers, if you can tolerate the
> >>6.x release process, we encourage you to jump on that bandwagon.  It will
> >>help us release a better 6.0, and that's where the future lies.  Our goal
> >>is to make 6.x a pretty seamless upgrade from 5.x, as it has a less
> >>aggressive feature set, and far fewer user-visible changes (i.e., no
> >>conversion to OpenPAM, devfs, UFS2, large compiler version upgrade, ... as
> >>in 5.x).  When I upgraded my personal web/shell server to 6.x from 5.x
> >>last week, I didn't have to change any configuration in /etc at all, other
> >>than a painless pass through mergemaster to merge the _dhcp user and
> >>group.  As always, we look to the freebsd-stable users to help us test new
> >>features ahead of the release.
> >>
> >>Thanks,
> >>
> >>Robert N M Watson
> >>
> >
> >
