quota deadlock on 6.1-RC1

Wed May 3 23:38:46 UTC 2006

Robert Watson wrote:
> 
> On Tue, 2 May 2006, Kris Kennaway wrote:
> 
>>>>> Ditto, same thing with the recent nve fixes. Why release known broken
>>>>> code when there are tested patches available? What's the worst that
>>>>> will happen? It won't work? That's already the case...
> 
> <...>
> 
>>
>> OK, I can't speak to that issue specifically.
>>
>> Generally, though, the worst that can happen is "you fix one problem
>> affecting a subset of users and replace it with a larger problem
>> affecting a larger subset of users".
>>
>> If there's doubt about the impact of a change, 10 seconds before the
>> release is not the appropriate time to cram it in.
> 
> <...>
> 
> I just want to comment a bit on this issue, because I've seen a number
> of posts on FreeBSD mailing lists over the last few years that suggest
> that there may be some misunderstandings about software development and
> release processes.
> 
> The invariant that needs to be understood is that all software is buggy;
> arguments have been made that the number of bugs increases linearly with
> code size, and there have also been arguments made that the number of
> bugs increases with code complexity, so you can see a non-linear
> increase in bugs with code growth.  This means that you're talking about
> several bugs per thousand lines of code in most software, and for code
> that contains millions of lines of code (such as the FreeBSD kernel,
> Linux kernel, Apache, PHP, MySQL, PostgreSQL, Windows, Word, iTunes,
> etc), you're talking thousands or tens of thousands of bugs.  And that's
> in a static version of the code, not even taking into account new
> features in an active code base that are still being "debugged"!
> 
> Bugs fall into a lot of different categories, but from the perspective
> of risk management, it's useful to think of them in two categories:
> latent bugs, which are unreported, unobserved, or occur only in
> exceptional or generally untriggered circumstances, and non-latent bugs,
> which have been reported, are triggered in practice, etc.  The tricky
> ones are the latent bugs, because you may not know that they are there,
> or you may know that they are there but they trigger so infrequently or in
> such unusual edge cases that they almost might as well not be there.
> 
> Release engineering is really about two things: structuring/nurturing
> the process of developing releases (tracking issues, identifying people
> to fix them, testing, branch management, building, etc), and risk
> management.  The risk management aspect is that you want to improve the
> quality of the release by taking actions, typically adopting source
> changes, which may improve testing results.  Each change potentially
> affects both visible and latent bugs.  Bug fixes in one piece of code
> may change the timing of the code, the side effects, undocumented
> assumptions, or simply allow access to code previously not executed
> because the bug prevented it.  If you allow a bug fix into the tree, you
> risk uncovering new bugs.  So the choice isn't "Accept a bug fix or
> not", it's "Will accepting this bug fix generally improve or reduce
> quality of the release" -- i.e., will the change fix the bug it is
> claimed to fix, and will it result in lots of latent bugs suddenly
> becoming visible.
> 
> Particularly with hardware drivers like nve, this is non-trivial, because
> the behavior of the hardware is very subtle, there's lots of variety in
> the shipped hardware, and the vendor is (or appears) highly
> unsupportive.  The result is that if you tweak a register or minor piece
> of behavior, it may dramatically improve support for a particular piece of
> hardware, but break all the rest.  The only way to mitigate this risk is
> through extensive testing, and extensive testing takes a lot of time. 
> And by a lot of time, I mean, a long release cycle.  So if we want to
> adopt a fix that is high risk -- i.e., one believed to interact in
> subtle ways that affect different machines differently -- we need to
> make the change early in the release cycle, not at the end.  If we make
> it at the end, we are shipping code that is effectively untested on a
> large number of systems.  Sure, it will fix one, but if it breaks the
> rest, is it worth it?  The only alternative is to restart the testing
> process, which in the case of high-risk drivers, means adding months to
> the release cycle.
> 
> And you can see where this is leading: if you significantly delay the
> release cycle for each minor bug, you will never release.  At some
> point, you have to make the decision "although this release isn't
> perfect, we'll never release if we don't ship now".  I know that sounds
> like a bad thing, but you'll find that that practice is not only found
> in every part of the software industry, but it's also impossible to
> avoid, since bug-free software is impossible to achieve.
> 
> When you look at the RC2 release notes Scott recently sent, he
> identifies four bugs that he believes won't be fixed in time for the
> release.  He decided that this was the case using risk management: each
> bug actually likely represents several bugs with the same features, in
> highly complex code.  This means that they will take a significant
> amount of time to fix, and that each fix is high risk, as it is likely
> to reveal latent bugs.  This means that each fix will require a lot of
> testing -- months of testing, in fact.  So the choice is really, do we
> release 6.1, or do we skip it and do a 6.2 in a few months.  As the
> release engineer, Scott has concluded that releasing now offers a great
> benefit to many people, although the bugs present may penalize some. 
> Mind you, in some cases the bugs also exist in 6.0, so they don't
> represent regressions, so much as bugs that continue to persist.  I
> agree with his conclusion: things like locking interactions in VFS are
> incredibly complicated, requiring extensive analysis and work to fix and
> test.  Trying to fix them for 6.1 is unrealistic.  They can be fixed in
> the next few weeks, tested for a month or two, and then merged to the
> RELENG_6_1 branch as errata fixes, similar to security advisories.
> 
> It's all about trade-offs.  People are welcome to (and frequently do)
> disagree with our analysis and choice on the trade-offs, but what I'm
> trying to emphasize in this e-mail is that these trade-offs are a
> reality.  They can't be ignored: bug-free releases of software can't be
> shipped because they don't exist, and therefore the argument (decision)
> is always about where the right balance is.  Arguing for waiting to ship
> until every last bug is fixed is arguing never to release software --
> bugs are present in all software, and not all latent either -- that's
> why products have errata notes (as does FreeBSD), patch levels, etc. 
> Don't take this to mean that we don't think fixing bugs is important, or
> that we don't spend long days and nights (and more days and more nights)
> working on it.
> 
> FWIW, if you look at the release process of any other commercial or open
> source software product, you'll see the same thing.  Either there's no
> bug database, or there's a very large database.  If there's no database,
> it's because the developer isn't being honest about there being bugs, or
> they have no testing.  If there's a huge database, they are honest, and the
> bugs are not all going to get fixed.  Software authors select bugs to fix based
> on the impact of the bugs and their ability to fix them.  I'd like to
> think we care more than some, but caring isn't enough to make computer
> software development perfect, or it would have happened a long time ago
> :-).
> 
> Thanks,
> 
> Robert N M Watson

thank you!
very nice!