quota deadlock on 6.1-RC1

Wed May 3 10:25:44 UTC 2006

On Tue, 2 May 2006, Kris Kennaway wrote:

>>>> Ditto, same thing with the recent nve fixes. Why release known broken
>>>> code when there are tested patches available? Whats the worst that will
>>>> happen? It wont work? Thats already the case...
<...>
>
> OK, I can't speak to that issue specifically.
>
> Generally, though, the worst that can happen is "you fix one problem 
> affecting a subset of users and replace it with a larger problem affecting a 
> larger subset of users".
>
> If there's doubt about the impact of a change, 10 seconds before the release 
> is not the appropriate time to cram it in.
<...>

I just want to comment a bit on this issue, because I've seen a number of 
posts on FreeBSD mailing lists over the last few years that suggest that there 
may be some misunderstandings about software development and release 
processes.

The invariant that needs to be understood is that all software is buggy; 
arguments have been made that the number of bugs increases linearly with code 
size, and there have also been arguments made that the number of bugs 
increases with code complexity, so you can see a non-linear increase in bugs 
with code growth.  This means that you're talking about several bugs per 
thousand lines of code in most software, and for code that contains millions 
of lines of code (such as the FreeBSD kernel, Linux kernel, Apache, PHP, 
MySQL, PostgreSQL, Windows, Word, iTunes, etc), you're talking thousands or 
tens of thousands of bugs.  And that's in a static version of the code, not 
even taking into account new features in an active code base that are still 
being "debugged"!
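
To put rough numbers on that, here is a minimal sketch of the linear model 
only (bugs proportional to code size).  The function name, the code sizes, and 
the bugs-per-thousand-lines figures are hypothetical values picked for 
illustration; they are not measurements of any of the projects named above.

```python
def estimated_bugs(lines_of_code: int, bugs_per_kloc: float) -> float:
    """Estimate a bug count, assuming bugs scale linearly with code size."""
    return lines_of_code / 1000 * bugs_per_kloc


# Hypothetical inputs, chosen only to show the order of magnitude.
for loc in (1_000_000, 5_000_000):
    for density in (1, 3, 5):
        bugs = estimated_bugs(loc, density)
        print(f"{loc:>9,} lines at {density}/KLOC -> ~{bugs:,.0f} bugs")
```

Even the low end of these made-up inputs lands in the thousands, which is the 
point: the count stays large whatever plausible density you plug in.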

Bugs fall into a lot of different categories, but from the perspective of risk 
management, it's useful to think of them in two categories: latent bugs, which 
are unreported, unobserved, or occur only in exceptional or generally 
untriggered circumstances, and non-latent bugs, which have been reported, are 
triggered in practice, etc.  The tricky ones are the latent bugs, because you 
may not know that they are there, or you may know that they are there but 
they trigger so infrequently, or in such unusual edge cases, that they might 
almost as well not be there.

Release engineering is really about two things: structuring/nurturing the 
process of developing releases (tracking issues, identifying people to fix 
them, testing, branch management, building, etc), and risk management.  The 
risk management aspect is that you want to improve the quality of the release 
by taking actions, typically adopting source changes, which may improve 
testing results.  Each change potentially affects both visible and latent 
bugs.  Bug fixes in one piece of code may change the timing of the code, the 
side effects, undocumented assumptions, or simply allow access to code 
previously not executed because the bug prevented it.  If you allow a bug fix 
into the tree, you risk uncovering new bugs.  So the choice isn't "Accept a 
bug fix or not", it's "Will accepting this bug fix generally improve or reduce 
the quality of the release" -- i.e., will the change fix the bug it is claimed 
to fix, and will it result in lots of latent bugs suddenly becoming visible?

Particularly with hardware drivers like nve, this is non-trivial, because the 
behavior of the hardware is very subtle, there's lots of variety in the 
shipped hardware, and the vendor is (or appears) highly unsupportive.  The 
result is that if you tweak a register or a minor piece of behavior, it may 
dramatically improve support for a particular piece of hardware but break all 
the rest.  The only way to mitigate this risk is through extensive testing, 
and extensive testing takes a lot of time.  And by a lot of time, I mean, a 
long release cycle.  So if we want to adopt a fix that is high risk -- i.e., 
one that we believe will interact in subtle ways that affect different machines 
differently -- we need to make the change early in the release cycle, not at 
the end.  If we make it at the end, we are shipping code that is effectively 
untested on a large number of systems.  Sure, it will fix one, but if it 
breaks the rest, is it worth it?  The only alternative is to restart the 
testing process, which in the case of high-risk drivers, means adding months 
to the release cycle.

And you can see where this is leading: if you significantly delay the release 
cycle for each minor bug, you will never release.  At some point, you have to 
make the decision "although this release isn't perfect, we'll never release if 
we don't ship now".  I know that sounds like a bad thing, but you'll find that 
that practice is not only found in every part of the software industry, but 
it's also impossible to avoid, since bug-free software is impossible to produce.

When you look at the RC2 release notes Scott recently sent, he identifies four 
bugs that he believes won't be fixed in time for the release.  He decided that 
this was the case using risk management: each bug actually likely represents 
several bugs with the same features, in highly complex code.  This means that 
they will take a significant amount of time to fix, and that each fix is high 
risk, as it is likely to reveal latent bugs.  This means that each fix will 
require a lot of testing -- months of testing, in fact.  So the choice is 
really, do we release 6.1, or do we skip it and do a 6.2 in a few months.  As 
the release engineer, Scott has concluded that releasing now offers a great 
benefit to many people, although the bugs present may penalize some.  Mind 
you, in some cases the bugs also exist in 6.0, so they don't represent 
regressions, so much as bugs that continue to persist.  I agree with his 
conclusion: things like locking interactions in VFS are incredibly 
complicated, requiring extensive analysis and work to fix and test.  Trying to 
fix them for 6.1 is unrealistic.  They can be fixed in the next few weeks, 
tested for a month or two, and then merged to the RELENG_6_1 branch as errata 
fixes, similar to security advisories.

It's all about trade-offs.  People are welcome to (and frequently do) disagree 
with our analysis and choice on the trade-offs, but what I'm trying to 
emphasize in this e-mail is that these trade-offs are a reality.  They can't 
be ignored: bug-free releases of software can't be shipped because they don't 
exist, and therefore the argument (decision) is always about where the right 
balance is.  Arguing for waiting to ship until every last bug is fixed is 
arguing never to release software -- bugs are present in all software, and not 
all of them are latent either -- that's why products have errata notes (as does 
FreeBSD), patch levels, etc.  Don't take this to mean that we don't think 
fixing bugs is important, or that we don't spend long days and nights (and 
more days and more nights) working on it.

FWIW, if you look at the release process of any other commercial or open 
source software product, you'll see the same thing.  Either there's no bug 
database, or there's a very large database.  If there's no database, it's 
because the developer isn't being honest about there being bugs, or they have 
no testing.  If there's a huge database, they are being honest, and the bugs 
in it are not all going to get fixed before shipping.  Software authors select 
bugs to fix based on the impact of the bugs and their ability to fix them.  
I'd like to think we care more than 
some, but caring isn't enough to make computer software development perfect, 
or it would have happened a long time ago :-).

Thanks,

Robert N M Watson