FCP 20190401-ci_policy: CI policy

Sat Aug 31 17:01:15 UTC 2019

After this week's discussions, I got to thinking about where we were on this
topic.  I thought I'd spend a few minutes summarizing my impressions so
that we don't get bogged down in the details of disagreement, but rather
start from where we agree. Here's my take on what that is, but I write this
down as the basis for discussion, not to lay down the law.

I think there's consensus on the following points:

(1) We want CI.
(2) We want developers to be responsive to breakage in CI.
(3) At the moment, we have some sub-optimal tools, and when we change those
we should evolve the process.
(4) Build breakage is something that should be fixed very quickly.
(5) When there's breakage, people should reach out ASAP to the original
developer who committed the change.
(6) If the original developer can't fix the problem in a timely manner,
it's OK to either back out the change, or perhaps commit a tiny fix if the
original was huge and the fix needed to repair the build is tiny.
(7) Breaking tests is a problem, but our tests need to evolve because we
have too high a rate of false positives.
(8) There's a sliding scale of urgency (Tier 1 build breakage needs to be
fixed in a couple of hours, Tier 2 can go a day, Tier 3 can go longer but
shouldn't linger).
(9) The urgency for a test regression is also a sliding scale, but given
the current state of the tests we need to apply judgement on revert vs fix
test vs disable test.
(10) Reverts shouldn't be feared, but there's a cost to reverting
automatically and there's some desire for developers to have a chance to be
in the loop.
(11) We need to work on the social aspect of reverts to destigmatize them.
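
As a rough illustration of points (6) and (8), here is a minimal Python
sketch of the sliding scale for build breakage. It is not a proposal for
actual tooling. The names FIX_WINDOW, fix_deadline and backout_is_reasonable
are made up for this example, and reading "a couple of hours" as 2 hours and
"a day" as 24 hours is my assumption. Tier 3 has no firm number above, so the
sketch leaves it without a deadline rather than inventing one.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Fix windows loosely taken from point (8). The exact values are assumptions:
# "a couple of hours" is read as 2 hours and "a day" as 24 hours. Tier 3 has
# no stated limit ("longer but shouldn't linger"), so it is left as None.
FIX_WINDOW = {
    1: timedelta(hours=2),
    2: timedelta(hours=24),
    3: None,
}


def fix_deadline(tier: int, broken_at: datetime) -> Optional[datetime]:
    """Return when a build break should be fixed by, or None if no firm limit."""
    window = FIX_WINDOW[tier]
    return None if window is None else broken_at + window


def backout_is_reasonable(tier: int, broken_at: datetime, now: datetime) -> bool:
    """True once the tier's fix window has passed without a fix (point 6).

    This only models the timing; reaching out to the committer first
    (point 5) and the judgement calls in point (9) are not modelled.
    """
    deadline = fix_deadline(tier, broken_at)
    return deadline is not None and now >= deadline


if __name__ == "__main__":
    broken = datetime(2019, 8, 31, 17, 0, tzinfo=timezone.utc)
    three_hours_later = broken + timedelta(hours=3)
    for tier in (1, 2, 3):
        print(tier, backout_is_reasonable(tier, broken, three_hours_later))
```

With these assumed windows, a Tier 1 break that is three hours old is past
its window, while a Tier 2 break of the same age is not.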

We might quibble a bit over timelines for the different pieces, but here's
what I've noticed the approximate timelines are today (there are
exceptions, and I don't have hard data, just my sense from watching the
tree, but I think they form a reasonable basis absent better data):

* Build system breakage is usually fixed within an hour (e.g., I screwed up
a Makefile or bsd.foo.mk file somehow).
* x86 build breakages are usually fixed in an hour or two (longer over the
weekend).
* arm and arm64 build breakages are usually fixed within 4-8 hours.
* Other build breakages are usually fixed within a day or two.
* Out-of-tree compiler breakages are fixed on the order of a week.
* I have little data on test breakage, but it's my sense most issues are
resolved in less than a week.

We've been quite reluctant to do reverts to date. They happen, but have
usually been initiated by the committer. Li-Wen and others would like to
change that by setting firm timelines, starting to reset the social aspect
of reverts, and documenting the social norms with an eye towards improving
things, either within the SVN framework or the coming git framework.

Finally, there are a number of ways we can do this, not limited to the FCP in
question. However, nobody has stepped up to drive that initiative. It would
be great to have someone sign up to drive revisions of the various internal
developer guides to modernize their content to reflect how norms have
evolved since they were written (including this topic and the related topic
of responsiveness to @freebsd.org email and others...).

Warner