FCP 20190401-ci_policy: CI policy
Enji Cooper
yaneurabeya at gmail.com
Fri Aug 30 16:25:28 UTC 2019
> On Aug 27, 2019, at 21:29, Li-Wen Hsu <lwhsu at freebsd.org> wrote:
>
> It seems I did this wrong: I just changed the content of this FCP
> to "feedback", but did not send the announcement to the right mailing lists.
>
> So I would like to make an announcement that the FCP
> 20190401-ci_policy "CI policy":
>
> https://github.com/freebsd/fcp/blob/master/fcp-20190401-ci_policy.md
>
> is officially in the "feedback" state, to hopefully receive more comments
> and suggestions so that we can move on to the next FCP state.
First off, thank you Li-Wen and Kristof for spearheading this proposal; it’s a very contentious topic with a lot of strong emotions associated with it.
As the person who has integrated a number of tests and helped manage them for a few years (along with some of the care and feeding associated with them), I can say this task is non-trivial, particularly when issues I file in Bugzilla aren’t fixed quickly and linger in the tree for some time, impacting a lot of folks who rely on build and test suite stability.
The issue, as I see it from a CI/release perspective, is that the new policy attempts to define a notion of “stable” for both tests and other code; right now, stability is defined on an honor-system basis, with the FreeBSD test suite serving as a litmus test of sorts to convey a sense of stability.
======
One thing that I don’t see in the proposal is the health of the “make tinderbox” target in a CI world (this is a gap in our current CI process).
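A minimal sketch of what closing that gap could look like as a CI job step; the job wiring, paths, and log handling here are assumptions, while `make tinderbox` itself is the real target:

    # Hypothetical CI step: run the tinderbox target and fail the job on
    # any breakage, surfacing the tail of the log for triage.
    import subprocess
    import sys

    def run_tinderbox(srcdir="/usr/src", jobs=8):
        result = subprocess.run(
            ["make", "-C", srcdir, f"-j{jobs}", "tinderbox"],
            capture_output=True, text=True)
        if result.returncode != 0:
            # The failing arch/target is usually near the end of the output.
            sys.stderr.write(result.stdout[-4000:] + result.stderr[-4000:])
        return result.returncode == 0

    if __name__ == "__main__":
        sys.exit(0 if run_tinderbox() else 1)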
Another thing that I don’t see in the proposal is the health of head vs. stable and how it relates to MFCs. I see a lot more issues on stable branches go unfixed for some time, in part because some fixes or enhancements haven’t been MFCed. Part of the problem I see these days is a human/resource one: if developers can’t test their changes easily, they don’t MFC them.
This issue has caused me to do a fair amount of triage in the past when backporting changes, in order to discover potentially missing puzzle pieces needed to make my tests and code work.
======
The big issues, as I see them based on the discussion that has taken place in this thread, are revert timing and etiquette, and dealing with unreliable tests.
First off, revert timing and etiquette: while I see the FCP as an initial framework, I am a bit concerned with the heavy-handedness of “what constitutes needing reversion”: should a revert happen after N consistent failures within a certain period (be they build or test failures)? Furthermore, why is a human involved in making this decision (apart from a technical solution via automation perhaps not being available yet)?
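A rough sketch of what such automation could look like; the threshold, the Build record, and the surrounding hooks are all hypothetical, the point is only that the policy could pin down N instead of a human judgment call:

    # Hypothetical revert-decision automation: flag the revisions in a
    # run of N consecutive failing builds for reversion, no human needed.
    from dataclasses import dataclass

    @dataclass
    class Build:
        revision: str
        passed: bool

    CONSECUTIVE_FAILURES_BEFORE_REVERT = 3  # the "N" the policy should define

    def revisions_to_revert(history):
        """Return the revisions in the first run of N consecutive failures."""
        streak = []
        for build in history:
            if build.passed:
                streak.clear()
            else:
                streak.append(build.revision)
                if len(streak) >= CONSECUTIVE_FAILURES_BEFORE_REVERT:
                    return streak
        return []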
Second off, unreliable tests:
* Unreliable tests need to be qualified not on a single run, but on a pattern of runs.
The way this worked at Facebook was: if a test failed, the harness would rerun it multiple times (10 in total, IIRC). If the test failed consistently on a build, it would be automatically disabled, and all committers in the suspect revision range would be nagged as part of disabling it. This generally works because of the siloization of Facebook components, but it is a much harder problem to solve for FreeBSD, because FreeBSD is a complete OS distribution and small, seemingly disconnected changes can sometimes cause a lot of grief.
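A sketch of that rerun-then-quarantine logic, assuming 10 reruns; run_test(), disable_test(), and nag_committers() stand in for whatever our CI would actually provide:

    # Hypothetical flaky-test handling: rerun a failing test, quarantine
    # it only if it fails every attempt, and nag the suspect committers.
    MAX_RERUNS = 10  # Facebook reran failing tests ~10 times, IIRC

    def handle_failure(test_name, revision_range,
                       run_test, disable_test, nag_committers):
        for _ in range(MAX_RERUNS):
            if run_test(test_name):
                # Passed on a rerun: flaky, not broken; keep it enabled
                # but record the flake for later pattern analysis.
                return "flaky"
        # Failed every rerun: consistently broken on this build.
        disable_test(test_name)
        nag_committers(revision_range, test_name)
        return "disabled"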
So what to do?
I suggest expanding the executor pool and running individual suites instead of the whole batch of tests. While this wouldn’t fix everything and would be expensive with our current test infrastructure, it would allow folks to better pinpoint issues and still get some level of coverage, as opposed to throwing all test execution out, baby with the bathwater.
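A sketch of what per-suite fan-out could look like, assuming kyua(1) accepts the suite paths as test filters; the suite list and the serial loop are illustrative (a real setup would hand each suite to its own executor):

    # Hypothetical fan-out: one kyua invocation per suite instead of one
    # monolithic `kyua test` run over all of /usr/tests.
    import subprocess

    SUITES = ["bin", "lib/libc", "sys/netpfil", "usr.bin"]  # illustrative

    def run_suite(suite, tests_root="/usr/tests"):
        """Run one suite under kyua; returns True if the suite passed."""
        proc = subprocess.run(
            ["kyua", "test", "-k", f"{tests_root}/Kyuafile", suite])
        return proc.returncode == 0

    results = {suite: run_suite(suite) for suite in SUITES}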
How do we get there?
- Expand the CI executor pool.
- Provide a tool or process with which we can define test suites.
- Make spinning up executors faster: with virtual machines this is typically done using big-iron infrastructure clusters (e.g., ESXi clusters) and something like thin provisioning, where one starts from a common image/snapshot instead of taking the hit of copying images around. Linux can do this with btrfs; we can do this with ZFS via per-VM datasets, snapshotting, etc. (see the sketch below).
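A minimal sketch of the ZFS approach, with made-up pool and dataset names; the zfs snapshot/clone/destroy commands are real, everything else is illustrative:

    # Hypothetical executor spin-up: clone a golden VM image via ZFS
    # instead of copying it, so a fresh executor costs almost nothing.
    import subprocess

    GOLDEN = "zroot/ci/golden@latest"   # snapshot of a prepared VM image

    def zfs(*args):
        subprocess.run(["zfs", *args], check=True)

    def spin_up_executor(name):
        """Clone the golden snapshot into a per-VM dataset."""
        clone = f"zroot/ci/executors/{name}"
        zfs("clone", GOLDEN, clone)     # copy-on-write, near-instant
        return clone

    def tear_down_executor(name):
        zfs("destroy", f"zroot/ci/executors/{name}")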
While this only gets part of the way to a potential solution, it is a good way to begin solving the isolation/execution problem.
* A number of tests in the tree have varying quality/reliability; I agree that system-level tests (of which the pf tests are one of many) are less reliable than unit/API functional tests. This is the nature of the testing beast.
The core issue I see with the test suite as it stands is that it mixes integration/system-level tests (less deterministic) with functional/unit tests (generally more deterministic).
Using test mock frameworks would be a good technical solution for turning system tests into functional/unit tests (googlemock and unittest.mock are two of many good tools I know of in this area), but we need a way to run both kinds.
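A toy unittest.mock example of the idea: a "system" test that would normally shell out to ifconfig(8) becomes a deterministic unit test by mocking the process boundary (the function under test is made up):

    # Turn a system-level check into a unit test by mocking the
    # subprocess boundary instead of touching real network state.
    import subprocess
    import unittest
    from unittest import mock

    def iface_is_up(iface):
        """Hypothetical function under test: asks ifconfig about an iface."""
        out = subprocess.check_output(["ifconfig", iface], text=True)
        return "UP" in out

    class IfaceIsUpTest(unittest.TestCase):
        @mock.patch("subprocess.check_output")
        def test_up(self, check_output):
            check_output.return_value = "em0: flags=8843<UP,RUNNING> ..."
            self.assertTrue(iface_is_up("em0"))
            check_output.assert_called_once_with(["ifconfig", "em0"],
                                                 text=True)

    if __name__ == "__main__":
        unittest.main()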
I can now see why labeling test types was a concern when I first started this work (des@ and phk@ aired it at the time).
Part of the technical/procedural solution to allowing commingling of tests is to go back and label the tests appropriately. I’ll send out an FCP for this sometime in the next week or two.
======
Taking a step back, as others have brought up, we’re currently hindered by tooling: we are applying a DVCS-based (git, hg) technique (CI) to subversion, and testing changes after they’ve hit head instead of before they hit head.
While phabricator can partially solve this by testing up front (we don’t enforce this; I’ve made my concerns about it not being a requirement well known in the past), the solution is limited by testing bandwidth: testing is an all-or-nothing exercise right now, and building multiple toolchains/architectures takes a considerable amount of time. We could leverage cloud/distributed solutions for this (Cirrus CI, or Travis if the integration existed), but that would require either using github or teaching a tool how to make the appropriate REST API calls to run the tests and query their status (in progress, pass, fail, etc.).
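A sketch of the "teach a tool" option; the endpoint, the /runs resource, and the JSON shape are all invented for illustration:

    # Hypothetical client for a CI REST API: kick off a run for a change
    # and poll its status until it settles.
    import json
    import time
    import urllib.request

    API = "https://ci.example.org/api"   # made-up endpoint

    def _request(method, path, payload=None):
        data = json.dumps(payload).encode() if payload is not None else None
        req = urllib.request.Request(
            f"{API}{path}", data=data, method=method,
            headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)

    def test_revision(diff_id):
        run = _request("POST", "/runs", {"diff": diff_id})
        while True:
            status = _request("GET", f"/runs/{run['id']}")["status"]
            if status != "in-progress":      # pass, fail, error, ...
                return status
            time.sleep(30)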
Applying labels and filtering on test suites will get us partway to a final solution from a test perspective, but a lot of work needs to be done with phabricator, etc.
We also need to make build failures on tier-1 architectures with GENERIC a commit-blocking event. Full stop.
======
While some of the thoughts I put down above aren’t complete solutions, they are subproposals that should be pursued/things that could be worked on before implementing the proposed CI policy.
While I can’t work on it now, December break is coming up, and with it I’ll have more time to work on projects like this. I’ll put down some TODO items so I can look at tackling them during the break.
Thank you,
-Enji