FAIL: kernel fault injection

Rui Paulo rpaulo at freebsd.org
Tue May 12 16:19:49 UTC 2009


On 11 May 2009, at 17:29, Zachary Loafman wrote:

> Arch -
>
> I'd like to contribute the kernel fault injection system that Isilon
> uses. Before contributing it, I'd like to get approval for the APIs
> involved.
>
> Testing errors is hard. Let's say you have:
>
> int foo(void) {
>  [...]
>  error = bar();
>  if (error) {
>    /* do stuff */
>  }
> }
>
> .. but bar() can't reliably be made to fail. How do you test it?
> We added error injection macros that look like this:
>
> int foo(void) {
>  [...]
>  error = bar();
>  KFAIL_POINT_CODE(FP_KERN, bar_fails_foo, error = RETURN_VALUE);
>  if (error) {
>    /* do stuff */
>  }
> }
>
> The KFAIL_POINT_CODE macro adds a sysctl MIB that allows
> you to inject errors into the above code. For example:
>
> # sysctl fail_point.kern.bar_fails_foo=".1%return(5)"
>
> This says, ".1% of the time, evaluate the fail point code with 5 as
> the RETURN_VALUE". If this were a standard errno, you could read the
> above setting as "1/1000th of the time, pretend bar() returned EIO".
>
> We also have a few wrappers around KFAIL_POINT_CODE that essentially
> wrap common uses. For example, the above use can be shorthanded to:
>  KFAIL_POINT_ERROR(FP_KERN, bar_fails_foo, error)
>
> Currently, the sysctl parser accepts the following variants:
>  return(x) - triggers the code with RETURN_VALUE set to x
>  sleep(t) - sleep t milliseconds,
>  panic/break - panic or break into the debugger
>  print - print that the fail point was hit
>
> In addition to the commands, we have a syntax to express
> when to evaluate those commands:
>  p%<command> - evaluate command p% of the time (example above)
>  5*<command> - evaluate command 5 times, then disable the expression
>
> And you can compound with expr1->expr2, so, e.g.:
>  5%return(5)->1%return(22):
>    5% of the time, return 5, 1% of the remaining time, return 22
>  5*return(0)->10*return(5)->1%return(19):
>    return 0 for 5 times, then 5 for 10 times, and after those,
>    return 19 1% of the time.
>  1%5*return(22):
>    1/100th of the time, return 22, but only do it 5 times total.
>
> I've also attached an ascii rendering of a (rough draft) man page that
> goes into more detail.
>
> Comments?

This is great and I would like to see this go in. I just have two minor
modifications (possible bikeshed, but whatever):
* What about kern.fail_point instead of fail_point.kern? This
framework seems to be only for the kernel.
* On the man page, you don't explain the 'sleep' type. Is that on  
purpose?

About the CAVEAT section on the man page (second paragraph), do you
have any ideas on how to check whether msleep is being called in a
valid context?
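
To check that I am reading the expression syntax quoted above correctly,
here is a small userland sketch of how I assume a "->" chain gets
evaluated. It is not taken from the Isilon code: struct fp_term,
fp_eval() and fp_roll_rand() are names made up for illustration, and
parsing of the sysctl string is left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* One term of an expression such as "1%5*return(22)" (hypothetical layout). */
struct fp_term {
	double	percent;	/* chance of firing per evaluation, 0..100 */
	int	count;		/* firings left; -1 means unlimited */
	int	retval;		/* value handed back as RETURN_VALUE */
};

/* Source of random numbers in [0, 100). */
typedef double (*fp_roll_t)(void);

double
fp_roll_rand(void)
{
	return ((double)rand() / ((double)RAND_MAX + 1.0) * 100.0);
}

/*
 * Walk the chain in order.  A term whose count is used up is skipped,
 * and a term that misses its percentage roll falls through to the next
 * one, so a later percentage applies only to the "remaining time".
 * Returns 1 and stores the term's value in *retval if a term fired,
 * 0 if nothing fired.
 */
int
fp_eval(struct fp_term *chain, size_t nterms, fp_roll_t roll, int *retval)
{
	size_t i;

	assert(chain != NULL || nterms == 0);
	for (i = 0; i < nterms; i++) {
		struct fp_term *t = &chain[i];

		if (t->count == 0)
			continue;	/* exhausted, try the next term */
		if (t->percent < 100.0 && roll() >= t->percent)
			continue;	/* missed the roll, try the next term */
		if (t->count > 0)
			t->count--;	/* consume one of the N* firings */
		*retval = t->retval;
		return (1);
	}
	return (0);
}
```

With a chain of { 100, 5, 0 } followed by { 100, 10, 5 }, which is how I
would encode 5*return(0)->10*return(5), fp_eval() hands back 0 five
times, then 5 ten times, and then stops firing. If the real semantics
differ from this, the man page could probably spell them out.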

Thanks.
--
Rui Paulo


More information about the freebsd-arch mailing list