svn commit: r325734 - head/sys/amd64/amd64

Bruce Evans brde at optusnet.com.au
Sun Nov 12 06:48:47 UTC 2017


On Sun, 12 Nov 2017, Mateusz Guzik wrote:

> Log:
>  amd64: stop nesting preemption counter in spinlock_enter
>
>  Discussed with:	jhb

This seems to break it.  i386 still has the old code.

It also moves the critical section a little, so that its placement is now
inconsistent between enter and exit.  I think the old critical section
placement was harmlessly wrong, and moving it further would fix the new
bugs too.

> Modified: head/sys/amd64/amd64/machdep.c
> ==============================================================================
> --- head/sys/amd64/amd64/machdep.c	Sun Nov 12 02:34:33 2017	(r325733)
> +++ head/sys/amd64/amd64/machdep.c	Sun Nov 12 03:13:01 2017	(r325734)
> @@ -1853,9 +1853,9 @@ spinlock_enter(void)
> 		flags = intr_disable();
> 		td->td_md.md_spinlock_count = 1;
> 		td->td_md.md_saved_flags = flags;
> +		critical_enter();
> 	} else
> 		td->td_md.md_spinlock_count++;
> -	critical_enter();
> }
>
> void

The main broken case is:
- both levels initially 0
- disable interrupts
- raise spinlock count to 1
- bad race window until critical_enter() is called (see the annotated
   snippet below).  Disabling hardware interrupts doesn't prevent
   exceptions like debugger traps or NMIs.  Debugger trap handlers
   shouldn't use critical sections or (spin) mutexes, but NMI handlers
   might.  When an exception handler calls spinlock_enter(), this no
   longer gives a critical section, and bad things like context switches
   occur if the handler calls critical_enter/exit().
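
To make the window concrete, here is the function as committed, with the
window marked (the annotation is mine, not part of the commit):

void
spinlock_enter(void)
{
	struct thread *td;
	register_t flags;

	td = curthread;
	if (td->td_md.md_spinlock_count == 0) {
		flags = intr_disable();
		td->td_md.md_spinlock_count = 1;
		td->td_md.md_saved_flags = flags;
		/*
		 * Window: an NMI taken here sees md_spinlock_count == 1,
		 * so a nested spinlock_enter() in its handler takes the
		 * "else" branch and returns without a critical section.
		 */
		critical_enter();
	} else
		td->td_md.md_spinlock_count++;
}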

The main old race is:
- as above, except the race is not so bad (I think it is harmless).
   Nested calls to spinlock_enter() can still occur, but at all levels,
   spinlock_enter() never returns without entering a critical section,
   so callers aren't affected.  I think the worst that happens is that
   when a nested spinlock_exit() lowers the critical level to 0, context
   switches may occur.  This isn't a large problem since the outer
   spinlock_enter() levels are still in effect -- they have disabled
   interrupts, but nothing depends on this.

See r214835 for larger bugs related to the old race.  Exceptions cannot
be prevented, and r214835 defends against them only for the spinlock
context.

Defending against preemption is even easier.  Just move the critical_enter()
to before disabling interrupts.
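
A minimal sketch of that ordering (my reading of the suggestion, not
committed code; it keeps the non-nested counting from r325734):

void
spinlock_enter(void)
{
	struct thread *td;
	register_t flags;

	td = curthread;
	if (td->td_md.md_spinlock_count == 0) {
		critical_enter();	/* before intr_disable(): no window */
		flags = intr_disable();
		td->td_md.md_spinlock_count = 1;
		td->td_md.md_saved_flags = flags;
	} else
		td->td_md.md_spinlock_count++;
}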

> @@ -1865,11 +1865,12 @@ spinlock_exit(void)
> 	register_t flags;
>
> 	td = curthread;
> -	critical_exit();
> 	flags = td->td_md.md_saved_flags;
> 	td->td_md.md_spinlock_count--;
> -	if (td->td_md.md_spinlock_count == 0)
> +	if (td->td_md.md_spinlock_count == 0) {
> +		critical_exit();
> 		intr_restore(flags);
> +	}
> }
>
> /*

This used to call critical_exit() at the beginning.  This corresponds to
calling critical_enter() at the end of spinlock_enter(), and gives the
same race.

Now it calls critical_exit() after lowering the spinlock count.  This closes
only half of the old race window.

After moving the call to critical_enter() earlier, moving the call to
critical_exit() later to match closes all of the old race window.
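
And the matching exit, with critical_exit() moved to the very end (again
only a sketch of the suggested ordering):

void
spinlock_exit(void)
{
	struct thread *td;
	register_t flags;

	td = curthread;
	flags = td->td_md.md_saved_flags;
	td->td_md.md_spinlock_count--;
	if (td->td_md.md_spinlock_count == 0) {
		intr_restore(flags);
		critical_exit();	/* last, mirroring critical_enter() first */
	}
}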

The new bug only affects spinlock entry.  For exit, there is just the old
race with a reduced window.

I don't like this change.  The nested counting is easier to understand,
the nested case is very rare, and the critical section part of it is
very efficient (a nested critical_exit() just lowers the level).  Old
versions of these functions were written to try to reduce branches and
other bad scheduling for the usual non-nested case.  Perhaps this isn't
done right.  It is intentionally not obfuscated with __predict_*().
spinlock_exit() ended up without the optimization(?) of avoiding the
decrement in the usual case, whereas spinlock_enter() uses special 0/1
logic for the usual case and needs extra code for the increment in the
unusual case.  Since the slow parts are probably the critical_*() function
calls and the flags changes, perhaps those should be optimized instead.
The old version tried to do this for the function calls by doing them
unconditionally.  This should have worked best for the call in
critical_exit() since that call is first (but should be last).  The order
is critical, and the flags changes use volatile asms which force the order
more than elsewhere.
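
For context, the reason a nested critical_exit() is so cheap is that the
primitives are essentially a per-thread nesting counter; roughly (a
simplified model, not the kernel's exact code):

void
critical_enter(void)
{

	curthread->td_critnest++;
}

void
critical_exit(void)
{
	struct thread *td;

	td = curthread;
	td->td_critnest--;
	if (td->td_critnest == 0 && td->td_owepreempt) {
		/*
		 * Outermost exit with a preemption pending: only here
		 * does the deferred context switch actually happen.
		 */
	}
}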

I think the nested case is only for recursive spinlocks.  So NMI handlers
should not use any spinlocks and the new bug is small (NMI handlers should
not use non-recursive spinlocks since they would deadlock, and should not
use recursive spinlocks since they don't work).  NMI handlers aren't that
careful.  They call printf(), and even the message buffer has been broken
to use non-recursive spinlocks.

Bruce

