Simpler SMP-friendly callout/timeout semantics

Wed Nov 17 15:22:58 PST 2004

The current callout/timeout API supports Giant-locked and MP-safe
usage options, but in both cases it is very difficult to use the
API in a way that is free of race conditions and complications.
Below is an attempt to extend that API to offer much simpler semantics
to many users while not limiting its usefulness or performance.

A summary of the difficulties faced when using the current callout
API can be found at:

    http://people.freebsd.org/~iedowse/callout/callout_current.txt

This subject has been discussed a few times before (e.g, search for
"Timeout and SMP race"). One of the suggestions, made by Archie
Cobbs, was to supply a mutex when the callout is created and have
the callout code use that to avoid many of the races. This is what
I've attempted to do. The patch below adds a new API function:

	callout_init_mtx(struct callout *c, struct mtx *mtx, int flags);

If a non-NULL `mtx' is supplied, then the callout system will acquire
that mutex before invoking the callout handler. Normally the handler
must return with the mutex still locked, but if the CALLOUT_RETURNUNLOCKED
flag is supplied, then the handler should unlock the mutex itself
before returning.

Since the callout system is acquiring the mutex itself, it can check
if the callout has been cancelled while holding the mutex and before
invoking the handler or dereferencing the callout function pointer
or argument. When combined with a requirement of holding the mutex
during calls to callout_stop() and callout_reset(), this entirely
eliminates the races, so callout_stop() and callout_reset() reliably
do what you'd expect.

The semantics of callouts initialised with the new function are:

 o When calling callout_stop() or callout_reset() on the callout,
   the specified mutex must be held.

 o Calls to callout_stop() and callout_reset() guarantee that the
   handler will not be invoked even if it has already been de-queued
   by softclock(). There are no races since the mutex is held by
   the caller of callout_stop() or callout_reset(). When the
   CALLOUT_RETURNUNLOCKED flag is used, it is however possible that
   the handler function is still running, but the because the caller
   holds the mutex, it is known that the handler, if active, has
   at least got past the mtx_unlock(). These functions do not sleep.

 o Before destroying the specified mutex, callout_drain() must be
   called. Since callout_drain() may sleep, the mutex may not be
   held at this time. If the callout has not been cancelled already,
   then it might be invoked while callout_drain() executes. To avoid
   this, call callout_stop() before releasing the mutex and then
   call callout_drain() afterwards. The call to callout_drain() is
   necessary to ensure that softclock() no longer references the
   mutex (e.g. it might still be waiting to acquire it).

The patch also changes the semantics of all existing Giant-locked
callouts to match the above with a mutex of Giant implicitly
specified. This means that Giant-locked users MUST hold Giant while
calling callout_stop(), callout_reset(), timeout() and untimeout(),
which will require a number of changes in various parts of the
kernel. It also means that fully Giant-locked code gets the race-free
semantics with no source changes.

The new API function handles some cases very well:
 - Code that uses just one mutex per instance, such as many device
   drivers.
 - Giant-locked code because again there is just one mutex involved.

However there are more complex uses of the callout API that currently
use more than one mutex due to lock ordering requirements. A good
example is the TCP code, where the timeout routines can acquire a
global TCP mutex before the per-connection mutex, so this new scheme
on its own cannot be used directly. Maybe it is possible to rework
the TCP code to avoid this, or even extend the callout API further
so that an alternative locking routine can be supplied if that
helps.

Do people think that a change like this is worth attempting? Are
there further improvements that could be made or any other suggestions
or comments?

Ian

Additional stuff:

 Patch to make the "re" driver use the new API (currently it uses
 timeout/untimeout without locking Giant):

   http://people.freebsd.org/~iedowse/callout/callout_re.diff


 Patch that attempts to close a number of timeout races in the
 ATA code using the new API:

   http://people.freebsd.org/~iedowse/callout/callout_ata.diff


 Finally the callout patch itself:


Index: sys/kern/kern_timeout.c
===================================================================
RCS file: /dump/FreeBSD-CVS/src/sys/kern/kern_timeout.c,v
retrieving revision 1.91
diff -u -r1.91 kern_timeout.c

--- sys/kern/kern_timeout.c	6 Aug 2004 21:49:00 -0000	1.91
+++ sys/kern/kern_timeout.c	17 Nov 2004 02:31:28 -0000
@@ -53,6 +53,9 @@
 static int avg_gcalls;
 SYSCTL_INT(_debug, OID_AUTO, to_avg_gcalls, CTLFLAG_RD, &avg_gcalls, 0,
     "Average number of Giant callouts made per softclock call. Units = 1/1000");
+static int avg_mtxcalls;
+SYSCTL_INT(_debug, OID_AUTO, to_avg_mtxcalls, CTLFLAG_RD, &avg_mtxcalls, 0,
+    "Average number of mtx callouts made per softclock call. Units = 1/1000");
 static int avg_mpcalls;
 SYSCTL_INT(_debug, OID_AUTO, to_avg_mpcalls, CTLFLAG_RD, &avg_mpcalls, 0,
     "Average number of MP callouts made per softclock call. Units = 1/1000");
@@ -80,6 +83,12 @@
  *                     If curr_callout is non-NULL, threads waiting on
  *                     callout_wait will be woken up as soon as the 
  *                     relevant callout completes.
+ *   curr_cancelled  - Changing to 1 with both callout_lock and c_mtx held
+ *                     guarantees that the current callout will not run.
+ *                     The softclock() function sets this to 0 before it
+ *                     drops callout_lock to acquire c_mtx, and it calls
+ *                     the handler only if curr_cancelled still 0 when
+ *                     c_mtx is successfully acquired.
  *   wakeup_ctr      - Incremented every time a thread wants to wait
  *                     for a callout to complete.  Modified only when
  *                     curr_callout is non-NULL.
@@ -88,6 +97,7 @@
  *                     cutt_callout is non-NULL.
  */
 static struct callout *curr_callout;
+static int curr_cancelled;
 static int wakeup_ctr;
 static int wakeup_needed;
 
@@ -181,6 +191,7 @@
 	int steps;	/* #steps since we last allowed interrupts */
 	int depth;
 	int mpcalls;
+	int mtxcalls;
 	int gcalls;
 	int wakeup_cookie;
 #ifdef DIAGNOSTIC
@@ -195,6 +206,7 @@
 #endif /* MAX_SOFTCLOCK_STEPS */
 
 	mpcalls = 0;
+	mtxcalls = 0;
 	gcalls = 0;
 	depth = 0;
 	steps = 0;
@@ -225,12 +237,14 @@
 			} else {
 				void (*c_func)(void *);
 				void *c_arg;
+				struct mtx *c_mtx;
 				int c_flags;
 
 				nextsoftcheck = TAILQ_NEXT(c, c_links.tqe);
 				TAILQ_REMOVE(bucket, c, c_links.tqe);
 				c_func = c->c_func;
 				c_arg = c->c_arg;
+				c_mtx = c->c_mtx;
 				c_flags = c->c_flags;
 				c->c_func = NULL;
 				if (c->c_flags & CALLOUT_LOCAL_ALLOC) {
@@ -242,11 +256,32 @@
 					    (c->c_flags & ~CALLOUT_PENDING);
 				}
 				curr_callout = c;
+				curr_cancelled = 0;
 				mtx_unlock_spin(&callout_lock);
-				if (!(c_flags & CALLOUT_MPSAFE)) {
-					mtx_lock(&Giant);
-					gcalls++;
-					CTR1(KTR_CALLOUT, "callout %p", c_func);
+				if (c_mtx != NULL) {
+					mtx_lock(c_mtx);
+					/*
+					 * The callout may have been cancelled
+					 * while we switched locks.
+					 */
+					if (curr_cancelled) {
+						mtx_unlock(c_mtx);
+						mtx_lock_spin(&callout_lock);
+						goto done_locked;
+					}
+					/* The callout cannot be stopped now. */
+					curr_cancelled = 1;
+
+					if (c_mtx == &Giant) {
+						gcalls++;
+						CTR1(KTR_CALLOUT, "callout %p",
+						    c_func);
+					} else {
+						mtxcalls++;
+						CTR1(KTR_CALLOUT,
+						    "callout mtx %p",
+						    c_func);
+					}
 				} else {
 					mpcalls++;
 					CTR1(KTR_CALLOUT, "callout mpsafe %p",
@@ -275,9 +310,10 @@
 					lastfunc = c_func;
 				}
 #endif
-				if (!(c_flags & CALLOUT_MPSAFE))
-					mtx_unlock(&Giant);
+				if ((c_flags & CALLOUT_RETURNUNLOCKED) == 0)
+					mtx_unlock(c_mtx);
 				mtx_lock_spin(&callout_lock);
+done_locked:
 				curr_callout = NULL;
 				if (wakeup_needed) {
 					/*
@@ -300,6 +336,7 @@
 	}
 	avg_depth += (depth * 1000 - avg_depth) >> 8;
 	avg_mpcalls += (mpcalls * 1000 - avg_mpcalls) >> 8;
+	avg_mtxcalls += (mtxcalls * 1000 - avg_mtxcalls) >> 8;
 	avg_gcalls += (gcalls * 1000 - avg_gcalls) >> 8;
 	nextsoftcheck = NULL;
 	mtx_unlock_spin(&callout_lock);
@@ -395,15 +432,26 @@
 	void	*arg;
 {
 
+	if (c->c_mtx != NULL)
+		mtx_assert(c->c_mtx, MA_OWNED);
+
 	mtx_lock_spin(&callout_lock);
-	if (c == curr_callout && wakeup_needed) {
+	if (c == curr_callout) {
 		/*
 		 * We're being asked to reschedule a callout which is
-		 * currently in progress, and someone has called
-		 * callout_drain to kill that callout.  Don't reschedule.
+		 * currently in progress.  If there is a mutex then we
+		 * can cancel the callout if it has not really started.
 		 */
-		mtx_unlock_spin(&callout_lock);
-		return;
+		if (c->c_mtx != NULL && !curr_cancelled)
+			curr_cancelled = 1;
+		if (wakeup_needed) {
+			/*
+			 * Someone has called callout_drain to kill this
+			 * callout.  Don't reschedule.
+			 */
+			mtx_unlock_spin(&callout_lock);
+			return;
+		}
 	}
 	if (c->c_flags & CALLOUT_PENDING) {
 		if (nextsoftcheck == c) {
@@ -446,13 +494,20 @@
 {
 	int wakeup_cookie;
 
+	if (!safe && c->c_mtx != NULL)
+		mtx_assert(c->c_mtx, MA_OWNED);
+
 	mtx_lock_spin(&callout_lock);
 	/*
 	 * Don't attempt to delete a callout that's not on the queue.
 	 */
 	if (!(c->c_flags & CALLOUT_PENDING)) {
 		c->c_flags &= ~CALLOUT_ACTIVE;
-		if (c == curr_callout && safe) {
+		if (c != curr_callout) {
+			mtx_unlock_spin(&callout_lock);
+			return (0);
+		}
+		if (safe) {
 			/* We need to wait until the callout is finished. */
 			wakeup_needed = 1;
 			wakeup_cookie = wakeup_ctr++;
@@ -468,6 +523,11 @@
 				cv_wait(&callout_wait, &callout_wait_lock);
 
 			mtx_unlock(&callout_wait_lock);
+		} else if (c->c_mtx != NULL && !curr_cancelled) {
+			/* We can stop the callout before it runs. */
+			curr_cancelled = 1;
+			mtx_unlock_spin(&callout_lock);
+			return (1);
 		} else
 			mtx_unlock_spin(&callout_lock);
 		return (0);
@@ -493,8 +553,29 @@
 	int mpsafe;
 {
 	bzero(c, sizeof *c);
-	if (mpsafe)
-		c->c_flags |= CALLOUT_MPSAFE;
+	if (mpsafe) {
+		c->c_mtx = NULL;
+		c->c_flags = CALLOUT_RETURNUNLOCKED;
+	} else {
+		c->c_mtx = &Giant;
+		c->c_flags = 0;
+	}
+}
+
+void
+callout_init_mtx(c, mtx, flags)
+	struct	callout *c;
+	struct	mtx *mtx;
+	int flags;
+{
+	bzero(c, sizeof *c);
+	c->c_mtx = mtx;
+	KASSERT((flags & ~CALLOUT_RETURNUNLOCKED) == 0,
+	    ("callout_init_mtx: bad flags %d", flags));
+	/* CALLOUT_RETURNUNLOCKED makes no sense without a mutex. */
+	KASSERT(mtx != NULL || (flags & CALLOUT_RETURNUNLOCKED) == 0,
+	    ("callout_init_mtx: CALLOUT_RETURNUNLOCKED with no mutex"));
+	c->c_flags = flags & CALLOUT_RETURNUNLOCKED;
 }
 
 #ifdef APM_FIXUP_CALLTODO
Index: sys/sys/callout.h
===================================================================
RCS file: /dump/FreeBSD-CVS/src/sys/sys/callout.h,v
retrieving revision 1.27
diff -u -r1.27 callout.h
--- sys/sys/callout.h	20 Apr 2004 15:49:31 -0000	1.27
+++ sys/sys/callout.h	17 Nov 2004 01:12:01 -0000
@@ -40,6 +40,8 @@
 
 #include <sys/queue.h>
 
+struct mtx;
+
 SLIST_HEAD(callout_list, callout);
 TAILQ_HEAD(callout_tailq, callout);
 
@@ -51,6 +53,7 @@
 	int	c_time;				/* ticks to the event */
 	void	*c_arg;				/* function argument */
 	void	(*c_func)(void *);	/* function to call */
+	struct mtx *c_mtx;			/* mutex to lock */
 	int	c_flags;			/* state of this entry */
 };
 
@@ -58,6 +61,7 @@
 #define	CALLOUT_ACTIVE		0x0002 /* callout is currently active */
 #define	CALLOUT_PENDING		0x0004 /* callout is waiting for timeout */
 #define	CALLOUT_MPSAFE		0x0008 /* callout handler is mp safe */
+#define	CALLOUT_RETURNUNLOCKED	0x0010 /* handler returns with mtx unlocked */
 
 struct callout_handle {
 	struct callout *callout;
@@ -75,6 +79,7 @@
 #define	callout_deactivate(c)	((c)->c_flags &= ~CALLOUT_ACTIVE)
 #define	callout_drain(c)	_callout_stop_safe(c, 1)
 void	callout_init(struct callout *, int);
+void	callout_init_mtx(struct callout *, struct mtx *, int);
 #define	callout_pending(c)	((c)->c_flags & CALLOUT_PENDING)
 void	callout_reset(struct callout *, int, void (*)(void *), void *);
 #define	callout_stop(c)		_callout_stop_safe(c, 0)