6.0-BETA2: taskqueue_drain for if_xl.c:2796

Mon Aug 15 06:23:29 GMT 2005

John Baldwin wrote:
> On Thursday 11 August 2005 07:07 pm, Scott Long wrote:
> 
>>John Baldwin wrote:
>>
>>>On Thursday 11 August 2005 05:41 pm, John Baldwin wrote:
>>>
>>>>On Thursday 11 August 2005 04:09 pm, Joerg Pulz wrote:
>>>>
>>>>>Hi,
>>>>>
>>>>>with a fresh installed 6.0-BETA2 i get this when xl(4) gets configured
>>>>>at the system startup.
>>>>>System is P3-800MHz SMP. dmesg is attached.
>>>>
>>>>I'm working on fixes for this.  Ping me in a day or so for a patch.
>>>
>>>Ok, I've got a patch.  I added a taskqueue_stop() function to bring
>>>taskqueue's a bit closer inline with the callout*() API and use
>>>taskqueue_stop() in xl_stop() as it is ok to be called with locks held
>>>and doesn't block.  The xl task handler function now bails if
>>>IFF_DRV_RUNNING is clear, and I added a taskqueue_drain() in detach to
>>>make sure we were finished with the mutex and function before detach
>>>finishes.  Unfortunately, the patch is to HEAD, but you can probably get
>>>it to work on 6.x by changing if_drv_flags to if_flags and
>>>IFF_DRV_RUNNING to IFF_RUNNING on 6.x.
>>>
>>>http://www.FreeBSD.org/~jhb/patches/xl_locking
>>
>>It looks like taskqueue_stop merely removes a pending task from the
>>queue, it doesn't protect against there being a task already running
>>and/or sleeping.  I know that you're looking for the convenience of
>>being able to cancel a taskqueue without having to worry about locks,
>>but ignoring the possibility of an in-progress task is dangerous.  It's
>>incovenient, but it's the price of concurrency in the kernel.  I've
>>objected to callout_stop for the same reason.  Never the less, if you're
>>looking to have a similar API as callout_*, why not follow their model
>>and have _taskqueue_stop_safe() ?
> 
> 
> Note that I modified xl_rxeof_task to not call the task handler if 
> IFF_DRV_RUNNING is clear to handle this race.  I also just added locally 
> another check after it relocks the driver lock after if_input() in xl_rxeof()  
> to bail if the interface has been stopped as well which should handle that 
> race.  Both of these races are present even in the taskqueue_drain() case as 
> you need the task to die if taskqueue_drain() is ever going to unblock.
> 
> Also, I specifically call taskqueue_drain() after xl_stop() in xl_detach() 
> (same as I do for callouts) to make sure it really has stopped in the one 
> case where I need more assurance than just taskqueue_stop().  (The drains 
> here for both callout and taskqueue are to make sure that if the other thread 
> is blocked on our lock when we call xl_stop(), it will have a chance to get 
> it and release it so it's not contested when we go to destroy it later and to 
> ensure that the threads won't be in the text of our module.)
> 
> I didn't try to merge stop and drain for taskqueue as the current 
> taskqueue_drain() is different from callout_drain() in that callout_drain() 
> still removes the callout if possible and then waits, whereas 
> taskqueue_drain() makes no attempt currently to dequeue the task but just 
> waits on it to finish if it is either in the queue or running.  One nicer 
> thing about callouts is that you can use callout_init_mtx() which lets you 
> push the initial equivalent to the IFF_DRV_RUNNING check out into softclock() 
> since it can check to see if you are cancelled after grabbing your lock.  
> Given the fact that tasks can sleep, I think this would be less useful for 
> taskqueue as task handlers need to handle stop races around sleeps so they 
> might as well handle it during startup as well so that it is more consistent 
> that they handle it in all cases instead of just some of them.
> 
> Being able to stop separately from drain is not "bad", it's just a different 
> model, it lets you separate "fire off an event now" from "wait for ack from 
> the event" instead of forcing you to do it all at once.
> 

I still don't see what the point is of having a function that only
removes pending tasks from the queue if it doesn't also drain in
progress tasks.  Simple example:

CPU1 calls xl_stop.  The XL_LOCK gets taken.  At the same time, CPU2
calls taskqueue_swi, which pops the queue and calls xl_rxeof_task.  It
tries to take the XL_LOCK but instead sleeps (or adaptively spins)
because CPU1 already has it.  CPU1 calls taskqueue_stop, which doesn't
do anything because the task was alrady popped off.

Now, at some point, CPU1 **MUST** drop the XL_LOCK and let CPU2 continue
that task to completion.  There simply is no way around that, right?  So
why muddle the API with a function that really doesn't solve any
problems.  The argument that being able to call taskqueue_stop with
locks held is a convenience doesn't hold water.  You'll still need to
drop locks in order to be sure that a mess isn't created.

So what does that really buy you since you still have to deal with the
possible blocked task problem?  Better to just organize the code so that
you can call taskqueue_dequeue with locks released and let the problem
be dealt with in a consistent way, rather than encourage people roll
their own dequeueing logic (best case) or become careless with
parallelization (worse case).  I think that all that you are doing is
finding a way to cheat around a real synchronization problem.  There are
a lot of network drivers in the tree that are notorious for
fundamentally poor locking design.  I'd really rather see them be fixed
to conform to good APIs, rather than add marginal APIs that paper over
the problems.

Scott