two problems in dev/e1000/if_lem.c::lem_handle_rxtx()

John Baldwin jhb at freebsd.org
Sat Jan 19 16:16:59 UTC 2013


On Saturday, January 19, 2013 10:47:30 AM Barney Cordoba wrote:
> --- On Fri, 1/18/13, John Baldwin <jhb at freebsd.org> wrote:
> > From: John Baldwin <jhb at freebsd.org>
> > Subject: Re: two problems in dev/e1000/if_lem.c::lem_handle_rxtx()
> > To: freebsd-net at freebsd.org
> > Cc: "Barney Cordoba" <barney_cordoba at yahoo.com>, "Adrian Chadd"
> > <adrian at freebsd.org>, "Luigi Rizzo" <rizzo at iet.unipi.it>
> > Date: Friday, January 18, 2013, 11:49 AM
> > 
> > On Friday, January 18, 2013 9:30:40 AM Barney Cordoba wrote:
> > > --- On Thu, 1/17/13, Adrian Chadd <adrian at freebsd.org> wrote:
> > > > From: Adrian Chadd <adrian at freebsd.org>
> > > > Subject: Re: two problems in dev/e1000/if_lem.c::lem_handle_rxtx()
> > > > To: "Barney Cordoba" <barney_cordoba at yahoo.com>
> > > > Cc: "Luigi Rizzo" <rizzo at iet.unipi.it>, freebsd-net at freebsd.org
> > > > Date: Thursday, January 17, 2013, 11:48 AM
> > > > There's also the subtle race condition in TX and RX handling that
> > > > re-queuing the taskqueue gets around.
> > > > 
> > > > Which is:
> > > > 
> > > > * The hardware is constantly receiving frames, right until you blow
> > > > the FIFO away by filling it up;
> > > > * The RX thread receives a bunch of frames;
> > > > * .. and processes them;
> > > > * .. once it's done processing, the hardware may have read some more
> > > > frames in the meantime;
> > > > * .. and the hardware may have generated a mitigated interrupt which
> > > > you're ignoring, since you're processing frames;
> > > > * So if your architecture isn't 100% paranoid, you may end up having
> > > > to wait for the next interrupt to handle what's currently in the
> > > > queue.
> > > > 
> > > > Now if things are done correctly:
> > > > 
> > > > * The hardware generates a mitigated interrupt;
> > > > * The mask register has that bit disabled, so you don't end up
> > > > receiving it;
> > > > * You finish your RX queue processing, and there's more stuff that's
> > > > appeared in the FIFO (hence why the hardware has generated another
> > > > mitigated interrupt);
> > > > * You unmask the interrupt;
> > > > * .. and the hardware immediately sends you the MSI or signals an
> > > > interrupt;
> > > > * .. thus you re-enter the RX processing thread almost(!)
> > > > immediately.
> > > > 
> > > > However, as the poster(s) have said, the interrupt mask/unmask in
> > > > the Intel driver(s) may not be 100% correct, so you're going to end
> > > > up with situations where interrupts are missed.
> > > > 
> > > > The reason why this wasn't a big deal in the deep/distant past is
> > > > because we didn't use to have kernel preemption, or multiple kernel
> > > > threads running, or an overly aggressive scheduler trying to
> > > > parallelise things as much as possible. A lot of net80211/ath bugs
> > > > have popped out of the woodwork specifically because of the above
> > > > changes to the kernel. They were bugs before, but people didn't hit
> > > > them.
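
Concretely, the correct flow Adrian describes maps onto code along
these lines.  This is only a sketch in the style of lem(4), not the
actual if_lem.c source; the register macros are the driver's but the
helper name and signatures here are simplified stand-ins:

static void
lem_intr_sketch(void *arg)
{
	struct adapter *adapter = arg;

	/* Mask further interrupts while we work (write IMC). */
	E1000_WRITE_REG(&adapter->hw, E1000_IMC, IMS_ENABLE_MASK);

	/*
	 * Drain the RX ring (stand-in for the real lem_rxeof()).
	 * Frames arriving meanwhile latch another mitigated
	 * interrupt in the hardware, but it stays masked.
	 */
	lem_rxeof_sketch(adapter);

	/*
	 * Unmask (write IMS).  If an interrupt was latched while we
	 * were processing, it is delivered immediately, so the RX
	 * path re-enters almost at once instead of stalling until
	 * the next mitigation window.
	 */
	E1000_WRITE_REG(&adapter->hw, E1000_IMS, IMS_ENABLE_MASK);
}
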
> > > 
> > > I don't see the distinction between the rx thread getting
> > > re-scheduled "immediately" vs introducing another thread. In fact
> > > you increase missed interrupts by this method. The entire point of
> > > interrupt moderation is to tune the intervals at which a driver is
> > > processed.
> > > 
> > > You might as well just not have a work limit and process until
> > > you're done. The idea that "gee, I've been taking up too much cpu,
> > > I'd better yield" to just queue a task and continue soon after
> > > doesn't make much sense to me.
> > 
> > If there are multiple threads with the same priority then
> > batching the work up
> > into chunks allows the scheduler to round-robin among
> > them.  However, when a
> > task requeues itself that doesn't actually work since the
> > taskqueue thread
> > will see the requeued task before it yields the CPU. 
> > Alternatively, if you
> > force all the relevant interrupt handlers to use the same
> > thread pool and
> > instead of requeueing a separate task you requeue your
> > handler in the ithread
> > pool then you can get the desired round-robin
> > behavior.  (I have changes to
> > the ithread stuff that get us part of the way there in that
> > handlers can
> > reschedule themselves and much of the plumbing is in place
> > for shared thread
> > pools among different interrupts.)
> 
> I don't see any "round robin" effect here. You have:
> 
> Repeat:
> - Process 100 frames
> if (more)
> - Queue a Task
> 
> There's only one task at a time. All it's really doing is yielding and
> rescheduling itself to resume the loop.

As I said above, with the current e1000 drivers, which use private 
taskqueues whose thread priority matches the ithread priority, the round-
robin doesn't really work: when the task reschedules itself, the taskqueue 
thread sees the new task and goes to run it immediately instead of 
yielding.
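
The loop in question is roughly the following.  Again a sketch, not the
literal lem_handle_rxtx(); the boolean return from the RX routine and
its signature are simplified assumptions, though adapter->tq,
adapter->rxtx_task, and adapter->rx_process_limit do exist in the
driver:

static void
lem_handle_rxtx_sketch(void *context, int pending)
{
	struct adapter *adapter = context;
	bool more;

	/* Process at most a fixed batch of frames. */
	more = lem_rxeof_sketch(adapter, adapter->rx_process_limit);

	if (more) {
		/*
		 * Requeue ourselves.  The private taskqueue thread
		 * checks its queue before sleeping, finds this task,
		 * and runs it right away; it never yields the CPU,
		 * so no round-robin with same-priority threads
		 * happens.
		 */
		taskqueue_enqueue(adapter->tq, &adapter->rxtx_task);
	} else {
		/* Done for now; re-enable the interrupt. */
		E1000_WRITE_REG(&adapter->hw, E1000_IMS,
		    IMS_ENABLE_MASK);
	}
}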

However, I did describe an alternate setup where you can fix this.  Part of 
the key is to get various NICs to share a single logical queue of tasks.  You 
could simulate this now by having all the deferred tasks share a single 
taskqueue with a pool of tasks, but that will still not fully cooperate with 
ithreads.  To do that you have to get the interrupt handlers themselves into 
the shared taskqueue.  Some changes I have in a p4 branch allow you to do that 
by letting interrupt handlers reschedule themselves (avoiding the need for a 
separate task and preventing the task from running concurrently with the 
interrupt handler) and providing some (but not yet all) of the framework to 
allow multiple devices to share a single work queue backed by a shared pool of 
threads.
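
The "simulate it now" half needs nothing beyond stock taskqueue(9):
create one queue backed by a small pool of threads at network ithread
priority and point every NIC's deferred task at it.  A sketch of that
workaround (the queue name and thread count are arbitrary; this is not
the p4 ithread work):

#include <sys/param.h>
#include <sys/kernel.h>
#include <sys/malloc.h>
#include <sys/priority.h>
#include <sys/taskqueue.h>

static struct taskqueue *net_shared_tq;

static void
net_shared_tq_init(void *arg)
{
	/*
	 * One logical queue, several threads: deferred RX/TX tasks
	 * from different NICs now interleave in batches instead of
	 * each driver monopolizing its own private taskqueue thread.
	 */
	net_shared_tq = taskqueue_create("net_shared", M_WAITOK,
	    taskqueue_thread_enqueue, &net_shared_tq);
	taskqueue_start_threads(&net_shared_tq, 2, PI_NET,
	    "net shared taskq");
}

Each driver would then taskqueue_enqueue() its task on net_shared_tq
instead of a private queue; as noted above, that still doesn't
round-robin with the ithreads themselves, which is what the p4 branch
is meant to address.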

-- 
John Baldwin

