Interrupt routine usage not shown by top in 8.0
Barney Cordoba
barney_cordoba at yahoo.com
Fri Mar 13 07:34:47 PDT 2009
--- On Thu, 3/12/09, Scott Long <scottl at samsco.org> wrote:
> From: Scott Long <scottl at samsco.org>
> Subject: Re: Interrupt routine usage not shown by top in 8.0
> To: barney_cordoba at yahoo.com
> Cc: current at freebsd.org
> Date: Thursday, March 12, 2009, 8:35 PM
> Barney Cordoba wrote:
> >
> >
> >
> > --- On Thu, 3/12/09, Scott Long <scottl at samsco.org> wrote:
> >
> >> From: Scott Long <scottl at samsco.org>
> >> Subject: Re: Interrupt routine usage not shown by top in 8.0
> >> To: barney_cordoba at yahoo.com
> >> Cc: current at freebsd.org
> >> Date: Thursday, March 12, 2009, 7:42 PM
> >> Barney Cordoba wrote:
> >>> I'm firing 400Kpps at a UDP blackhole port. I'm getting 6000
> >>> interrupts per second on em3:
> >>>
> >>> testbox# vmstat -i; sleep 1; vmstat -i
> >>> interrupt                          total       rate
> >>> irq1: atkbd0                           1          0
> >>> irq6: fdc0                             1          0
> >>> irq17: uhci1+                       2226          9
> >>> irq18: uhci2 ehci+                     9          0
> >>> cpu0: timer                       470507       1993
> >>> irq256: em0                          665          2
> >>> irq259: em3                      1027684       4354
> >>> cpu1: timer                       470272       1992
> >>> cpu3: timer                       470273       1992
> >>> cpu2: timer                       470273       1992
> >>> Total                            2911911      12338
> >>>
> >>> interrupt                          total       rate
> >>> irq1: atkbd0                           1          0
> >>> irq6: fdc0                             1          0
> >>> irq17: uhci1+                       2226          9
> >>> irq18: uhci2 ehci+                     9          0
> >>> cpu0: timer                       472513       1993
> >>> irq256: em0                          668          2
> >>> irq259: em3                      1033703       4361
> >>> cpu1: timer                       472278       1992
> >>> cpu3: timer                       472279       1992
> >>> cpu2: timer                       472279       1992
> >>> Total                            2925957      12345
> >>>
> >>>
> >>> top -SH shows:
> >>>
> >>> PID STATE C TIME CPU COMMAND
> >>> 10 CPU3 3 7:32 100.00% idle
> >>> 10 CPU2 2 7:32 100.00% idle
> >>> 10 RUN 0 7:31 100.00% idle
> >>> 10 CPU1 1 7:31 100.00% idle
> >>>
> >>> This implies that CPU usage is substantially under-reported in
> >>> general by the system. Note that I've modified em_irq_fast() to
> >>> call em_handle_rxtx() directly rather than scheduling a task, to
> >>> illustrate the problem.
> >>>
> >> With unmodified code, what do you see? Are you sending valid UDP
> >> frames with valid checksums and a valid port, or is everything that
> >> you're blasting at the interface getting dropped right away? Calling
> >> em_handle_rxtx() directly will cause a very quick panic once you
> >> start handling real traffic and you encounter a lock.
> >>
> >> Scott
> >
> > I think you're mistaken. I'm also accessing the system via an em port
> > (and running top), and em_handle_rxtx() is self-contained lock-wise.
> > The taskqueue doesn't obtain a lock before calling the routine.
> >
>
> I understand perfectly how the code works, as I wrote it. While there
> are no locks in the RX path of the driver, there are certainly locks
> higher up in the network stack RX path. You're not going to hit them
> in your test, but in the real world you will.
>
> > As I mentioned, they're being dumped into a UDP blackhole, which
> > implies that I have udp.blackhole set and the port is unused. I can
> > see the packets hit the UDP socket, so it's working as expected:
> >
> > 853967872 dropped due to no socket
> >
> > With unmodified code, the taskq thread shows 25% usage or so.
> >
> > I'm not sure what the point of your criticism is for what clearly is
> > a test. Are you implying that the system can receive 400K pps with
> > 6000 ints/sec and record 0% usage because of a coding imperfection?
> > Or are you implying that the 25% usage is all due to launching tasks
> > unnecessarily and process switching?
>
> Prior to FreeBSD 5, interrupt processing time was counted in the %intr
> stat. With FreeBSD 5 and beyond, most interrupts moved to full
> processing contexts called ithreads, and the processing time spent in
> the ithread was counted in the %intr stat. The time spent in low-level
> interrupts was merely counted against the process that got interrupted.
> This wasn't a big deal because low-level interrupts were only used to
> launch ithreads and to process low-latency interrupts for a few
> drivers. Moving to the taskq model breaks this accounting model.
>
> What's happening in your test is that the system is almost completely
> idle, so the only thing that is being interrupted by the low-level
> if_em handler is the cpu idle thread. Since you're also bogusly
> bypassing the deferral to the taskq, all stack processing is also
> happening in this low-level context, and it's being counted against
> the CPU idle thread. However, the process accounting code knows not to
> charge idle thread time against the normal stats, because doing so
> would result in the system always showing 100% busy. So your test is
> exploiting this; you're stealing all of your cycles from the idle
> threads, and they aren't being accounted for because it's hard to know
> when the idle thread is having its cycles stolen.
>
> So no, 25% of a CPU isn't going to "launching tasks unnecessarily and
> process switching." It's going to processing 400k packets/sec off of
> the RX ring and up the stack to the UDP layer. I think that if you
> studied how the code worked, and devised more useful benchmarks, you'd
> see that the taskq deferral method is usually a significant gain in
> performance over polling or simple ithreads. There is certainly room
> for more improvement, and my taskq scheme isn't the only way to get
> good performance, but it does work fairly well.
It's difficult to have "better benchmarks" when the system being tested
doesn't have accounting that works. My test is designed to isolate the
driver receive function in a controlled way, so it doesn't much matter
whether the data is real or not, as long as the tests generate a
consistent load.
The only thing obviously "bogus" is that FreeBSD is launching 16,000
tasks per second (an interrupt plus a taskqueue task, 8,000 times per
second each), plus 2,000 timer interrupts, and reporting 0% CPU usage.
So am I to assume that the system will never show 100% usage, since the
entire overhead of the scheduler is not accounted for?
Calling em_handle_rxtx() directly was a time-saver to determine the
overhead of forcing 8,000 context switches per second (16,000 in a router
setup) for apparently no reason. Since the OS doesn't account for these,
there seems to be no way to make this determination. It's convenient to
say something works well, or better than something else, when there is no
way to actually find out via measurement. I don't see how launching 8,000
tasks per second could be faster than not launching 8,000 tasks per
second, but I'm also not up on the newest math.
Since you know how things work better than any of us regular programmers,
if you could please answer these questions it would save a lot of time
and may result in better drivers for us all:
1) MSIX interrupt routines readily do "work" and pass packets up the IP
stack, while you claim that MSI interrupts cannot? Please explain the
locking differences between MSI and MSIX, and what locks may be
encountered by an MSI interrupt routine with "real traffic" that will
not be a problem for MSIX or taskqueue-launched tasks. It's certainly
not obvious from any code or docs that I've seen.
2) The bge and fxp (and many other) drivers happily pass traffic up the
IP stack directly from their interrupt routines, so why is it bogus for
em to do so? And why do these drivers not use the taskqueue approach
that you claim is superior?
2b) Does this also imply that systems with bge, or other network drivers
that do the "work" in the interrupt handler, will yield completely bogus
CPU usage numbers?
3) The em driver drops packets well before 100% cpu usage is realized.
Of course I'm relying on wrong cpu usage stats, so I may be mistaken.
Is there a way (or what is the preferred way) to increase the priority
of a task relative to other system processes (rather than relative to
tasks in the queue) so that packets can avoid being dropped while the
system runs other, non-essential tasks?
3b) Is there a way to lock down a task such as a NIC receive task to
give absolute priority or exclusive use of a cpu? The goal is to make
certain that the task doesn't yield before it completes some minimum
amount of work.
The reason I'm doing what you consider "bogus" is to get a handle on
various overheads, cache trade-offs of spreading across CPUs, etc. So
please don't berate me too badly for being a crappy programmer, as I
actually do know what I'm doing. One problem is the lack of documentation,
so much of the learning has to be done by trial and error. If there's
a document on the 8.0 scheduler, I'm sure many of us would like to
see it. In my world, working "fairly well" isn't good enough, and I don't
take anyone's word that something is better if they can't demonstrate it
with actual numbers that prove it, particularly when the claim defies
logic. Most people do benchmarks completely wrong. A driver's efficiency
is measured by how much of the CPU it uses to complete a particular
workload. A good driver will happily trade off some per-connection
latency for a 20% increase in overall efficiency, or at least make it
tunable for various environments.
It's my view that it would be better to just suck packets out of the ring
and queue them for upper layers, but I don't yet have a handle on the
trade-offs. Currently the system drops too many packets unnecessarily at
extremely high load.
BTW, I ran netperf and a fetch loop overnight ("real" data) routed by a
machine with the "bogus" em setup without encountering any panics or
data loss.
Barney