dev.bce.X.com_no_buffers increasing and packet loss

Tue Mar 9 22:30:54 UTC 2010

> What's the traffic look like?  Jumbo, standard, short frames?  Any
> good ideas on profiling the code?  I haven't figured out how to use
> the CPU TSC but there is a free running timer on the device that
> might be usable to calculate where the driver's time is spent.
>
> Dave

In my experience hwpmc is the best and easiest way to profile anything
on FreeBSD.  Here's something I sent to a different thread a couple of
months ago explaining how to use it:

1) If device hwpmc is not compiled into your kernel, kldload hwpmc(you
will need the HWPMC_HOOKS option in either case)
2) Run pmcstat to begin taking samples(make sure that whatever you are
profiling is busy doing work first!):

pmcstat -S unhalted-cycles -O /tmp/samples.out

The -S option specifies what event you want to use to trigger
sampling.  The unhalted-cycles is the best event to use if your
hardware supports it; pmc will take a sample every 64K non-idle CPU
cycles, which is basically equivalent to sampling based on time.  If
the unhalted-cycles event is not supported by your hardware then the
instructions event will probably be the next best choice(although it's
nowhere near as good, as it will not be able to tell you, for example,
if a particular function is very expensive because it takes a lot of
cache misses compared to the rest of your program).  One caveat with
the unhalted-cycles event is that time spent spinning on a spinlock or
adaptively spinning on a MTX_DEF mutex will not be counted by this
event, because most of the spinning time is spent executing an hlt
instruction that idles the CPU for a short period of time.

Modern Intel and AMD CPUs offer a dizzying array of events.  They're
mostly only useful if you suspect that a particular kind of event is
hurting your performance and you would like to know what is causing
those events.  For example, if you suspect that data cache misses are
causing you problems you can take samples on cache misses.
Unfortunately on some of the newer CPUs(namely the Core2 family,
because that's what I'm doing most of my profiling on nowadays) I find
it difficult to figure out just what event to use to profile based on
cache misses.  man pmc will give you an overview of pmc, and there are
manpages for every CPU family supported(eg man pmc.core2)

3) After you've run pmcstat for "long enough"(a proper definition of
long enough requires a statistician, which I most certainly am not,
but I find that for a busy system 10 seconds is enough), Control-C it
to stop it*.  You can use pmcstat to post-process the samples into
human-readable text:

pmcstat -R /tmp/samples.out -G /tmp/graph.txt

The graph.txt file will show leaf functions on the left and their
callers beneath them, indented to reflect the callchain.  It's not too
easy to describe and I don't have sample output available right now.

Another interesting tool for post-processing the samples is
pmcannotate.  I've never actually used the tool before but it will
annotate the program's source to show which lines are the most
expensive.  This of course needs unstripped modules to work.  I think
that it will also work if the GNU "debug link" is in the stripped
module pointing to the location of the file with symbols.

* Here's a tip I picked up from Joseph Koshy's blog: to collect
samples for a fixed period of time(say 1 minute), have pmcstat run the
sleep command:

pmcstat -S unhalted-cycles -O /tmp/samples.out sleep 60