profiling kernel modules.

Mon Dec 14 19:26:35 UTC 2009

I find that the best way to profile the kernel is with pmc.  You don't
need to compile anything with a special option (other than including
the hwpmc hooks in the kernel with the HWPMC_HOOKS option), so you can
use it at any time on the same code you'll be shipping.  pmc does
statistical profiling; it uses whatever performance monitoring
counters are provided by the hardware.  It has a pretty low overhead,
especially compared with other profiling techniques.  It's really easy
to use, too:

1) If hwpmc is not compiled into your kernel, kldload hwpmc.
2) Run pmcstat to begin taking samples (make sure that whatever you are
profiling is busy doing work first!):

pmcstat -S unhalted-cycles -O /tmp/samples.out

The -S option specifies what event you want to use to trigger
sampling.  The unhalted-cycles event is the best one to use if your
hardware supports it; pmc will take a sample every 64K non-idle CPU
cycles, which is basically equivalent to sampling based on time.  If
the unhalted-cycles event is not supported by your hardware then the
instructions event will probably be the next best choice (although it's
nowhere near as good, as it will not be able to tell you, for example,
if a particular function is very expensive because it takes a lot of
cache misses compared to the rest of your program).  One caveat with
the unhalted-cycles event is that time spent spinning on a spinlock or
adaptively spinning on an MTX_DEF mutex will not be counted by this
event, because most of the spinning time is spent executing a pause
instruction that idles the CPU for a short period of time.
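
To get a feel for what "a sample every 64K cycles" means in practice,
here is a rough back-of-the-envelope sketch.  The 2.4 GHz clock speed is
a made-up figure, the function name expected_samples is mine and not
part of pmc, and it assumes a single CPU that is busy the whole time, so
treat the result as an upper bound rather than a prediction.

```shell
#!/bin/sh
# Rough upper bound on the samples collected from one fully busy CPU.
# Arguments: clock frequency in Hz, duration in seconds, and
# (optionally) cycles per sample; 65536 is the "64K" from the text.
expected_samples() {
    hz=$1
    seconds=$2
    cycles_per_sample=${3:-65536}
    # Integer division; needs 64-bit shell arithmetic for large inputs.
    echo $(( hz * seconds / cycles_per_sample ))
}

# Hypothetical 2.4 GHz CPU, busy for 10 seconds.
expected_samples 2400000000 10
```

Under those assumptions a short run on a busy machine already yields
several hundred thousand samples per CPU.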

Modern Intel and AMD CPUs offer a dizzying array of events.  They're
mostly only useful if you suspect that a particular kind of event is
hurting your performance and you would like to know what is causing
those events.  For example, if you suspect that data cache misses are
causing you problems you can take samples on cache misses.
Unfortunately, on some of the newer CPUs (namely the Core2 family,
because that's what I'm doing most of my profiling on nowadays) I find
it difficult to figure out just what event to use to profile based on
cache misses.  man pmc will give you an overview of pmc, and there are
manpages for every CPU family supported (e.g. man pmc.core2).

3) After you've run pmcstat for "long enough" (a proper definition of
long enough requires a statistician, which I most certainly am not,
but I find that for a busy system 10 seconds is enough), Control-C it
to stop it*.  You can use pmcstat to post-process the samples into
human-readable text:

pmcstat -R /tmp/samples.out -G /tmp/graph.txt

The graph.txt file will show leaf functions on the left and their
callers beneath them, indented to reflect the callchain.  It's not too
easy to describe and I don't have sample output available right now.

Another interesting tool for post-processing the samples is
pmcannotate.  I've never actually used the tool before but it will
annotate the program's source to show which lines are the most
expensive.  This of course needs unstripped modules to work.  I think
that it will also work if the GNU "debug link" is in the stripped
module pointing to the location of the file with symbols.

* Here's a tip I picked up from Joseph Koshy's blog: to collect
samples for a fixed period of time (say 1 minute), have pmcstat run the
sleep command:

pmcstat -S unhalted-cycles -O /tmp/samples.out sleep 60
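
If you do this often, the collection and post-processing steps can be
strung together.  This is only a sketch: the function name profile_for
and the PMCSTAT variable are my own inventions, and it uses nothing
beyond the two pmcstat invocations already shown.  Setting PMCSTAT=echo
gives a dry run that just prints the commands.

```shell
#!/bin/sh
# Sketch: sample for a fixed number of seconds, then write the callgraph.
# PMCSTAT is a hypothetical override hook; it defaults to the real tool.
: "${PMCSTAT:=pmcstat}"

# Arguments: seconds to sample, samples file, callgraph output file.
profile_for() {
    pf_seconds=$1
    pf_samples=$2
    pf_graph=$3
    # Collect samples while "sleep" runs; stop here if collection fails.
    "$PMCSTAT" -S unhalted-cycles -O "$pf_samples" sleep "$pf_seconds" || return 1
    # Post-process the samples into the human-readable callgraph.
    "$PMCSTAT" -R "$pf_samples" -G "$pf_graph"
}
```

For example, profile_for 60 /tmp/samples.out /tmp/graph.txt would take a
minute of samples and leave the callgraph in /tmp/graph.txt.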