Memory allocation performance
brde at optusnet.com.au
Fri Feb 1 14:53:37 PST 2008
On Fri, 1 Feb 2008, Alexander Motin wrote:
> Robert Watson wrote:
>> It would be very helpful if you could try doing some analysis with hwpmc --
>> "high resolution profiling" is of increasingly limited utility with modern
You mean "of increasingly greater utility with modern CPUs". Low resolution
kernel profiling stopped giving enough resolution in about 1990, and has
become of increasingly limited utility since then, but high resolution
kernel profiling uses the TSC or possibly a performance counter so it
has kept up with CPU speedups. Cache effects and out of order execution
are larger now, but they affect all types of profiling and are still not too
bad with high resolution kernel profiling. High resolution kernel profiling
doesn't really work with SMP, but that is not Alexander's problem since he
profiled under UP.
>> CPUs, where even a high frequency timer won't run very often. It's also
>> quite subject to cycle events that align with other timers in the system.
No, it isn't affected by either of these. The TSC timer is incremented on
every CPU cycle and the performance counters are incremented on every
event. It is completely unaffected by other timers.
> I have tried hwpmc but still not completely friendly with it. Whole picture
> is somewhat alike to kgmon's, but it looks very noisy. Is there some "know
> how" about how to use it better?
hwpmc doesn't work for me either. I can't see how it could work as well
as high resolution kernel profiling for events at the single function
level, since it is statistics-based. With the statistics clock interrupt
rate fairly limited, it just cannot get enough resolution over short runs.
Also, it works poorly for me (with a current kernel and ~5.2 userland
except for some utilities like pmc*). Generation of profiles stopped
working for me, and it often fails with allocation errors.
> I have tried it for measuring number of instructions. But I am in doubt that
> instructions is a correct counter for performance measurement as different
> instructions may have very different execution times depending on many
> reasons, like cache misses and current memory traffic.
Cycle counts are more useful, but high resolution kernel profiling can do
this too, with some fixes:
- update perfmon for newer CPUs. It is broken even for Athlons (takes a
2 line fix, or more lines with proper #defines and if()s).
- select the performance counter to be used for profiling using
sysctl machdep.cputime_clock=$((5 + N)) where N is the number of the
performance counter for the instruction count (or any). I use hwpmc
mainly to determine N :-). It may also be necessary to change the
kernel variable cpu_clock_pmc_conf. Configuration of this is unfinished.
- use high resolution kernel profiling normally. Note that switching to
a perfmon counter is only permitted if !SMP (since it is too unsupported
under SMP to do more than panic if permitted under SMP), and that the
switch loses the calibration of profiling. Profiling normally
compensates for overheads of the profiling itself, and the compensation
would work almost perfectly for event counters, unlike for time-related
counters, since the extra events for profiling aren't much affected by
> I have tried to use
> tsc to count CPU cycles, but got the error:
> # pmcstat -n 10000 -S "tsc" -O sample.out
> pmcstat: ERROR: Cannot allocate system-mode pmc with specification "tsc":
> Operation not supported
> What have I missed?
This might be just because the TSC really is not supported. Many things
require an APIC for hwpmc to support them.
I get allocation errors like this for operations that work a few times
> I am now using Pentium4 Prescott CPU with HTT enabled in BIOS, but kernel
> built without SMP to simplify profiling. What counters can you recommend me
> to use on it for regular time profiling?
Try them all :-). From userland first with an overall count, since looking
at the details in gprof output takes too long (and doesn't work for me with
hwpmc anyway). I use scripts like the following to try them all from
c="ttcp -n100000 -l5 -u -t epsplex"
while test $ctr -lt 256
ctr1=$(printf "0x%02x\n" $ctr)
case $ctr1 in
0x20) src=k8-ls-segment-register-load;; # XXX
0x24) src=k8-ls-locked-operation;; # XXX
0x42) src=kx-dc-refills-from-l2;; # XXX
0x43) src=kx-dc-refills-from-system;; # XXX
0x44) src=kx-dc-writebacks;; # XXX
0x7d) src=k8-bu-internal-l2-request;; # XXX
0x7e) src=k8-bu-fill-request-l2-miss;; # XXX
0x7f) src=k8-bu-fill-into-l2;; # XXX
0xe4) src=k8-nb-memory-controller-bypass-saturation;; # XXX
0xe5) src=k8-nb-sized-commands;; # XXX
0xec) src=k8-nb-probe-result;; # XXX
case $src in
k8-*) ctr=$(($ctr + 1)); continue;;
*unknown-*) ctr=$(($ctr + 1)); continue;;
echo "# s/$src "
perfmon -c "$c" -ou -l 1 $ctr |
egrep -v '(^total: |^mean: |^clocks \(at)' | sed -e 's/1: //'
ctr=$(($ctr + 1))
for i in \
pmcstat -s $i sleep 1 2>&1 | sed -e 's/^ *//' -e 's/ */ /' \
-e 's/ *$//' -e 's/\/00\/k8-/\/k8-/'
for i in \
pmcstat -s $i sleep 1 2>&1 |
sed -e 's/^ *//' -e 's/ */ /' -e 's/ *$//' -e 's/k7/kx/'
"runpm" tries up to all 256 performance counters, with names like the
hwpmc ones. k7 means AthlonXP only; k8 means Athlon64 only; kx means
both, but many kx's don't really work or are not documented for both.
A few like kx-fr-retired-near-returns-mispredicted (?) are not documented
for AXP but seem to work and are useful.
runpmc tries the documented A64 counters. runpmc7 tries the documented
AXP counters. hwpmc is less useful than perfmon here since it doesn't
support using the undocumented counters. There is a pmc* option that
prints all the counters in the above lists. I checked that they are
almost precisely the documented (in Athlon optimization manuals) ones.
Names are unfortunately inconsistent between k7 and k8 in some cases,
following inconsistencies in the documentation.
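
The "try them all" approach can be sketched for hwpmc as follows. This is
only a minimal sketch: the function name try_events and the PMCSTAT override
are made up for illustration, and the only pmcstat usage assumed is the
"pmcstat -s event command" form already used in the loops above.

```sh
#!/bin/sh
# Run one system-wide counting pass per event name, in the style of the
# pmcstat loops above.  PMCSTAT is a hypothetical override so that the
# commands can be previewed (PMCSTAT=echo) on a machine without hwpmc.
: "${PMCSTAT:=pmcstat}"

try_events() {
    for ev in "$@"; do
        echo "# $ev"
        # An event that cannot be allocated just reports an error;
        # note it and carry on with the next event.
        "$PMCSTAT" -s "$ev" sleep 1 2>&1 || echo "# $ev: failed"
    done
}

# Usage (event names as they appear in the case table above):
#   try_events k8-ls-locked-operation k8-nb-probe-result
```

Keeping the event list on the command line makes it easy to feed in whatever
list the pmc* utilities print for the CPU at hand.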
I don't know anything about Pentium counters except what is in source
gprof output for the mumble perfmon counter (kx-dc-misses?) while sending
100000 tiny packets using ttcp -t looks like this (after fixing the
granularity: each sample hit covers 16 byte(s) for 0.00% of 2.81 seconds
    % cumulative     self              self    total
 time    seconds  seconds    calls  us/call  us/call  name
 11.0      0.308    0.308   100083        3       24  syscall
 10.8      0.613    0.305   200012        2        2  rn_match
  4.4      0.738    0.125   100019        1        1  _bus_dmamap_load_buffer
  4.3      0.859    0.121   300107        0        0  generic_copyin
  4.0      0.973    0.114   100006        1        9  ip_output
  3.8      1.079    0.106   100006        1        4  ether_output
  3.7      1.182    0.103   100007        1        1  fgetsock
  3.6      1.284    0.102   100006        1        2  bus_dmamap_load_mbuf
  3.6      1.385    0.101   200012        1        3  rtalloc_ign
  3.6      1.486    0.101   100083        1       25  Xint0x80_syscall
  3.6      1.587    0.101   200012        1        1  in_clsroute
  3.6      1.688    0.101   100006        1       20  sendto
  3.6      1.789    0.101   100008        1        1  in_pcblookup_hash
  3.6      1.890    0.101   100006        1       16  kern_sendit
  3.6      1.990    0.100   200012        1        2  in_matroute
  3.6      2.091    0.100   100748        1        1  doreti
  3.6      2.191    0.100   100007        1        2  getsockaddr
I would like to be able to do this with hwpmc but don't see how it can.
Only (non-statistical) counting at every function call and return can
give precise counts like the above. However, statistical counting
is better for some things.
Back to the original problem. Uma allocation overhead shows up in TSC
profiles of ttcp, but is just one of too many things that take a while.
There are about function calls, each taking about 1%:
% granularity: each sample hit covers 16 byte(s) for 0.00% of 0.86 seconds
%     % cumulative     self              self    total
%  time    seconds  seconds    calls  ns/call  ns/call  name
%  44.9      0.388    0.388        0  100.00%           mcount
%  20.9      0.569    0.180        0  100.00%           mexitcount
%   8.0      0.638    0.069        0  100.00%           cputime
%   1.8      0.654    0.016        0  100.00%           user
%   1.6      0.668    0.014   100006      143     1051  udp_output
%   1.5      0.681    0.013   100006      133      704  ip_output
%   1.3      0.692    0.011   300120       37       37  copyin
%   1.2      0.702    0.010   100006      100     1360  sosend_dgram
%   0.9      0.710    0.008   200012       39       39  rn_match
%   0.9      0.718    0.007   300034       25       25  memcpy
%   0.8      0.725    0.007   200103       36       58  uma_zalloc_arg
%   0.8      0.732    0.007   100090       68     1977  syscall
All the times seem reasonable. Without profiling, sendto() and overheads
take about 1700 nsec in -current and about 1600 nsec in my version
of 5.2. (This is for -current. The 100 nsec difference is very hard
to understand in detail.) With high resolution kernel profiling, sendto()
and overheads take about 8600 nsec here. Profiling has subtracted its
own overheads and the result of 1977 nsec for syscall is consistent with
syscall() taking a bit less than 1700 nsec when not looked at. (Profiling
only subtracts its best-case overheads. Its runtime overheads must be
larger due to cache effects, and if these are very large then we cannot
trust the compensation. Since it compensated from 8600 down to about 1977,
it has clearly done the compensation almost right. The compensation is
delicate when there are a lot of functions taking ~20 nsec since the profiling
overhead per function call is 82 nsec.)
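
The compensation figures can be cross-checked with a little arithmetic. The
sketch below is only an illustration: it assumes that the whole gap between
the profiled and compensated times is profiling overhead charged at the
best-case 82 nsec per call, and the variable names are made up.

```sh
#!/bin/sh
# Figures quoted above, all in nsec.
profiled=8600      # sendto() and overheads with profiling enabled
compensated=1977   # total reported for syscall() after compensation
per_call=82        # best-case profiling overhead per function call

# Overhead that profiling removed from one sendto().
subtracted=$((profiled - compensated))

# Implied number of profiled function calls per sendto(), ASSUMING every
# call costs exactly the best-case overhead.  Real overheads are larger
# (cache effects), so this is a rough estimate only.
implied_calls=$((subtracted / per_call))

echo "subtracted: $subtracted nsec"
echo "implied calls per sendto(): $implied_calls"
```

With ~20 nsec functions each carrying 82 nsec of overhead, an error of a few
percent in the per-call figure is comparable to the time being measured, which
is why the compensation is delicate.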
%   0.8      0.738    0.007   200012       33       84  rtalloc1
%   0.8      0.745    0.006   100006       65      139  bge_encap
%   0.7      0.751    0.006   100006       62      201  bge_start_locked
%   0.6      0.757    0.006   200075       28       28  bzero
%   0.6      0.761    0.005   100006       48      467  ether_output
%   0.6      0.766    0.005   100006       48      192  m_uiotombuf
%   0.5      0.771    0.005   200100       23       45  uma_zfree_arg
%   0.5      0.775    0.005   100006       46       46  bus_dmamap_load_mbuf_sg
%   0.5      0.780    0.004   100028       45      132  malloc
%   0.5      0.784    0.004   200012       20      104  rtalloc_ign
This is hard to optimize.
uma has shown up as taking 58 nsec for uma_zalloc_arg() (including what it
calls) and 45 nsec for uma_zfree_arg(). This is on a 2.2GHz A64. Anything
larger than that might be a profiling error. But these allocations here are
tiny -- maybe large allocations cause cache misses.
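
To put those times in terms of CPU cycles, here is a small conversion sketch.
The helper name ns_to_cycles is made up; the only inputs are the 58 and 45
nsec figures and the 2.2GHz clock quoted above.

```sh
#!/bin/sh
# Convert nsec to CPU cycles for a clock given in MHz, using integer
# arithmetic only (POSIX sh has no floating point), so results truncate.
ns_to_cycles() {
    # $1 = nsec, $2 = clock in MHz; cycles = nsec * MHz / 1000
    echo $(( $1 * $2 / 1000 ))
}

echo "uma_zalloc_arg: $(ns_to_cycles 58 2200) cycles"
echo "uma_zfree_arg:  $(ns_to_cycles 45 2200) cycles"
```

That is on the order of a hundred cycles per allocation or free, including
callees.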
I was able to optimize away most of the allocation overheads in sendto()
by allocating the sockaddr on the stack, but this made little difference
overall. (It reduces dynamic allocations per packet from 2 to 1. Both
allocations use malloc() so they are a bit slower than pure uma. BTW,
switching from mbuf-based allocation to malloc() in getsockaddr() etc.
long ago cost 10 usec on a P1/133. A loss of 10000 nsec makes the overhead
of 200 nsec for malloc now look tiny.)
Remember I said that differences of 100 nsec are hard to understand?
It is also not easy to understand why eliminating potentially 100 nsec
of malloc() overhead makes almost no difference overall. The 100 nsec
gets distributed differently, or maybe the profiling really was wrong
for the malloc() part.
Reads of the TSC are executed possibly-out-of-order on some CPUs. This
doesn't seem to have much effect on the accuracy of high resolution
(TSC) kernel profiling on at least Athlons. rdtsc takes only 12 cycles
on AXP-A64. I think it takes much longer on Pentiums. On Phenom it
takes ~42 cycles (pessimized to share it across CPUs :-(). With it
taking much longer than the functions that it profiles, the compensation
might become too fragile.
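
A rough way to see the fragility is to compare the cost of one counter read
with the run time of a short function. In the sketch below the helper name
read_cost_percent is made up, and the 2.2GHz clock is assumed for both cases
purely for illustration (the Phenom's clock is not given above).

```sh
#!/bin/sh
# Cost of one counter read as a percentage of a function's run time.
# $1 = read cost in cycles, $2 = function time in nsec, $3 = clock in MHz
read_cost_percent() {
    func_cycles=$(( $2 * $3 / 1000 ))
    echo $(( $1 * 100 / func_cycles ))
}

# ~20 nsec functions on an assumed 2.2GHz clock (44 cycles per function).
echo "12-cycle rdtsc: $(read_cost_percent 12 20 2200)% of the function"
echo "42-cycle rdtsc: $(read_cost_percent 42 20 2200)% of the function"
```

Under these assumptions a single 42-cycle read costs about as much as the
whole function being timed, so small errors in the subtracted overhead become
large relative errors in the result.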