Memory allocation performance

Fri Feb 1 14:53:37 PST 2008

On Fri, 1 Feb 2008, Alexander Motin wrote:

> Robert Watson wrote:
>> It would be very helpful if you could try doing some analysis with hwpmc -- 
>> "high resolution profiling" is of increasingly limited utility with modern

You mean "of increasingly greater utility with modern CPUs".  Low resolution
kernel profiling stopped giving enough resolution in about 1990, and has
become of increasingly limited utility since then, but high resolution
kernel profiling uses the TSC or possibly a performance counter so it
has kept up with CPU speedups.  Cache effects and out of order execution
are larger now, but they affect all types of profiling and still not too
bad with high resulotion kernel profiling.  High resolution kernel profiling
doesn't really work with SMP, but that is not Alexander's problem since he
profiled under UP.

>> CPUs, where even a high frequency timer won't run very often.  It's also 
>> quite subject to cycle events that align with other timers in the system.

No, it isn't affected by either of these.  The TSC timer is incremented on
every CPU cycle and the performance counters run are incremented on every
event.  It is completely unaffected by other timers.

> I have tried hwpmc but still not completely friendly with it. Whole picture 
> is somewhat alike to kgmon's, but it looks very noisy. Is there some "know 
> how" about how to use it better?

hwpmc doesn't work for me either.  I can't see how it could work as well
as high resolution kernel profiling for events at the single function
level, since it is statistics-based.  With the statistics clock interrupt
rate fairly limited, it just cannot get enough resolution over short runs.
Also, it works poorly for me (with a current kernel and ~5.2 userland
except for some utilities like pmc*).  Generation of profiles stopped
working for me, and it often fails with allocation errors.

> I have tried it for measuring number of instructions. But I am in doubt that 
> instructions is a correct counter for performance measurement as different 
> instructions may have very different execution times depending on many 
> reasons, like cache misses and current memory traffic.

Cycle counts are more useful, but high resolution kernel profiling can do
this too, with some fixes:
- update perfmon for newer CPUs.  It is broken even for Athlons (takes a
   2 line fix, or more lines with proper #defines and if()s).
- select the performance counter to be used for profiling using
   sysctl machdep.cputime_clock=$((5 + N)) where N is the number of the
   performance counter for the instruction count (or any).  I use hwpmc
   mainly to determine N :-).  It may also be necessary to change the
   kernel variable cpu_clock_pmc_conf.  Configuration of this is unfinished.
- use high resolution kernel profiling normally.  Note that switching to
   a perfmon counter is only permitted of !SMP (since it is too unsupported
   under SMP to do more than panic if permitted under SMP), and that the
   switch loses the calibration of profiling.  Profiling normally
   compensates for overheads of the profiling itself, and the compensation
   would work almoost perfectly for event counters, unlike for time-related
   counters, since the extra events for profiling aren't much affected by
   caches.

> I have tried to use 
> tsc to count CPU cycles, but got the error:
> # pmcstat -n 10000 -S "tsc" -O sample.out
> pmcstat: ERROR: Cannot allocate system-mode pmc with specification "tsc": 
> Operation not supported
> What have I missed?

This might be just because the TSC really is not supported.  Many things
require an APIC for hwpmc to support them.

I get errors allocation like this for operations that work a few times
before failing.

> I am now using Pentium4 Prescott CPU with HTT enabled in BIOS, but kernel 
> built without SMP to simplify profiling. What counters can you recommend me 
> to use on it for regular time profiling?

Try them all :-).  From userland first with an overall count, since looking
at the details in gprof output takes too long (and doesn't work for me with
hwpmc anyway).  I use scripts like the following to try them all from
userland:

runpm:
%%%
c="ttcp -n100000 -l5 -u -t epsplex"

ctr=0
while test $ctr -lt 256
do
 	ctr1=$(printf "0x%02x\n" $ctr)
 	case $ctr1 in
 	0x00)	src=k8-fp-dispatched-fpu-ops;;
 	0x01)	src=k8-fp-cycles-with-no-fpu-ops-retired;;
 	0x02)	src=k8-fp-dispatched-fpu-fast-flag-ops;;
 	0x05)	src=k8-fp-unknown-$ctr1;;
 	0x09)	src=k8-fp-unknown-$ctr1;;
 	0x0d)	src=k8-fp-unknown-$ctr1;;
 	0x11)	src=k8-fp-unknown-$ctr1;;
 	0x15)	src=k8-fp-unknown-$ctr1;;
 	0x19)	src=k8-fp-unknown-$ctr1;;
 	0x1d)	src=k8-fp-unknown-$ctr1;;
 	0x20)	src=k8-ls-segment-register-load;;	# XXX
 	0x21)	src=kx-ls-microarchitectural-resync-by-self-mod-code;;
 	0x22)	src=k8-ls-microarchitectural-resync-by-snoop;;
 	0x23)	src=kx-ls-buffer2-full;;
 	0x24)	src=k8-ls-locked-operation;;		# XXX
 	0x25)	src=k8-ls-microarchitectural-late-cancel;;
 	0x26)	src=kx-ls-retired-cflush-instructions;;
 	0x27)	src=kx-ls-retired-cpuid-instructions;;
 	0x2a)	src=kx-ls-unknown-$ctr1;;
 	0x2b)	src=kx-ls-unknown-$ctr1;;
 	0x2e)	src=k7-ls-unknown-$ctr1;;
 	0x2f)	src=k7-ls-unknown-$ctr1;;
 	0x32)	src=kx-ls-unknown-$ctr1;;
 	0x33)	src=kx-ls-unknown-$ctr1;;
 	0x36)	src=k7-ls-unknown-$ctr1;;
 	0x37)	src=k7-ls-unknown-$ctr1;;
 	0x3a)	src=kx-ls-unknown-$ctr1;;
 	0x3b)	src=kx-ls-unknown-$ctr1;;
 	0x3e)	src=k7-ls-unknown-$ctr1;;
 	0x3f)	src=k7-ls-unknown-$ctr1;;
 	0x40)	src=kx-dc-accesses;;
 	0x41)	src=kx-dc-misses;;
 	0x42)	src=kx-dc-refills-from-l2;;		# XXX
 	0x43)	src=kx-dc-refills-from-system;;		# XXX
 	0x44)	src=kx-dc-writebacks;;			# XXX
 	0x45)	src=kx-dc-l1-dtlb-miss-and-l2-dtlb-hits;;
 	0x46)	src=kx-dc-l1-and-l2-dtlb-misses;;
 	0x47)	src=kx-dc-misaligned-references;;
 	0x48)	src=kx-dc-microarchitectural-late-cancel-of-an-access;;
 	0x49)	src=kx-dc-microarchitectural-early-cancel-of-an-access;;
 	0x4a)	src=k8-dc-one-bit-ecc-error;;
 	0x4b)	src=k8-dc-dispatched-prefetch-instructions;;
 	0x4c)	src=k8-dc-dcache-accesses-by-locks;;
 	0x4d)	src=k7-dc-unknown-$ctr1;;
 	0x4e)	src=k7-dc-unknown-$ctr1;;
 	0x4f)	src=k7-dc-unknown-$ctr1;;
 	0x50)	src=kx-dc-unknown-$ctr1;;
 	0x51)	src=kx-dc-unknown-$ctr1;;
 	0x55)	src=kx-dc-unknown-$ctr1;;
 	0x56)	src=kx-dc-unknown-$ctr1;;
 	0x57)	src=kx-dc-unknown-$ctr1;;
 	0x58)	src=kx-dc-unknown-$ctr1;;
 	0x59)	src=kx-dc-unknown-$ctr1;;
 	0x5d)	src=k7-dc-unknown-$ctr1;;
 	0x5e)	src=k7-dc-unknown-$ctr1;;
 	0x5f)	src=k7-dc-unknown-$ctr1;;
 	0x64)	src=kx-bu-unknown-$ctr1;;
 	0x68)	src=kx-bu-unknown-$ctr1;;
 	0x69)	src=kx-bu-unknown-$ctr1;;
 	0x76)	src=kx-bu-cpu-clk-unhalted;;
 	0x79)	src=k8-bu-unknown-$ctr1;;
 	0x7d)	src=k8-bu-internal-l2-request;;		# XXX
 	0x7e)	src=k8-bu-fill-request-l2-miss;;	# XXX
 	0x7f)	src=k8-bu-fill-into-l2;;		# XXX
 	0x80)	src=kx-ic-fetches;;
 	0x81)	src=kx-ic-misses;;
 	0x82)	src=kx-ic-refill-from-l2;;
 	0x83)	src=kx-ic-refill-from-system;;
 	0x84)	src=kx-ic-l1-itlb-misses;;
 	0x85)	src=kx-ic-l1-l2-itlb-misses;;
 	0x86)	src=k8-ic-microarchitectural-resync-by-snoop;;
 	0x87)	src=kx-ic-instruction-fetch-stall;;
 	0x88)	src=kx-ic-return-stack-hit;;
 	0x89)	src=kx-ic-return-stack-overflow;;
 	0xc0)	src=kx-fr-retired-instructions;;
 	0xc1)	src=kx-fr-retired-ops;;
 	0xc2)	src=kx-fr-retired-branches;;
 	0xc3)	src=kx-fr-retired-branches-mispredicted;;
 	0xc4)	src=kx-fr-retired-taken-branches;;
 	0xc5)	src=kx-fr-retired-taken-branches-mispredicted;;
 	0xc6)	src=kx-fr-retired-far-control-transfers;;
 	0xc7)	src=kx-fr-retired-resync-branches;;
 	0xc8)	src=kx-fr-retired-near-returns;;
 	0xc9)	src=kx-fr-retired-near-returns-mispredicted;;
 	0xca)	src=kx-fr-retired-taken-branches-mispred-by-addr-miscompare;;
 	0xcb)	src=k8-fr-retired-fpu-instructions;;
 	0xcc)	src=k8-fr-retired-fastpath-double-op-instructions;;
 	0xcd)	src=kx-fr-interrupts-masked-cycles;;
 	0xce)	src=kx-fr-interrupts-masked-while-pending-cycles;;
 	0xcf)	src=kx-fr-hardware-interrupts;;
 	0xd0)	src=kx-fr-decoder-empty;;
 	0xd1)	src=kx-fr-dispatch-stalls;;
 	0xd2)	src=kx-fr-dispatch-stall-from-branch-abort-to-retire;;
 	0xd3)	src=kx-fr-dispatch-stall-for-serialization;;
 	0xd4)	src=kx-fr-dispatch-stall-for-segment-load;;
 	0xd5)	src=kx-fr-dispatch-stall-when-reorder-buffer-is-full;;
 	0xd6)	src=kx-fr-dispatch-stall-when-reservation-stations-are-full;;
 	0xd7)	src=kx-fr-dispatch-stall-when-fpu-is-full;;
 	0xd8)	src=kx-fr-dispatch-stall-when-ls-is-full;;
 	0xd9)	src=kx-fr-dispatch-stall-when-waiting-for-all-to-be-quiet;;
 	0xda)	src=kx-fr-dispatch-stall-when-far-xfer-or-resync-br-pending;;
 	0xdb)	src=k8-fr-fpu-exceptions;;
 	0xdc)	src=k8-fr-number-of-breakpoints-for-dr0;;
 	0xdd)	src=k8-fr-number-of-breakpoints-for-dr1;;
 	0xde)	src=k8-fr-number-of-breakpoints-for-dr2;;
 	0xdf)	src=k8-fr-number-of-breakpoints-for-dr3;;
 	0xe0)	src=k8-nb-memory-controller-page-access-event;;
 	0xe1)	src=k8-nb-memory-controller-page-table-overflow;;
 	0xe2)	src=k8-nb-memory-controller-dram-slots-missed;;
 	0xe3)	src=k8-nb-memory-controller-turnaround;;
 	0xe4)	src=k8-nb-memory-controller-bypass-saturation;;	# XXX
 	0xe5)	src=k8-nb-sized-commands;;			# XXX
 	0xec)	src=k8-nb-probe-result;;			# XXX
 	0xf6)	src=k8-nb-ht-bus0-bandwidth;;
 	0xf7)	src=k8-nb-ht-bus1-bandwidth;;
 	0xf8)	src=k8-nb-ht-bus2-bandwidth;;
 	0xfc)	src=k8-nb-unknown-$ctr1;;
 	*)	src=very-unknown-$ctr1;;
 	esac
 	case $src in
 	k8-*)		ctr=$(($ctr + 1)); continue;;
 	*unknown-*)	ctr=$(($ctr + 1)); continue;;
 	esac
 	echo "# s/$src "
 	perfmon -c "$c" -ou -l 1 $ctr |
 	    egrep -v '(^total: |^mean: |^clocks \(at)' | sed -e 's/1: //'
 	ctr=$(($ctr + 1))
done
%%%

runpmc:
%%%
for i in \
 	k8-fp-dispatched-fpu-ops \
 	k8-fp-cycles-with-no-fpu-ops-retired \
 	k8-fp-dispatched-fpu-fast-flag-ops \
 	k8-ls-segment-register-load \
 	k8-ls-microarchitectural-resync-by-self-modifying-code \
 	k8-ls-microarchitectural-resync-by-snoop \
 	k8-ls-buffer2-full \
 	k8-ls-locked-operation \
 	k8-ls-microarchitectural-late-cancel \
 	k8-ls-retired-cflush-instructions \
 	k8-ls-retired-cpuid-instructions \
 	k8-dc-access \
 	k8-dc-miss \
 	k8-dc-refill-from-l2 \
 	k8-dc-refill-from-system \
 	k8-dc-copyback \
 	k8-dc-l1-dtlb-miss-and-l2-dtlb-hit \
 	k8-dc-l1-dtlb-miss-and-l2-dtlb-miss \
 	k8-dc-misaligned-data-reference \
 	k8-dc-microarchitectural-late-cancel-of-an-access \
 	k8-dc-microarchitectural-early-cancel-of-an-access \
 	k8-dc-one-bit-ecc-error \
 	k8-dc-dispatched-prefetch-instructions \
 	k8-dc-dcache-accesses-by-locks \
 	k8-bu-cpu-clk-unhalted \
 	k8-bu-internal-l2-request \
 	k8-bu-fill-request-l2-miss \
 	k8-bu-fill-into-l2 \
 	k8-ic-fetch \
 	k8-ic-miss \
 	k8-ic-refill-from-l2 \
 	k8-ic-refill-from-system \
 	k8-ic-l1-itlb-miss-and-l2-itlb-hit \
 	k8-ic-l1-itlb-miss-and-l2-itlb-miss \
 	k8-ic-microarchitectural-resync-by-snoop \
 	k8-ic-instruction-fetch-stall \
 	k8-ic-return-stack-hit \
 	k8-ic-return-stack-overflow \
 	k8-fr-retired-x86-instructions \
 	k8-fr-retired-uops \
 	k8-fr-retired-branches \
 	k8-fr-retired-branches-mispredicted \
 	k8-fr-retired-taken-branches \
 	k8-fr-retired-taken-branches-mispredicted \
 	k8-fr-retired-far-control-transfers \
 	k8-fr-retired-resyncs \
 	k8-fr-retired-near-returns \
 	k8-fr-retired-near-returns-mispredicted \
 	k8-fr-retired-taken-branches-mispredicted-by-addr-miscompare \
 	k8-fr-retired-fpu-instructions \
 	k8-fr-retired-fastpath-double-op-instructions \
 	k8-fr-interrupts-masked-cycles \
 	k8-fr-interrupts-masked-while-pending-cycles \
 	k8-fr-taken-hardware-interrupts \
 	k8-fr-decoder-empty \
 	k8-fr-dispatch-stalls \
 	k8-fr-dispatch-stall-from-branch-abort-to-retire \
 	k8-fr-dispatch-stall-for-serialization \
 	k8-fr-dispatch-stall-for-segment-load \
 	k8-fr-dispatch-stall-when-reorder-buffer-is-full \
 	k8-fr-dispatch-stall-when-reservation-stations-are-full \
 	k8-fr-dispatch-stall-when-fpu-is-full \
 	k8-fr-dispatch-stall-when-ls-is-full \
 	k8-fr-dispatch-stall-when-waiting-for-all-to-be-quiet \
 	k8-fr-dispatch-stall-when-far-xfer-or-resync-branch-pending \
 	k8-fr-fpu-exceptions \
 	k8-fr-number-of-breakpoints-for-dr0 \
 	k8-fr-number-of-breakpoints-for-dr1 \
 	k8-fr-number-of-breakpoints-for-dr2 \
 	k8-fr-number-of-breakpoints-for-dr3 \
 	k8-nb-memory-controller-page-table-overflow \
 	k8-nb-memory-controller-dram-slots-missed \
 	k8-nb-memory-controller-bypass-saturation \
 	k8-nb-sized-commands \
 	k8-nb-probe-result

do
 	pmcstat -s $i sleep 1 2>&1 | sed -e 's/^ *//' -e 's/  */ /' \
 	    -e 's/ *$//' -e 's/\/00\/k8-/\/k8-/'
done
%%%

runpmc7:
%%%
for i in \
 	k7-dc-accesses \
 	k7-dc-misses \
 	k7-dc-refills-from-l2 \
 	k7-dc-refills-from-system \
 	k7-dc-writebacks \
 	k7-l1-dtlb-miss-and-l2-dtlb-hits \
 	k7-l1-and-l2-dtlb-misses \
 	k7-misaligned-references \
 	k7-ic-fetches \
 	k7-ic-misses \
 	k7-l1-itlb-misses \
 	k7-l1-l2-itlb-misses \
 	k7-retired-instructions \
 	k7-retired-ops \
 	k7-retired-branches \
 	k7-retired-branches-mispredicted \
 	k7-retired-taken-branches \
 	k7-retired-taken-branches-mispredicted \
 	k7-retired-far-control-transfers \
 	k7-retired-resync-branches \
 	k7-interrupts-masked-cycles \
 	k7-interrupts-masked-while-pending-cycles \
 	k7-hardware-interrupts

do
 	pmcstat -s $i sleep 1 2>&1 |
 	    sed -e 's/^ *//' -e 's/  */ /' -e 's/ *$//' -e 's/k7/kx/'
done
%%%

"runpm" tries up to all 256 perfomance counters, with names like the
hwpmc ones.  k7 means AthlonXP only; k8 means Athlon64 only; kx means
both, but many kx's don't really work or are not documented for both.
A few like kx-fr-retired-near-returns-mispredicted (?) are not documented
for AXP but seem to work and are useful.

runpmc tries the documented A64 counters.  runpmc7 tries the documented
AXP counters.  hwpmc is less useful than perfmon here since it doesn't
support using the undocumented counters.  There is a pmc* option that
prints all the counters in the above lists.  I checked that they are
almost precisely the documented (in Athlon optimization manuals) ones.
Names are unfortunately inconsistent between k7 and k8 in some cases,
following inconsistencies in the documentation.

I don't know anything about Pentium counters except what is in source
code.

gprof output for the mumble perfmon counter (kx-dc-misses?) while sending
100000 tiny packets using ttcp -t looks like this (after fixing the
calibration):

%%%
granularity: each sample hit covers 16 byte(s) for 0.00% of 2.81 seconds

   %   cumulative   self              self     total
  time   seconds   seconds    calls  us/call  us/call  name
  11.0      0.308    0.308   100083        3       24  syscall [3]
  10.8      0.613    0.305   200012        2        2  rn_match [16]
   4.4      0.738    0.125   100019        1        1  _bus_dmamap_load_buffer [25]
   4.3      0.859    0.121   300107        0        0  generic_copyin [27]
   4.0      0.973    0.114   100006        1        9  ip_output [10]
   3.8      1.079    0.106   100006        1        4  ether_output [12]
   3.7      1.182    0.103   100007        1        1  fgetsock [30]
   3.6      1.284    0.102   100006        1        2  bus_dmamap_load_mbuf [21]
   3.6      1.385    0.101   200012        1        3  rtalloc_ign [11]
   3.6      1.486    0.101   100083        1       25  Xint0x80_syscall [2]
   3.6      1.587    0.101   200012        1        1  in_clsroute [32]
   3.6      1.688    0.101   100006        1       20  sendto [4]
   3.6      1.789    0.101   100008        1        1  in_pcblookup_hash [33]
   3.6      1.890    0.101   100006        1       16  kern_sendit [6]
   3.6      1.990    0.100   200012        1        2  in_matroute [15]
   3.6      2.091    0.100   100748        1        1  doreti [35]
   3.6      2.191    0.100   100007        1        2  getsockaddr [22]
%%%

I would like to be able to do this with hwpmc but don't see how it can.
Only (non-statistical) counting at every function call and return can
give precise counts like the above.  However, non-statistical counting
is better for some things.

Back to the original problem.  Uma allocation overhead shows up in TSC
profiles of ttcp, but is just one of too many things that take a while.
There are about function calls, each taking about 1%:

% granularity: each sample hit covers 16 byte(s) for 0.00% of 0.86 seconds
% 
%   %   cumulative   self              self     total 
%  time   seconds   seconds    calls  ns/call  ns/call  name 
%  44.9      0.388    0.388        0  100.00%           mcount [1]
%  20.9      0.569    0.180        0  100.00%           mexitcount [7]
%   8.0      0.638    0.069        0  100.00%           cputime [14]
%   1.8      0.654    0.016        0  100.00%           user [25]
%   1.6      0.668    0.014   100006      143     1051  udp_output [12]
%   1.5      0.681    0.013   100006      133      704  ip_output [13]
%   1.3      0.692    0.011   300120       37       37  copyin [30]
%   1.2      0.702    0.010   100006      100     1360  sosend_dgram [10]
%   0.9      0.710    0.008   200012       39       39  rn_match [38]
%   0.9      0.718    0.007   300034       25       25  memcpy [39]
%   0.8      0.725    0.007   200103       36       58  uma_zalloc_arg [29]
%   0.8      0.732    0.007   100090       68     1977  syscall [4]

All the times seem reasonable.  Without profiling, sendto() and overheads
takes about 1700 nsec in -current and about 1600 nsec in my version
of 5.2.  (This is for -current.  The 100 nsec difference is very hard
to understand in detail.)  With high resolution kernel profiling, sendto()
and overheads take about 8600 nsec here.  Profiling has subtracted its
own overheads and the result of 1977 nsec for syscall is consistent with
syscall() taking a bit less that 1700 nsec when not looked at.  (Profiling
only subtracts its best-case overheads.  Its runtime overheads must be
larger due to cache effects, and if these are very large then we cannot
trust the compensation.  Since it compensated from 8600 down to about 1977,
it has clearly down the compensation almost right.  The compensation is
delicate when there are a lot of functions taking ~20 nsec since the profiling
overhead per function call is 82 nsec.

%   0.8      0.738    0.007   200012       33       84  rtalloc1 [24]
%   0.8      0.745    0.006   100006       65      139  bge_encap [26]
%   0.7      0.751    0.006   100006       62      201  bge_start_locked [20]
%   0.6      0.757    0.006   200075       28       28  bzero [44]
%   0.6      0.761    0.005   100006       48      467  ether_output [15]
%   0.6      0.766    0.005   100006       48      192  m_uiotombuf [22]
%   0.5      0.771    0.005   200100       23       45  uma_zfree_arg [33]
%   0.5      0.775    0.005   100006       46       46  bus_dmamap_load_mbuf_sg [45]
%   0.5      0.780    0.004   100028       45      132  malloc [27]
%   0.5      0.784    0.004   200012       20      104  rtalloc_ign [19]

This is hard to optimize.

uma has shown up as taking 58 nsec for uma_zalloc_arg() (including what it
calls) and 45 nsec for uma_zfree_arg().  This is on a 2.2GHz A64.  Anything
larger than that might be a profiling error.  But thes allocations here are
tiny -- maybe large allocations cause cache methods.

I was able to optimize away most the allocation overheads in sendto()
be allocating the sockaddr on the stack, but this made little difference
overall.  (It reduces dynamic allocations per packet from 2 to 1.  Both
allocations use malloc() so they are a bit slower than pure uma.  BTW,
switching from mbuf-based allocation to malloc() in getsockaddr() etc.
long ago cost 10 usec on a P1/133.  A loss of 10000 nsec makes the overhead
of 200 nsec for malloc now look tiny.)

Remember I said that differences of 100 nsec are hard to understand?
It is also not easy to understand why eliminating potentially 100 nsec
of malloc() overhead makes almost no difference overall.  The 100 nsec
gets distributed differently, or maybe the profiling really was wrong
for the malloc() part.

Reads of the TSC are excuted possibly-out-of-order on some CPUs.  This
doesn't seem to have much effect on the accuracy of high resolution
(TSC) kernel profiling on at least Athlons.  rdtsc takes only 12 cycles
on AXP-A64.  I think it takes much longer on Pentiums.  On Phenom it
takes ~42 cycles (pessimized to share it across CPUs :-().  With it
taking much longer than the functions that it profiles, the compensation
might become too fragile.

Bruce