Performance of SheevaPlug on 8-stable

Sun Mar 7 21:25:43 UTC 2010

FreeBSD-current has kernel and user witness turned on. Witness is for
locks, so it should not change the performance of a tight arithmetic loop
like this.

I don't know the Marvell internals, and from what I can tell, their technical
docs require an NDA. That said, many of the ARM processors also have an
internal instruction cache (instruction prefetch) in addition to the
instruction cache. I don't think the prefetch has an enable/disable.

From the cpu identification, it looks like branch prediction
is turned on. Branch prediction compensates for the longer pipelines.
I can't see how that could go astray in the tight loop.

Thus says the ARM ARM:

	ARM implementations are free to choose how far ahead of the
	current point of execution they prefetch instructions; either
	a fixed or a dynamically varying number of instructions. As well
	as being free to choose how many instructions to prefetch, an ARM
	implementation can choose which possible future execution path to
	prefetch along. For example, after a branch instruction, it can
	choose to prefetch either the instruction following the branch
	or the instruction at the branch target. This is known as branch
	prediction.

There are a few dangling allocations that I would like to see
closed from the multiple kernel allocation fix. *IN THEORY, IF* a page
is allocated via the arm_nocache (DMA COHERENT) or a sendfile, then
it is never marked as unallocated. *IN THEORY*, if that page is used
again, then we could falsely believe that page is being shared and
turn off the cache, even though it is not shared.

	http://www.casselton.net/~tinguely/arm_pmap_unmanaged.diff

* Disclaimer: I am not sure whether DMA COHERENT or sendfile is used in
the Sheeva implementation. This is a theoretical observation of a side
effect of the multiple kernel mapping patch that we did just before
FreeBSD 8-release.
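
To make the theoretical sequence above concrete, here is a toy model in C.
It is only a sketch: struct toy_page and every toy_* function are
hypothetical names made up for illustration, not the real arm pmap
structures or routines. It shows how a "kernel mapped" mark that is never
cleared could make a reused page look shared, so the cache is turned off.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-page bookkeeping; NOT the real FreeBSD pmap data. */
struct toy_page {
	bool kernel_mapped;	/* page is believed to have a kernel mapping */
	bool cacheable;		/* mappings of this page may use the cache */
};

/* Stand-in for an unmanaged kernel allocation (arm_nocache or sendfile). */
static void
toy_kernel_alloc(struct toy_page *pg)
{
	assert(pg != NULL);
	pg->kernel_mapped = true;
}

/*
 * Release the kernel mapping. With clear_mark == false the mark is left
 * behind, which models the suspected dangling allocation.
 */
static void
toy_kernel_free(struct toy_page *pg, bool clear_mark)
{
	assert(pg != NULL);
	if (clear_mark)
		pg->kernel_mapped = false;
}

/*
 * A later mapping of the same page. If the page still looks kernel
 * mapped, it is treated as shared and the cache is turned off.
 */
static void
toy_user_map(struct toy_page *pg)
{
	assert(pg != NULL);
	pg->cacheable = !pg->kernel_mapped;
}

/* Run alloc, free, reuse; report whether the reused page is cacheable. */
bool
toy_reuse_is_cacheable(bool clear_mark)
{
	struct toy_page pg = { false, true };

	toy_kernel_alloc(&pg);
	toy_kernel_free(&pg, clear_mark);
	toy_user_map(&pg);
	return (pg.cacheable);
}
```

In this model, toy_reuse_is_cacheable(true) leaves the reused page cacheable,
while toy_reuse_is_cacheable(false) leaves it uncached even though nothing
shares it anymore.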

--Mark Tinguely