Performance of SheevaPlug on 8-stable

Ian Lepore freebsd at damnhippie.dyndns.org
Tue Jan 31 03:31:47 UTC 2012


I would like to revive and add to this old topic.  I'm using the
original subject line of the threads from long ago just to help out
anyone searching for info in the future; we ran into the problem on
Atmel at91rm9200 hardware rather than SheevaPlug.  The original threads
are archived here:

  http://lists.freebsd.org/pipermail/freebsd-arm/2010-March/002243.html
  http://lists.freebsd.org/pipermail/freebsd-arm/2010-November/002635.html

To summarize them... Ever since 8.0, performance of userland code on arm
systems with a VIVT cache has ranged from bad to unusable, with symptoms that
tend to be hard to nail down definitively.  Much of the evidence pointed
to the instruction cache being disabled on some pages of apps and shared
libraries, sometimes.  Mark Tinguely explained how and why it's
necessary to disable caching on a page when there are multiple mappings
and at least one of them is writable.  There were some patches to
pmap-layer code that had visible effects but never really fixed the
problem.  I don't think anybody ever definitively nailed down why some
executable pages seem to permanently lose their icache enable bit.
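
Stated as a toy rule in C (purely illustrative; this is not the
kernel's pmap code):

  /*
   * Toy model of the VIVT rule: may a physical page's mappings stay
   * cacheable?  On a VIVT cache a write through one virtual address is
   * not seen by cache lines indexed through another, so a page with
   * multiple mappings, at least one of them writable, has to have
   * caching disabled on all of its mappings.
   */
  struct mapping {
      int writable;
  };

  static int
  page_cacheable(const struct mapping *maps, int nmaps)
  {
      int i, any_writable = 0;

      for (i = 0; i < nmaps; i++)
          if (maps[i].writable)
              any_writable = 1;

      return (!(nmaps > 1 && any_writable));
  }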

I tracked down the cause and developed a workaround (I'll post patches),
but to really fix the problem I would need a lot of help from VM/VFS
gurus.  

I apologize in advance for a bit of hand-waving in what follows here.
It was months ago that I was immersed in this problem; now I'm working
from a few notes and a fading memory.  I figured I'd better just post
before it fades completely, and hopefully some ensuing discussion will
help me remember more details.  I also still have a couple sandboxes
built with instrumented code that I could dust off and run with, to help
answer any questions that arise.

One of the most confusing symptoms of the problem is that performance
can change from run to run, and most especially it can change after
rebooting.  It turns out the run-to-run differences are based on what
type of IO brought each executable page into memory.  

When portions of an executable file (including shared libs) are read or
written using "normal IO" such as read(2), write(2), etc -- calls that
end up in ffs_read() and ffs_write() -- a kernel-writable mapping for
the pages is made before the physical IO is initiated, and that mapping
stays in place, with the icache disabled on those pages, for as long as
the buffer remains in the buffer cache (which for something like libc
effectively means forever).

When pages are mapped as executable with mmap(2) and then the IO is done
via demand paging when the pages are accessed, a temporary
kernel-writable mapping is made for the duration of the IO operation and
then removed again when the physical IO completes (leaving just a
read/execute mapping).  When the last writable mapping is removed the
icache bit is restored on the page.

(Semi-germane aside: the aio routines appear to work like the pager IO,
making a temporary writable kva mapping only for the duration of the
physical IO.)
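
For anyone who wants to poke at this from userland, the two paths get
exercised roughly like the little program below.  The library path is
just an example, and a real run obviously doesn't hit both paths cold
the way this pretends to; it's only meant to show which syscalls land
in which kernel path.

  #include <sys/mman.h>
  #include <fcntl.h>
  #include <unistd.h>

  int
  main(void)
  {
      char buf[4096];
      void *p;
      int fd;

      fd = open("/lib/libc.so.7", O_RDONLY);  /* example path */
      if (fd == -1)
          return (1);

      /* Path 1: read(2) -> ffs_read() -> buffer cache.  The blocks get
       * a long-lived kernel-writable mapping, so icache stays disabled
       * on those pages for as long as the buffers stay cached. */
      (void)read(fd, buf, sizeof(buf));

      /* Path 2: mmap(2) plus touching the page -> pager IO.  The
       * kernel-writable mapping exists only while the physical IO is
       * in flight, so the icache bit comes back afterwards. */
      p = mmap(NULL, 4096, PROT_READ | PROT_EXEC, MAP_PRIVATE, fd, 0);
      if (p != MAP_FAILED) {
          volatile char c = *(volatile char *)p;  /* fault the page in */
          (void)c;
          munmap(p, 4096);
      }
      close(fd);
      return (0);
  }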

The cause of the variability in symptoms is a race between the two types
of IO that happens when shared libs are loaded.  The race is kicked off
by libexec/rtld-elf/map_object.c; it uses pread(2) to load the first 4K
of a file to read the headers so that it can mmap() the file as needed.
The pread() eventually lands in ffs_read(), which decides to do a cluster
read or normal read-ahead.  Usually the read-ahead IO gets the blocks
into the buffer cache (and thus disables icache on all those pages)
before map_object() gets much work done, so the first part of a shared
library usually ends up icache-disabled.  If it's a small shared lib the
whole library may end up icache-disabled due to read-ahead.  
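
For reference, the racy bit of map_object.c looks roughly like this
(paraphrased from memory, not a verbatim quote of the source):

  union {
      Elf_Ehdr hdr;
      char     buf[PAGE_SIZE];
  } u;

  /* This pread() of the first page goes through ffs_read(), which can
   * kick off cluster-read / read-ahead IO into the buffer cache; that
   * IO is one side of the race against the demand paging triggered by
   * the mmap() calls that follow further down in map_object(). */
  if (pread(fd, u.buf, PAGE_SIZE, 0) == -1)
      return (NULL);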

Other times it appears that map_object() gets the pages mapped and
triggers demand-paging IO that completes before the read-ahead IO
initiated by the first 4K read, and in those cases the icache bit on the
pages gets turned back on when the temporary kernel mappings are
unmapped.

So when cluster or read-ahead IO wins the race, app performance is bad
until the next reboot or filesystem unmount or something else pushes
those blocks out of the buffer cache (which never happens on our
embedded systems).  How badly the app performs depends on what shared
libs it uses and the results of the races as each lib was loaded.  When
some demand-paging IO completes before the corresponding read-ahead IO
for the blocks at the start of a library, any further read-ahead seems
to stop (as I remember it), so the app doesn't take such a big
performance hit, sometimes hardly any hit at all.

In addition to the races on loading shared libs, doing "normal IO
things" to executable files and libs, such as compiling a new copy or
using cp or even 'cat app >/dev/null' (which I think came up in the
original thread), will cause that app to execute without icache on its
pages until its blocks are pushed out of the buffer cache.

Here's where I have to be extra-hand-wavy... I think the right way to
fix this is to make ffs_read/write (or, I guess, all vop_read and
vop_write implementations) work more like aio and pager IO in the sense
that they should make a temporary kva mapping that lasts only as long as
it takes to do the physical IO and associated uio operations.  I vaguely
remember thinking that the place to make that happen was along the lines
of doing the mapping in getblk() (or maybe breada()?) and unmapping it
in bdone(), but I was quite frankly lost in that twisty maze of code and
never felt like I understood it well enough to even make an experimental
stab at such changes.
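
To make the hand-waving a tiny bit more concrete, the shape of what I
mean is something like this (pure sketch, not working code;
buf_map_kva()/buf_unmap_kva() are made-up names and the real buffer
cache code is far messier):

  /* Give the buffer a writable kernel mapping only for the life of the
   * physical IO, the way the pager and aio paths already behave,
   * instead of leaving it mapped for as long as the buffer stays in
   * the cache. */
  bp = getblk(vp, lbn, size, 0, 0, 0);
  buf_map_kva(bp);      /* made-up helper: temporary writable mapping */
  /* ... start the physical IO, bufwait(bp), uiomove() the data ... */
  buf_unmap_kva(bp);    /* made-up helper: dropping the last writable
                           mapping lets pmap re-enable the icache */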

I have two patches related to this stuff.  They were generated from 8.2
sources but I've confirmed that they apply properly to -current.

One patch modifies map_object.c to use mmap()+memcpy() instead of
pread().  I think it's a useful enhancement even without its effect on
this icache problem, because it seems to me that doing a read-ahead on a
shared library will bring in pages that may never be referenced and
wouldn't have required any physical memory or IO resources if the
read-ahead hadn't happened.
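
The gist of that change (again a sketch, not the patch verbatim) is to
replace the pread() shown above with something along these lines:

  union {
      Elf_Ehdr hdr;
      char     buf[PAGE_SIZE];
  } u;
  void *p;

  /* Map the first page instead of pread()ing it, so reading the ELF
   * headers never goes through ffs_read() and so never triggers
   * cluster-read / read-ahead IO into the buffer cache. */
  p = mmap(NULL, PAGE_SIZE, PROT_READ, MAP_PRIVATE, fd, 0);
  if (p == MAP_FAILED)
      return (NULL);
  memcpy(u.buf, p, PAGE_SIZE);
  munmap(p, PAGE_SIZE);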

The other is a pure hack-workaround that's most helpful when you're
developing code for an arm platform.  It forces the O_DIRECT flag on in
ffs_write() (and optionally ffs_read(), but that's disabled by default)
for executable files, to keep the blocks out of the buffer cache when
doing normal IO stuff.  It's ugly brute force, but it's good enough to
let us develop and deploy embedded systems code using FreeBSD 8.2.  This
is not meant to be committed; it's just a workaround that let us start
using 8.2 before finding a real fix to the root problem.  Anyone else
trying to work with 8.0 or later on VIVT-cache arm chips might find it
useful until a proper fix is developed.
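
The heart of that hack is only a couple of lines; a sketch (not the
exact diff) of the ffs_write() part, where IO_DIRECT is the in-kernel
counterpart of O_DIRECT and VV_TEXT marks a vnode that is mapped
executable:

  /* For executable vnodes, behave as though the caller had asked for
   * direct IO, so the file's blocks don't linger in the buffer cache
   * with a writable kernel mapping. */
  if (vp->v_vflag & VV_TEXT)
      ioflag |= IO_DIRECT;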

-- Ian

-------------- next part --------------
A non-text attachment was scrubbed...
Name: ffs_vnops_icache_hack.patch
Type: text/x-patch
Size: 1570 bytes
Desc: not available
Url : http://lists.freebsd.org/pipermail/freebsd-arm/attachments/20120131/837e4416/ffs_vnops_icache_hack.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: rtld-elf-map_object.patch
Type: text/x-patch
Size: 886 bytes
Desc: not available
Url : http://lists.freebsd.org/pipermail/freebsd-arm/attachments/20120131/837e4416/rtld-elf-map_object.bin

