Performance of SheevaPlug on 8-stable

Warner Losh imp at bsdimp.com
Tue Jan 31 05:45:46 UTC 2012


Hi Ian,

Do you have any data on what 9.0 does?

Warner


On Jan 30, 2012, at 8:31 PM, Ian Lepore wrote:

> I would like to revive and add to this old topic.  I'm using the
> original subject line of the threads from long ago just to help out
> anyone searching for info in the future; we ran into the problem on
> Atmel at91rm9200 hardware rather than SheevaPlug.  The original threads
> are archived here:
> 
>  http://lists.freebsd.org/pipermail/freebsd-arm/2010-March/002243.html
>  http://lists.freebsd.org/pipermail/freebsd-arm/2010-November/002635.html
> 
> To summarize them... Ever since 8.0, performance of userland code on arm
> systems using VIVT cache ranges from bad to unusable, with symptoms that
> tend to be hard to nail down definitively.  Much of the evidence pointed
> to the instruction cache being disabled on some pages of apps and shared
> libraries, sometimes.  Mark Tinguely explained how and why it's
> necessary to disable caching on a page when there are multiple mappings
> and at least one is a writable mapping.  There were some patches of
> pmap-layer code that had visible effects but never really fixed the
> problem.  I don't think anybody ever definitively nailed down why some
> executable pages seem to permanently lose their icache enable bit.
> 
> I tracked down the cause and developed a workaround (I'll post patches),
> but to really fix the problem I would need a lot of help from VM/VFS
> gurus.  
> 
> I apologize in advance for a bit of hand-waving in what follows here.
> It was months ago that I was immersed in this problem; now I'm working
> from a few notes and a fading memory.  I figured I'd better just post
> before it fades completely, and hopefully some ensuing discussion will
> help me remember more details.  I also still have a couple of sandboxes
> built with instrumented code that I could dust off and run with, to help
> answer any questions that arise.
> 
> One of the most confusing symptoms of the problem is that performance
> can change from run to run, and most especially it can change after
> rebooting.  It turns out the run-to-run differences are based on what
> type of IO brought each executable page into memory.  
> 
> When portions of an executable file (including shared libs) are read or
> written using "normal IO" such as read(2), write(2), etc -- calls that
> end up in ffs_read() and ffs_write() -- a kernel-writable mapping for
> the pages is made before the physical IO is initiated, and that mapping
> stays in place and icache is disabled on those pages as long as the
> buffer remains in the cache (which for something like libc means
> forever).
> 
> When pages are mapped as executable with mmap(2) and then the IO is done
> via demand paging when the pages are accessed, a temporary kernel
> writable mapping is made for the duration of the IO operation and then
> is removed again when the physical IO is completed (leaving just a
> read/execute mapping).  When the last writable mapping is removed the
> icache bit is restored on the page.
> 
> (Semi-germane aside: the aio routines appear to work like the pager IO,
> making a temporary writable kva mapping only for the duration of the
> physical IO.)
> 
> The cause of the variability in symptoms is a race between the two types
> of IO that happens when shared libs are loaded.  The race is kicked off
> by libexec/rtld-elf/map_object.c; it uses pread(2) to load the first 4K
> of a file to read the headers so that it can mmap() the file as needed.
> The pread() eventually lands in ffs_read() which decides to do a cluster
> read or normal read-ahead.  Usually the read-ahead IO gets the blocks
> into the buffer cache (and thus disables icache on all those pages)
> before map_object() gets much work done, so the first part of a shared
> library usually ends up icache-disabled.  If it's a small shared lib the
> whole library may end up icache-disabled due to read-ahead.  
> 
> Other times it appears that map_object() gets the pages mapped and
> triggers demand-paging IO which completes before the readahead IO
> initiated by the first 4K read, and in those cases the icache bit on the
> pages gets turned back on when the temporary kernel mappings are
> unmapped.
> 
> So when cluster or read-ahead IO wins the race, app performance is bad
> until the next reboot or filesystem unmount or something else pushes
> those blocks out of the buffer cache (which never happens on our
> embedded systems).  How badly the app performs depends on what shared
> libs it uses and the results of the races as each lib was loaded.  When
> some demand-paging IO completes before the corresponding read-ahead IO
> for the blocks at the start of a library, it seems (as I remember it) to
> cause any further read-ahead to stop, so the app doesn't take such a big
> performance hit, sometimes hardly any hit at all.  
> 
> In addition to the races on loading shared libs, doing "normal IO
> things" to executable files and libs, such as compiling a new copy or
> using cp or even 'cat app >/dev/null', which I think came up in the
> original thread, will cause that app to execute without icache on its
> pages until its blocks are pushed out of the buffer cache.
> 
> Here's where I have to be extra-hand-wavy... I think the right way to
> fix this is to make ffs_read/write (or maybe I guess all vop_read and
> vop_write implementations) work more like aio and pager io in the sense
> that they should make a temporary kva mapping that lasts only as long as
> it takes to do the physical IO and associated uio operations.  I vaguely
> remember thinking that the place to make that happen was along the lines
> of doing the mapping in getblk() (or maybe breada()?) and unmapping it
> in bdone(), but I was quite frankly lost in that twisty maze of code and
> never felt like I understood it well enough to even make an experimental
> stab at such changes.
> 
> I have two patches related to this stuff.  They were generated from 8.2
> sources but I've confirmed that they apply properly to -current.
> 
> One patch modifies map_object.c to use mmap()+memcpy() instead of
> pread().  I think it's a useful enhancement even without its effect on
> this icache problem, because it seems to me that doing a readahead on a
> shared library will bring in pages that may never be referenced and
> wouldn't have required any physical memory or IO resources if the
> readahead hadn't happened.
> 
> The other is a pure hack-workaround that's most helpful when you're
> developing code for an arm platform.  It forces on the O_DIRECT flag in
> ffs_write() (and optionally ffs_read() but that's disabled by default)
> for executable files, to keep the blocks out of the buffer cache when
> doing normal IO stuff.  It's ugly brute force, but it's good enough to
> let us develop and deploy embedded systems code using FreeBSD 8.2.  This
> is not to be committed, this is just a workaround that let us start
> using 8.2 before finding a real fix to the root problem.  Anyone else
> trying to work with 8.0 or later on VIVT-cache arm chips might find it
> useful until a proper fix is developed.
> 
> -- Ian
> 
> --- sys/ufs/ffs/ffs_vnops.c	Thu Jun 16 14:43:20 2011 -0600
> +++ sys/ufs/ffs/ffs_vnops.c	Mon Jan 30 17:54:44 2012 -0700
> @@ -467,6 +467,18 @@ ffs_read(ap)
> 	seqcount = ap->a_ioflag >> IO_SEQSHIFT;
> 	ip = VTOI(vp);
> 
> +	// This hack keeps executable code out of the buffer cache, to work
> +	// around the disabled-icache problem caused by long-lived
> +	// kernel-writable mappings.  It is disabled by default: shell scripts
> +	// are executable too, and caching them is useful.  Also, with the
> +	// rtld-elf mmap() patch in place, nothing normally ever calls read()
> +	// on an executable file, so enabling this code wouldn't buy us much
> +	// anyway.
> +#if 0 && defined(__arm__)
> +	if (vp->v_type == VREG && ip->i_mode & IEXEC)
> +		ioflag |= IO_DIRECT;
> +#endif    
> +
> #ifdef INVARIANTS
> 	if (uio->uio_rw != UIO_READ)
> 		panic("ffs_read: mode");
> @@ -670,6 +682,17 @@ ffs_write(ap)
> 	seqcount = ap->a_ioflag >> IO_SEQSHIFT;
> 	ip = VTOI(vp);
> 
> +	// This hack ensures that executable code never ends up in the buffer cache.
> +	// It helps work around disabled-icache due to kernel-writable mappings.
> +	// On a deployed production system, nothing normally ever calls write() on
> +	// an executable file.  This hack exists to allow development on the system
> +	// (so that you can do things like copy a new executable onto the system
> +	// without having that destroy performance on subsequent runs).
> +#if defined(__arm__)
> +	if (vp->v_type == VREG && ip->i_mode & IEXEC)
> +		ioflag |= IO_DIRECT;
> +#endif
> +
> #ifdef INVARIANTS
> 	if (uio->uio_rw != UIO_WRITE)
> 		panic("ffs_write: mode");
> diff -r 0cb0be36b70f libexec/rtld-elf/map_object.c
> --- libexec/rtld-elf/map_object.c	Thu Jun 16 14:43:20 2011 -0600
> +++ libexec/rtld-elf/map_object.c	Mon Jan 30 20:03:45 2012 -0700
> @@ -272,11 +272,16 @@ get_elf_header (int fd, const char *path
> 	char buf[PAGE_SIZE];
>     } u;
>     ssize_t nbytes;
> +    void *mapped;
> 
> -    if ((nbytes = pread(fd, u.buf, PAGE_SIZE, 0)) == -1) {
> -	_rtld_error("%s: read error: %s", path, strerror(errno));
> +    /* Use mmap() + memcpy() rather than [p]read() to avoid readahead. */
> +    nbytes = sizeof(u.buf);
> +    if ((mapped = mmap(NULL, nbytes, PROT_READ, MAP_PRIVATE, fd, 0)) == MAP_FAILED) {
> +	_rtld_error("%s: mmap of header failed: %s", path, strerror(errno));
> 	return NULL;
>     }
> +    memcpy(u.buf, mapped, nbytes);
> +    munmap(mapped, nbytes);
> 
>     /* Make sure the file is valid */
>     if (nbytes < (ssize_t)sizeof(Elf_Ehdr) || !IS_ELF(u.hdr)) {
> _______________________________________________
> freebsd-arm at freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-arm
> To unsubscribe, send any mail to "freebsd-arm-unsubscribe at freebsd.org"


