cvs commit: src/sys/gnu/ext2fs ext2_bitops.hext2_linux_balloc.c ext2_linux_ialloc.c ia64-bitops.h

Tue Aug 26 03:00:36 PDT 2003

On Tue, 26 Aug 2003, Marcel Moolenaar wrote:

> On Tue, Aug 26, 2003 at 05:44:40PM +1000, Bruce Evans wrote:
> > On Sun, 24 Aug 2003, Marcel Moolenaar wrote:
> > >   ...
> > >   Log:
> > >   Change of plans: Add ext2_bitops.h with generic and portable
> > >   implementations. Use those on platforms that don't have MD
> > >   ...
> >
> > Thnaks.  I'll deventually do some benchmarks to see if we want the MD
> > (inline assembler) versions at all.
>
> Ok. I didn't write the functions for speed so there may be some
> opportunities after analyzing what the compiler made of it.
>
> BTW: I noticed that GCC has a builtin implementation for ffs().
> It may be beneficial.

Is this more than ffs(3)?  (It could be for bit strings generally.).

I recently benchmarked ffs(3).  Too many builtins are turned off by
-fno-builtin for the kernel, but on i386's we have an inline asm version
of ffs() which is still better than the gcc builtin according to my
benchmarks.  The main difference is that inline asm version uses a
conditional branch while the builtin uses extra instructions to avoid
the branch.  Somehow the branch was always faster on my test systems
(old Celeron, AthlonXP and freefall(?)).  I tested with a moving bit
to try to defeat branch target caches, but forgot to test with alternating
zero and nonzero data which would probably cause more branch mispredictions.
Data which is zero much more often than random data (probability 1/2^32)
may be unusual anyway.  IIRC, the amount by which the extra instructions
were slower was quite machine-dependent, with the AthlonXP handling them
almost as fast as a branch.  The extra instructions look like this:

	movl	$0, %edx
	bsfl	x, %eax		# result less 1 if x was nonzero; else garbage
	sete	%dl		# begin fixup for garbage case
	negl	%edx
	orl	%edx, %eax	# end fixup for garbage case
	incl	%eax		# fix up semantic mismatch between bsf and ffs

This is highly non-parallelizable, so it is probably as inefficient as it
looks to a naive instruction counter.

Of course, any optimizations and pessimizations here are only a few cycles,
so they don't show up except in micro-benchmarks.

Bruce