ports/47061: Conflicting system headers by build of graphics/cqcam

Bruce Evans bde at zeta.org.au
Wed Dec 24 02:20:23 PST 2003


The following reply was made to PR kern/47061; it has been noted by GNATS.

From: Bruce Evans <bde at zeta.org.au>
To: Mark Linimon <linimon at lonesome.com>
Cc: freebsd-gnats-submit at freebsd.org
Subject: Re: ports/47061: Conflicting system headers by build of graphics/cqcam
Date: Wed, 24 Dec 2003 21:14:55 +1100 (EST)

 On Tue, 23 Dec 2003, Mark Linimon wrote:
 
 >  This is really a kernel problem.  I am going to go ahead and commit a
 >  workaround for this and the one or two other ports with this problem --
 >  but the workaround is basically unacceptable.
 
 Er, this is really a port[s] problem.  <machine/cpufunc.h> is not intended
 to be included by applications.  There was never any conflict with <string.h>
 in the kernel because the kernel never included <string.h>, and the kernel
 now avoids bogus conflicts, if any, with gcc's builtin ffs() using
 -fno-builtin.
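 
 To make the conflict concrete, here is a minimal sketch (not taken from the
 port itself, just an illustration) of what an i386 program runs into:
 <string.h> declares the libc function ffs(3), while <machine/cpufunc.h>
 defines a static inline ffs() of its own, so a translation unit that
 includes both is rejected by the compiler:
 
 %%%
 #include <sys/types.h>
 #include <string.h>		/* declares:  int ffs(int); */
 #include <machine/cpufunc.h>	/* defines:   static __inline int ffs(int) */
 
 int
 first_set(int mask)
 {
 	/*
 	 * gcc errors out above: the static inline definition of ffs() in
 	 * <machine/cpufunc.h> follows the non-static declaration from
 	 * <string.h>, so we never even get to this call.
 	 */
 	return (ffs(mask));
 }
 %%%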
 
 >  The underlying problem is that machine/cpufunc.h for i386 has had
 >  a definition for a machine function 'ffs' for, oh, say, about 9 years
 >  now.  However, man ffs will show you that there is an ffs(3) function
 >  as well.  Even after reading the source it's not clear to me if these
 >  are supposed to have the same purpose -- someone with a more intimate
 >  knowledge of i386 arch is going to have to rule for certain.
 
 They are the same.  Last time I checked (less than a year ago), the gcc
 builtin was still slower than the kernel inline except possibly when the
 latter can use non-base-arch instructions like cmov.  amd64's always have
 cmov and always use the builtin.
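 
 For reference, the i386 kernel inline is basically a bsf guarded by a test
 for a zero argument.  Roughly (paraphrased from <machine/cpufunc.h>; the
 exact code differs a bit between branches):
 
 %%%
 static __inline u_int
 bsfl(u_int mask)
 {
 	u_int	result;
 
 	__asm __volatile("bsfl %1,%0" : "=r" (result) : "rm" (mask));
 	return (result);
 }
 
 static __inline int
 ffs(int mask)
 {
 	/* bsf leaves its result undefined for a 0 operand, so branch over it. */
 	return (mask == 0 ? 0 : (int)bsfl((u_int)mask) + 1);
 }
 %%%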
 
 ... I checked again.  With the following slightly too simple test:
 
 %%%
 #include <sys/types.h>
 #include <machine/cpufunc.h>
 
 #include <stdlib.h>			/* for rand() */
 
 int z[4096];
 
 int
 main(void)
 {
 	volatile int v;		/* volatile so the stores (and calls) aren't optimized away */
 	int i, j;
 
 	for (i = 0; i < 4096; i++)
 		z[i] = 1 << rand();	/* Yes, this is sloppy. */
 	for (j = 0; j < 100000; j++)
 		for (i = 0; i < 4096; i++)
 #ifdef NOBUILTIN
 			v = ffs(z[i]);		/* the <machine/cpufunc.h> inline */
 #else
 			v = __builtin_ffs(z[i]);
 #endif
 	return (0);
 }
 %%%
 
 Times on an Athlon XP1600 overclocked by 146/133:
 
 cc -O -mcpu=pentiumpro -o foo foo.c (default from bsd.cpu.mk)
         3.49 real         3.47 user         0.00 sys
 cc -O -mcpu=pentiumpro -DNOBUILTIN -o foo foo.c (default + kernel ffs())
         3.21 real         3.21 user         0.00 sys
 cc -O -march=pentiumpro -o foo foo.c (gives cmov and works on Athlon XP too):
         3.21 real         3.21 user         0.00 sys
 
 Here using cmov[e] gives the same amount of optimization as the kernel ffs()
 gets by using a simple conditional branch instead of a slow instruction
 sequence starting with "set"[e].  Mispredicted branches are expensive on
 some arches, but apparently they aren't on Athlons.  The rand() in the
 test was intended to cause mispredicted branches as well as lengthy
 searches, but it doesn't actually do so: the branch is never taken, since
 z[i] is never 0.  On changing the initialization of z[i] so that the
 branch is taken every second time:
 
 		/* Even-indexed entries stay 0, since z[] is a zero-filled global. */
 		if (i & 1)
 			z[i] = 1 << rand();
 
 the kernel version becomes much faster:
 
         2.01 real         2.00 user         0.00 sys
 
 and the other times don't change significantly.  This is presumably
 because the Athlon predicts taking the branch every second time
 perfectly.  The bit-search instruction is very expensive (and always
 takes the same time??) and by branching over it every second time the
 cost per iteration is almost halved.
 
 A better benchmark might randomize the branches, but this might be
 even further from real applications, since an arg of 0 may be very
 unlikely (or very likely).
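 
 E.g., something like this for the initialization (untested sketch) would
 give a 0 arg about half the time, in random order:
 
 %%%
 	for (i = 0; i < 4096; i++)
 		z[i] = (rand() & 1) ? 1 << rand() : 0;
 %%%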
 
 Times on a Celeron 366:
 gcc builtin without cmov (very slow!):
        15.78 real        15.68 user         0.00 sys
 gcc builtin with cmov:
         5.64 real         5.61 user         0.00 sys
 kernel ffs():
         5.85 real         5.81 user         0.00 sys
 kernel ffs() with alternating 0's (again, others not affected by alternating):
         5.62 real         5.58 user         0.00 sys
 
 Times on an amd64 (sledge = Opteron 244 1804 MHz):
 
 gcc builtin with cmov:
         2.73 real         2.72 user         0.00 sys
 old kernel ffs():
         3.42 real         3.39 user         0.01 sys
 kernel ffs() with alternating 0's (again, the builtin is not affected by alternating):
         1.82 real         1.82 user         0.00 sys
 
 So using cmov is actually significantly better than a simple branch on
 amd64's, but only if the arg isn't often 0.
 
 >  In the meantime, I'm going to hold my nose and commit an include
 >  file to the port that is merely the inb/outb functions.  This is
 >  clearly a hack that should go away once a "correct" solution is found.
 
 This is approximately correct, not a hack.  The system could provide
 a header that implements inb() and outb() functions for userland (*),
 but <machine/cpufunc.h> is not this header.  It's just a bit much for
 multiple applications to have to duplicate these interfaces.
 
 (*) They shouldn't exist in the kernel.  Bus-space should be used.
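 
 If someone does add such a header, it only takes a few lines of inline
 asm.  A rough sketch (untested; note that the process still needs I/O
 privilege, e.g. by opening /dev/io, before these instructions will work):
 
 %%%
 /* Hypothetical userland-only inb()/outb() header (sketch, not committed). */
 #include <sys/types.h>
 
 static __inline u_char
 inb(u_int port)
 {
 	u_char	data;
 
 	__asm __volatile("inb %%dx,%0" : "=a" (data) : "d" (port));
 	return (data);
 }
 
 static __inline void
 outb(u_int port, u_char data)
 {
 	__asm __volatile("outb %0,%%dx" : : "a" (data), "d" (port));
 }
 %%%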
 
 Bruce

