80386 support in -current

Sun Jan 25 18:36:29 PST 2004

On Mon, 26 Jan 2004, Peter Jeremy wrote:

> >This last point is the clincher. The chip does NOT have enough "umphf". I
> >actually managed to boot a -current (from back then) on a 80386SX and it
> >was torturously slow. An ls(1) on my empty home directory took 15 seconds.
> >My VAX is faster.
> 
> This is a bug in FreeBSD 5.x - the performance in general has degraded
> since 4.x.  Performance degradation is often more obvious in lower end
> machines. 

There are some areas where performance is improved, and several important
areas where it's worse.  I'd encourage all FreeBSD developers to look at
areas where it's worse and fix things :-).  That said, I know there's a
fair amount of work going on relating to performance optimization, and
hopefully we'll start to see some of those results in the near future. 
FWIW, I actually measure a pretty dramatic improvement in network
benchmarks on 5.x relative to 4.x in the SMP case through increased
parallelism and asynchrony.  The areas I'm aware of that require
particular attention at this point include: 

- Improving interrupt latency.  We've moved to ithreads, but haven't spent
  enough time optimizing the performance of our ithread implementation. 
  Bosko did a sample i386 implementation of lightweight context switches
  last year, but at that time we didn't have enough device driver locking
  to take advantage of it.  We're now in much better shape locking-wise,
  with a lot more just around the corner, so we need to focus on interrupt
  latency.  We held a conference call a few days ago to get some of the
  interested parties together (Bosko, Jeff, et al), and it looks like
  Peter Wemm has foolishly signed up to update/re-implement it on a
  recent 5.x.  Use of the IO APIC is necessary for SMP systems, but also
  adds a fair amount of overhead.  In some recent uniprocessor
  benchmarking, I saw an observable overhead for using 'device apic' -- it
  could be we want to back off the use of device apic on these systems. 

- General optimization of locking.  We've put in a fair number of locks,
  and pushed Giant off some of the interesting paths (e.g., pipe
  locking).  We now need to look at lock granularity.  I recently
  committed some changes to our mutex profiling code to measure lock
  contention.  I suspect we're not seeing a lot of contention, with the
  exception of Giant, and so we might actually want to look at reducing
  the number of locks using mutex pools (where possible) to lower memory
  overhead.  We have a number of tools here that can help us, and now
  that things are maturing locking-wise, we should use them.  We are also
  likely pretty close to pushing Giant further off a number of pieces of
  process-related code, which should help quite a bit with things like
  large builds.

- Get the socket locking into the tree.  Large parts of the network stack
  can now run Giant-free, and there are substantial outstanding patches
  for a lot more.  Cleanup is required, but hopefully we'll see some
  patches posted for testing soon.  There are some areas of the network
  stack that require substantial further attention -- for example, the
  KAME code requires additional locking work to run Giant-free.

- Reduce the overhead of in-kernel thread context switching.  We do more
  context switching than we used to, not just because of ithreads, but
  also because we have used threads to increase asynchrony and serialize
  work queues.

- Reduce the cost of lock operations.  There have been some suggestions
  that our current mutexes consume more memory than necessary in
  non-debugging cases, and also are more expensive than necessary in some
  cases.

- Explore additional use of the UMA slab allocator.  In particular, see
  whether using it can help improve performance with System V IPC, where
  currently the implementation does its own memory caching and handling.
  There have also been some proposals to increase use of UMA in the
  network stack, use it further for sockets, etc.  I know there has also
  been some experimentation with using UMA to replace the current mbuf
  allocator.

- Trim unneeded fields from a number of kernel structures.  As KSE went
  in, struct proc was broken out into a number of pieces.  In some cases,
  variables lived on in multiple structures, and can now be cleaned out. 
  Likewise in other kernel data structures. 

- Take better advantage of CPU class optimizations.  There has been some
  discussion of providing HAL modules for the kernel, and libraries for
  userspace, based on the CPU type to improve performance.  E.g.,
  optimized mutexes, memory zeroing, context switching, etc.  Right now we
  do a fairly poor job at picking up these optimizations, and carry around
  a lot of memory overhead to support a large set of CPU types.  We need
  to do a better job where possible -- we should really see the results
  if we're able to optimize code such as the crypto code for specific
  CPUs.
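
To make the mutex pool idea from the locking item concrete, here is a
minimal userland sketch.  It assumes POSIX threads rather than the kernel
mutex API, and the pool size, the address hash, and all names (pool_init,
pool_find, struct counter) are hypothetical, chosen only for illustration.
The point is that objects carry no lock of their own and instead share a
small fixed array of mutexes selected by address.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define POOL_SIZE 64            /* hypothetical; must be a power of two */

static pthread_mutex_t pool[POOL_SIZE];

/* Initialize every mutex in the pool; call once before first use. */
static void pool_init(void)
{
    for (size_t i = 0; i < POOL_SIZE; i++)
        pthread_mutex_init(&pool[i], NULL);
}

/* Map an object's address to one of the pooled mutexes. */
static pthread_mutex_t *pool_find(const void *obj)
{
    uintptr_t a = (uintptr_t)obj;

    /* Drop low bits, which tend to be zero because of alignment. */
    return &pool[(a >> 6) & (POOL_SIZE - 1)];
}

/* An object with no embedded lock; it borrows one from the pool. */
struct counter {
    long value;
};

static void counter_add(struct counter *c, long n)
{
    pthread_mutex_t *m = pool_find(c);

    pthread_mutex_lock(m);
    c->value += n;
    pthread_mutex_unlock(m);
}
```

The memory saving comes from sharing POOL_SIZE mutexes among any number
of objects.  The cost is that unrelated objects can hash to the same
mutex, so a thread should not hold two pool mutexes at once, and this
only pays off where contention is low.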

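As a rough illustration of the CPU class item, the sketch below picks a
memory zeroing routine once at startup and calls it through a function
pointer afterwards.  This is not how the kernel does it today; the enum,
the function names, and the use of memset() as a stand-in for a tuned
routine are all assumptions made for the example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical CPU classes; a real system would probe the hardware. */
enum cpu_kind { CPU_KIND_GENERIC, CPU_KIND_TUNED };

typedef void (*zero_fn)(void *, size_t);

/* Portable byte-at-a-time fallback. */
static void zero_generic(void *p, size_t n)
{
    unsigned char *b = p;

    while (n-- > 0)
        *b++ = 0;
}

/* Stand-in for a CPU-specific routine; memset() is only a placeholder. */
static void zero_tuned(void *p, size_t n)
{
    memset(p, 0, n);
}

/* Routine currently in use; defaults to the portable version. */
static zero_fn zero_impl = zero_generic;

/* Select the implementation once, e.g. during early startup. */
static void cpu_select(enum cpu_kind k)
{
    zero_impl = (k == CPU_KIND_TUNED) ? zero_tuned : zero_generic;
}

/* Callers use this and never need to know which routine was chosen. */
static void mem_zero(void *p, size_t n)
{
    zero_impl(p, n);
}
```

Only the selected routine is on the hot path.  Loading just the matching
implementation as a module, rather than compiling in all of them, is what
would cut the memory overhead mentioned above.
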
Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
robert at fledge.watson.org      Senior Research Scientist, McAfee Research