80386 support in -current
Robert Watson
rwatson at freebsd.org
Sun Jan 25 18:36:29 PST 2004
On Mon, 26 Jan 2004, Peter Jeremy wrote:
> >This last point is the clincher. The chip does NOT have enough "umphf". I
> >actually managed to boot a -current (from back then) on a 80386SX and it
> >was torturously slow. An ls(1) on my empty home directory took 15 seconds.
> >My VAX is faster.
>
> This is a bug in FreeBSD 5.x - the performance in general has degraded
> since 4.x. Performance degradation is often more obvious in lower end
> machines.
There are some areas where performance is improved, and several important
areas where it's worse. I'd encourage all FreeBSD developers to look at
areas where it's worse and fix things :-). That said, I know there's a
fair amount of work going on relating to performance optimization, and
hopefully we'll start to see some of those results in the near future.
FWIW, I actually measure a pretty dramatic improvement in network
benchmarks on 5.x relative to 4.x in the SMP case through increased
parallelism and asynchrony. The areas I'm aware of that require
particular attention at this point include:
- Improving interrupt latency. We've moved to ithreads, but haven't spent
enough time optimizing the performance of our ithread implementation.
Bosko did a sample i386 implementation of lightweight context switches
last year, but at that time we didn't have enough device driver locking
to take advantage of it. We're now in much better shape locking-wise,
with a lot more just around the corner, so we need to focus on interrupt
latency. We held a conference call a few days ago to get some of the
interested parties together (Bosko, Jeff, et al), and it looks like
Peter Wemm has foolishly signed up to update/re-implement it on a recent
5.x. Use of the IO APIC is necessary for SMP systems, but it also adds
a fair amount of overhead. In some recent uniprocessor
benchmarking, I saw an observable overhead for using 'device apic' -- it
could be we want to back off the use of device apic on these systems.
- General optimization of locking. We've put in a fair number of locks,
and pushed Giant off some of the interesting paths (e.g., pipe
locking). We now need to look at lock granularity. I recently
committed some changes to our mutex profiling code to measure lock
contention. I suspect we're not seeing a lot of contention, with the
exception of Giant, and so we might actually want to look at reducing
the number of locks using mutex pools (where possible) to lower memory
overhead. We have a number of tools here that can help us, and now that
things are maturing locking-wise, we should use them. We are also
likely pretty close to pushing Giant further off a number of pieces of
process-related code, which should help quite a bit with things like
large builds.
- Get the socket locking into the tree. Large parts of the network stack
can now run Giant-free, and there are substantial outstanding patches
for a lot more. Cleanup is required, but hopefully we'll see some
patches posted for testing soon. There are some areas of the network
stack that require substantial further attention -- for example, the
KAME code requires additional locking work to run Giant-free.
- Reduce the overhead of in-kernel thread context switching. We do more
context switching than we used to, not just because of ithreads, but
also because we have used threads to increase asynchrony and serialize
work queues.
- Reduce the cost of lock operations. There have been some suggestions
that our current mutexes consume more memory than necessary in
non-debugging cases, and also are more expensive than necessary in some
cases.
- Explore additional use of the UMA slab allocator. In particular, see
whether using it can help improve performance with System V IPC, where
currently the implementation does its own memory caching and handling.
There have also been some proposals to increase use of UMA in the
network stack, use it further for sockets, etc. I know there has also
been some experimentation with using UMA to replace the current mbuf
allocator.
- Trim unneeded fields from a number of kernel structures. As KSE went
in, struct proc was broken out into a number of pieces. In some cases,
variables lived on in multiple structures, and can now be cleaned out.
Likewise in other kernel data structures.
- Take better advantage of CPU class optimizations. There has been some
discussion of providing HAL modules for the kernel, and libraries for
userspace, based on the CPU type to improve performance, e.g.,
optimized mutexes, memory zeroing, context switching, etc. Right now we
do a fairly poor job at picking up these optimizations, and carry around
a lot of memory overhead to support a large set. We need to do a better
job where possible -- we should really see the results if we're able to
optimize code such as the crypto code for specific CPUs.
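As a rough illustration of the mutex pool idea in the locking item above:
rather than embedding a mutex in every object, hash the object's address
into a small fixed array of shared mutexes. This is a userspace sketch
using pthreads, not kernel code. The names lock_pool, lock_pool_init(),
lock_pool_index(), lock_pool_find() and POOL_SIZE are hypothetical and
chosen only for this example, and the hash is an arbitrary choice.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define POOL_SIZE 64	/* hypothetical size; must be a power of two */

/* A fixed set of mutexes shared by many objects. */
struct lock_pool {
	pthread_mutex_t locks[POOL_SIZE];
};

/* Initialize every mutex in the pool; returns 0 or a pthread error code. */
static int
lock_pool_init(struct lock_pool *pool)
{
	for (size_t i = 0; i < POOL_SIZE; i++) {
		int error = pthread_mutex_init(&pool->locks[i], NULL);
		if (error != 0)
			return (error);
	}
	return (0);
}

/* Map an object address to a slot in the pool. */
static size_t
lock_pool_index(const void *obj)
{
	uintptr_t addr = (uintptr_t)obj;

	/* Fold in higher bits, then skip low bits that alignment zeroes. */
	addr ^= addr >> 12;
	return ((size_t)((addr >> 4) & (POOL_SIZE - 1)));
}

/* Return the pool mutex that protects the given object. */
static pthread_mutex_t *
lock_pool_find(struct lock_pool *pool, const void *obj)
{
	return (&pool->locks[lock_pool_index(obj)]);
}
```

The saving is one pointer-free lookup instead of a mutex per object. The
cost is that unrelated objects can hash to the same mutex, so there may be
false contention, and code holding one pool lock should not try to take
another, since both objects might map to the same non-recursive mutex.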
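For the CPU class item above, one possible shape (not necessarily what
was proposed in those discussions) is to pick a routine once at startup
and call it through a function pointer afterwards. The sketch below is
userspace C. The enum cpu_class values, zero_bytewise(), zero_tuned(),
cpu_select_routines() and zero_memory() are all hypothetical names, and
the "tuned" routine is only a stand-in that defers to memset().

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical CPU classes; a real kernel would probe the CPU at boot. */
enum cpu_class { CPU_CLASS_GENERIC, CPU_CLASS_FAST_BLOCK_OPS };

typedef void (*zero_fn)(void *dst, size_t len);

/* Portable byte-at-a-time fallback. */
static void
zero_bytewise(void *dst, size_t len)
{
	unsigned char *p = dst;

	while (len-- > 0)
		*p++ = 0;
}

/* Stand-in for a CPU-specific routine; here it just calls memset(). */
static void
zero_tuned(void *dst, size_t len)
{
	memset(dst, 0, len);
}

/* Selected once; defaults to the portable version. */
static zero_fn zero_impl = zero_bytewise;

/* Choose implementations for the detected CPU class. */
static void
cpu_select_routines(enum cpu_class cls)
{
	zero_impl = (cls == CPU_CLASS_FAST_BLOCK_OPS) ?
	    zero_tuned : zero_bytewise;
}

/* Callers use this and never need to know which routine was chosen. */
static void
zero_memory(void *dst, size_t len)
{
	zero_impl(dst, len);
}
```

The indirect call has its own cost, so this only pays off where the tuned
routine is meaningfully faster than the generic one.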
Robert N M Watson FreeBSD Core Team, TrustedBSD Projects
robert at fledge.watson.org Senior Research Scientist, McAfee Research