FreeBSD handles leapsecond correctly

Sat Jan 7 11:40:20 PST 2006

:Matt,
:
:I've been testing network and routing performance over the past two weeks
:with a calibrated Agilent N2X packet generator.  My test box is a dual
:Opteron 852 (2.6GHz) with Tyan S8228 mobo and Intel dual-GigE in PCI-X-133
:slot. Note that I've run all tests with UP kernels em0->em1.
:
:For stock FreeBSD-7-CURRENT from 28. Dec. 2005 I've got 580kpps with fast-
:forward enabled.  An em(4) patch from Scott Long implementing a taskqueue
:raised this to 729kpps.
:
:For stock DragonFlyBSD-1.4-RC1 I've got 327kpps and then it breaks down and
:never ever passes a packet again until a down/up on the receiving interface.
:net.inet.ip.intr_queue_maxlen has to be set to 200, otherwise it breaks down
:at 252kpps already.  Enabling polling did not make a difference and I've tried
:various settings and combinations without any apparent effect on performance
:(burst=1000, each_burst=50, user_frac=1, pollhz=5000).
:
:What surprised me most, apart from the generally poor performance, is the sharp
:dropoff after max pps and the wedging of the interface.  I didn't see this kind
:of behaviour on any other OS I've tested (FreeBSD and OpenBSD).
:
:-- 
:Andre

    Well, considering that we haven't removed the MP lock from the network
    code yet, I'm not surprised at the poorer performance.  The priority has
    been on getting the algorithms in, correct, and stable, proving their
    potential, but not hacking things up to eke out maximum performance
    before its time.  At the moment there is a great deal of work slated for
    1.5 to properly address many of the issues.

    Remember that the difference between 327kpps and 729kpps is the difference
    between roughly 3 uS and 1.4 uS per packet of overhead.  That isn't all that
    huge a difference, really, especially considering that everything is
    serialized down to effectively 1 cpu due to the MP lock.
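
    The conversion from a forwarding rate to a per-packet time budget is
    just a reciprocal.  A minimal sketch in C (the function name is made up
    for illustration, and the rates plugged in below are the ones from the
    quoted message):

```c
#include <assert.h>

/*
 * Per-packet time budget, in nanoseconds, for a given forwarding rate.
 * Illustrative helper only; the name is hypothetical.
 */
static double
ns_per_packet(double packets_per_second)
{
    assert(packets_per_second > 0.0);
    return 1e9 / packets_per_second;
}
```

    For instance, ns_per_packet(327000.0) comes to about 3058 ns and
    ns_per_packet(729000.0) to about 1372 ns.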

    The single biggest overhead we have right now is that we have not 
    yet embedded a LWKT message structure in the mbuf.  That means we
    are currently malloc()ing and free()ing a message structure for every
    packet, costing at least 700 nS in additional overhead and possibly
    more if a cross-cpu free is needed (even with the passive IPIQ the
    free() code does in that case).  This problem is going to be fixed once
    1.4 is released, but in order to do it properly I intend to completely
    separate the mbuf data vs header concept... give them totally different
    structural names instead of overloading them with a union, then embedding
    the LWKT message structure in the mbuf_pkt.
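
    To illustrate the idea, here is a rough sketch in C contrasting a
    message that is allocated separately for every packet with one that is
    embedded in the packet header.  Every name here (demo_msg,
    demo_pkt_separate, demo_pkt_embedded and the helper functions) is
    hypothetical; this is not the real LWKT message or mbuf layout, only
    the general shape of the change.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a message structure (not the real LWKT one). */
struct demo_msg {
    int   cmd;
    void *reply_port;
};

/* Separate allocation: the packet only points at its message. */
struct demo_pkt_separate {
    size_t           len;
    struct demo_msg *msg;   /* malloc()ed on dispatch, free()d on completion */
};

/* Embedded: the message lives inside the packet header itself. */
struct demo_pkt_embedded {
    size_t          len;
    struct demo_msg msg;    /* no per-packet allocation needed */
};

/* Dispatch with a separate message: one malloc() per packet. */
static int
dispatch_separate(struct demo_pkt_separate *p, int cmd)
{
    p->msg = malloc(sizeof(*p->msg));
    if (p->msg == NULL)
        return -1;
    p->msg->cmd = cmd;
    p->msg->reply_port = NULL;
    return 0;
}

/* Completion with a separate message: one free() per packet. */
static void
complete_separate(struct demo_pkt_separate *p)
{
    free(p->msg);
    p->msg = NULL;
}

/* Dispatch with an embedded message: just fill in the fields. */
static int
dispatch_embedded(struct demo_pkt_embedded *p, int cmd)
{
    p->msg.cmd = cmd;
    p->msg.reply_port = NULL;
    return 0;
}
```

    The embedded variant has no allocation to fail and nothing to free on
    completion, which is where the per-packet savings would come from.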

    Another example would be our IP forwarding code.  Hahahah.  I'm amazed
    that it only takes 3 uS considering that it is running under both the
    MP lock *AND* the new mutex-like serializer locks that will be replacing
    the MP lock in the network subsystem AND hacking up those locks (so there
    are four serializer locking operations per packet plus the MP lock).

    The interrupt routing code has similar issues.  The code is designed to
    be per-cpu and tested in that context (by testing driver entry from other
    cpus), but all hardware interrupts are still being taken on cpu #0, and
    all polling is issued on cpu #0.  This adds considerable overhead,
    though it is mitigated somewhat by packet aggregation.
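
    As a back-of-the-envelope model of why aggregation helps: a fixed cost
    paid once per interrupt or poll gets spread over every packet in the
    batch.  The sketch below is only that simple model; the function name
    and any numbers fed to it are assumptions for illustration, not
    measurements.

```c
#include <assert.h>

/*
 * Effective per-packet cost when a fixed per-batch cost (fixed_ns) is
 * amortized over 'batch' packets, each also costing per_packet_ns.
 * Hypothetical model, not measured data.
 */
static double
amortized_ns_per_packet(double fixed_ns, double per_packet_ns, unsigned batch)
{
    assert(batch > 0);
    return fixed_ns / (double)batch + per_packet_ns;
}
```

    With an assumed 1000 ns fixed cost and 500 ns per packet, a batch of 1
    costs 1500 ns per packet while a batch of 10 costs 600 ns per packet.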

    There are two or three other non-algorithmic issues of that nature in
    the current network path that exist to allow the old algorithms to be
    migrated to the new ones and which are slowly being cleaned up.  I'm not
    at all surprised that all of these shims cost us roughly 1.7 uS in overhead.
    I've run end-to-end timing tests for a number of operations, which you
    can see from my BayLisa slides here:

	http://www.dragonflybsd.org/docs/LISA200512/

    What I have found is that the algorithms are sound and the extra overheads
    are basically just due to the migrationary hacks (like the malloc).
    Those tests also tested that our algorithms are capable of pipelining
    (MP safe wise) between the network interrupt and TCP or UDP protocol
    stacks, and they can with only about 40 ns of IPI messaging overhead.
    There are sysctls for testing the MP safe interrupt path, but they aren't
    production ready yet (because they aren't totally MP safe due to the
    route table, IP filter, and mbuf stats which are the only remaining
    items that need to be made MP safe).

    Frankly, I'm not really all that concerned about any of this.  Certainly
    not raw routing overhead (someone explain to me why you don't simply buy
    a Cisco, or write a custom driver if you really need to pop packets
    between interfaces at 1 megapps instead of trying to use a piece of
    generic code in a generic operating system to do it).  Our focus is
    frankly never going to be on raw packet switching because there is no
    real-life situation where you would actually need to switch such a high
    packet rate where you wouldn't also have the budget to simply buy an
    off-the-shelf solution.

    Our focus vis-a-vis the network stack is going to be on terminus
    communications, meaning UDP and TCP services terminated or sourced on
    the machine.  All the algorithms have been proved out; the only things
    preventing me from flipping the MP lock off are the aforementioned
    mbuf stats, route table, and packet filter code.  In fact, Jeff *has*
    turned off the MP lock for the TCP protocol threads for testing purposes,
    with very good results.  The route table is going to be fixed this month
    when we get Jeff's MPSAFE parallel route table code into the tree.  The
    mbuf stats are a non-problem, really, just some minor work.  The packet
    filter(s) are more of an issue.

    The numbers I ran for the BayLisa talk show our network interrupt overhead
    is around 1-1.5 uS per packet, and our TCP overhead is around
    1-1.5 uS per packet.  700 ns of that is the aforementioned malloc/free
    issue, and a good chunk of the remaining overhead is MP lock related.

    An interface lockup is a different matter.  Nothing can be said about 
    that until the cause of the problem is tracked down.  I can't speculate
    as to the problem without more information.

					-Matt
					Matthew Dillon 
					<dillon at backplane.com>