Running the network stack without Giant -- change in default coming
Richard Coleman
rcoleman at criticalmagic.com
Tue Aug 24 08:38:54 PDT 2004
Very very cool. It's exciting to see many of the long term
FreeBSD projects coming together like this.
Richard Coleman
rcoleman at criticalmagic.com
Robert Watson wrote:
> For some time, one of the major goals of the FreeBSD Project has been
> to allow the network stack to run in parallel on multiple processors
> at a time. Per my July 19, 2004 post to the freebsd-current mailing
> list, much of this support has now been merged to the FreeBSD
> 5-CURRENT branch (and now 6-CURRENT), with the intent of shipping
> this support in 5.3. And, per that post, it's now possible to run
> large parts of the network stack in this manner through the use of a
> system tunable at boot, debug.mpsafenet. This can result in a variety
> of performance benefits, especially on SMP, by improving concurrency
> and reducing latency. While it presents a "first cut" locking
> strategy, these benefits are still pretty tangible, and the resulting
> system is an excellent starting architecture for a broad range of
> performance work.
>
> Right now, that tunable "debug.mpsafenet" defaults to off (0) in the
> 5-CURRENT and 6-CURRENT branches. However, this will shortly change
> in 6-CURRENT to on (1), as most commonly exercised parts of the
> network stack are now ready for testing in this environment. Some
> caveats before I go into the details as to how to determine whether
> this is right for you:
>
> - While we've been doing pretty heavy testing in MPSAFE
> configurations, the nature of multiprocessor development and adapting
> code for MP safety means that it's unlikely this will "just work" for
> every last person who tries it. However, it appears to work well in
> a broad variety of environments and with fairly strenuous testing.
>
> - We've focussed primarily on getting mainstream network
> configurations to run without Giant: this means that less mainstream
> subsystems (parts of IPv6, some netgraph nodes, IPX, etc) are
> currently unsafe without the Giant lock turned on. Less mainstream
> network devices, even if the device drivers are not able to run
> without the Giant lock, are able to operate without Giant over the
> remainder of the stack due to compatibility code. This code comes
> with a performance penalty beyond just running with the Giant lock,
> so there is a strong motivation to complete locking for these
> straggling drivers.
>
> - You may run into hard to diagnose problems. We'd like to try to
> diagnose them anyway, but if you start to experience new problems,
> you'll want to go read the Handbook chapter on preparing kernel bug
> reports and diagnosing problems. You'll also want to be prepared to
> run the system with INVARIANTS and WITNESS turned on. The first step
> in debugging will be to try running with Giant turned back on by
> changing the debug.mpsafenet flag and seeing if the problem can be
> reproduced. Details below.
>
> - Not all workloads will experience a performance benefit -- some,
> for various reasons, will get worse. However, several interesting
> performance loads get measurably better. If you don't see an
> improvement, or you see things get worse, please don't be surprised
> -- you may want to look at some of the suggestions I make below on
> ways to make the results more predictable. Generally, you shouldn't
> see substantial performance degradation, if any, but it can't be
> ruled out, especially due to outstanding scheduler issues that are
> being worked on.
>
> - We can and will destroy your data. We don't mean to, because we
> like your data (and you!), and we try not to, but this is, after all,
> operating system development, and comes with risks.
>
> With this in mind, now is a good time to increase exposure for these
> changes, because they will become the default in the near future.
>
> Here's some technical information on how to get started:
>
> (1) Determine if all of the stack components you will operate with
> are MPsafe. For common configurations, answering the following
> questions will help you decide this:
>
> - Are you actively using IPv6, IPX, ATM, or KAME IPSEC? If you
> answered yes to any of these questions, it is not yet safe for you to
> run without Giant. Note that most use of IPv6 is safe, but there are
> some areas (multicast) that are not entirely safe yet.
>
> - Are you using Netgraph? If yes, it may be that you are not yet
> able to run without Giant. The framework and many nodes are MPSAFE,
> but some remain that are not. It is worth giving it a try, but you
> may experience panics, etc, especially in MP configurations.
>
> - Are you using SLIP or kernel PPP (not to be confused with user ppp,
> which is what most FreeBSD users use with modems)? If so, there are
> experimental patches to make SLIP safe, but out of the box you may
> see lock assertion failures. We are working to resolve this issue.
>
> - Are you using any physical network interfaces other than the
> following: ath, bge, dc, em, ep, fxp, rl, sis, xl, wi? If so, you
> may see a performance drop.
>
> NOTE: Do you maintain a network interface driver? Is it not on this
> list? Shame on you! Or maybe shame on me for not listing it, even
> though it should work. Drop me a private e-mail with any questions
> or comments. Please update the busdma driver status web page with
> your driver's status.
>
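As a rough aid for the interface question in step (1), the check can be sketched in Python. This is only an illustration: the driver set is copied from the list in the message as of August 2004, the function names are made up here, and the interface names passed in (with pseudo-interfaces such as lo0 left out by the caller) are assumed to be a driver name followed by a unit number.

```python
import re

# Drivers the message lists as able to run without Giant (August 2004).
MPSAFE_DRIVERS = {"ath", "bge", "dc", "em", "ep", "fxp", "rl", "sis", "xl", "wi"}


def driver_name(ifname):
    """Strip the trailing unit number, e.g. 'fxp0' -> 'fxp'."""
    return re.sub(r"\d+$", "", ifname)


def unlisted_interfaces(ifnames):
    """Return the interfaces whose driver is not in the list above."""
    return [name for name in ifnames if driver_name(name) not in MPSAFE_DRIVERS]


# Example input is hypothetical; de0 is not on the list, so it is reported.
print(unlisted_interfaces(["em0", "fxp1", "de0"]))  # prints ['de0']
```

An empty result only means every interface is on the list in the message; per the message, an unlisted driver is expected to cost performance rather than be unsafe.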
> (2) If you are comfortable that you are using an MPSAFE-supported
> configuration, then you can use the following tunable in loader.conf
> to disable the Giant lock over the network stack on your system:
>
> debug.mpsafenet="1"
>
> Note that this is a boot-time only flag; you can inspect the setting
> with a sysctl, but it cannot currently be changed at runtime. You
> will need to reboot for the change to take effect.
>
> Once the default has changed, it will be necessary to explicitly
> disable Giant-free networking if that is the desired operating mode.
> Specifically, you will need to place the following in loader.conf to
> get that mode of operation:
>
> debug.mpsafenet="0"
>
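To see which setting a given loader.conf would request, a small parser can be sketched in Python. This is a sketch under assumptions: it handles only the simple name="value" line form shown above, it assumes the last assignment in the file is the one that counts, and the function name is hypothetical. The default is supplied by the caller because it differs between branches.

```python
import re

# Matches lines such as: debug.mpsafenet="1"   (quotes optional)
_ASSIGN = re.compile(r'^\s*([A-Za-z0-9_.]+)\s*=\s*"?([^"#\s]*)"?')


def loader_tunable(text, name, default=None):
    """Return the value assigned to `name` in loader.conf-style text."""
    value = default
    for line in text.splitlines():
        match = _ASSIGN.match(line)
        # Comment lines start with '#', so they never match the pattern.
        if match and match.group(1) == name:
            value = match.group(2)  # a later assignment replaces an earlier one
    return value


sample = '# test Giant-free networking\ndebug.mpsafenet="1"\n'
print(loader_tunable(sample, "debug.mpsafenet", default="0"))  # prints 1
```

On a running system, `sysctl debug.mpsafenet` shows the value actually in effect, which is the one to quote in a bug report.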
> Some notes:
>
> On SMP-centric performance measurements, such as local UNIX domain
> socket use by MySQL on MP systems, I've observed 30%-40% performance
> improvements by disabling Giant (some details below). My recommended
> configuration for testing out the impact of disabling Giant on MP
> systems is:
>
> - Running with adaptive mutexes (now the default) and with
> ADAPTIVE_GIANT (also now the default) appears to make a big
> difference.
>
> - Try disabling HTT. In my workloads, which tend to pound the
> kernel, HTT appears to hurt quite a bit. Obviously, the
> effectiveness of HTT depends on the instruction mix, so this may not
> be for you. Builds, for example, may benefit.
>
> - Pick one of ULE and 4BSD, and then try the other. I found 4BSD
> helped a lot for MySQL, but I've seen other benchmarks with quite
> different results.
>
> - For stability purposes with MySQL, I currently have to disable
> PREEMPTION (currently the default), as the MySQL benchmarks I use are
> pretty thread-centric and trigger preemption-related bugs with the
> kernel threading bits. Recent work-arounds committed should resolve
> this but I have not yet run stability tests.
>
> - If you want to measure performance, make sure to disable
> INVARIANTS, INVARIANT_SUPPORT, WITNESS, etc. Also, confirm that the
> userland malloc debugging features are disabled, as they add cost to
> each free() operation. I believe we now have a handbook with a
> variety of recommendations on performance measurement, such as
> disabling various daemons (such as dhclient, etc). For latency
> measurements, PREEMPTION is generally desired, subject to stability.
>
> - To increase parallelism, especially for inbound packet paths on
> multiple interfaces, set the sysctl/tunable net.isr.enable=1, which
> enables direct dispatch in network interface ithreads, rather than
> deferring to the netisr thread. If each interface is assigned a
> different ithread, their inbound processing paths can run in
> parallel, as well as with loop back traffic running in the global
> netisr thread. We have additional work to do here in terms of
> increasing the chances of parallel dispatch, etc, and it could be that
> in some environments this is not a useful setting. I'd be interested in
> learning about the environments where a negative performance impact
> is measured.
>
> Some notes on bug reporting:
>
> - Make sure to identify that you are running with debug.mpsafenet on.
> If the problem is reproducible, make sure to indicate if it goes
> away or persists when you disable debug.mpsafenet. This will help to
> distinguish network stack problems which are (and are not) a result
> of this work.
>
> - If you appear to be experiencing a hang/deadlock, please try
> running with WITNESS. I'd actually like to see most people running
> with WITNESS for a bit to shake out lock order issues, as I've
> introduced a lot of orders. If experiencing lock order reversals,
> please include the full console warning including stack trace and any
> warning messages prior to the trace identifying locks, etc. If
> dropped to DDB, "show locks" is useful.
>
> - INVARIANTS is also considered good. Even if you aren't running with
> WITNESS, do run with INVARIANTS. Note that there is a measurable
> performance hit for doing so.
>
> - If you experience a hang, see if you can get into DDB -- if you are
> having problems getting in using a console break, try a serial
> console. When debugging, at minimum include DDB 'ps' output, along with
> traces of interesting processes. Typically interesting will be
> processes that appear to be involved in the hang, etc. Obviously,
> this requires some intuition about what causes the hang and I can't
> offer hard and fast rules here. NMI, SW_WATCHDOG, and MP_WATCHDOG
> can all increase the chances of getting to DDB even in hard hangs.
>
> - Experimenting with debug.mpsafenet=1 and UP is also interesting,
> not just SMP. With PREEMPTION turned on, it may result in lower
> latency and/or lower throughput. Or not. Regardless, it's
> interesting -- you don't have to have SMP to give it a spin.
>
> FYI, while results can and will vary, I was pleased to observe moving
> from a UP->MP speedup of 1.07 on a dual-processor box to a speedup of
> 1.42 with the supersmack benchmark using 11 workers and 1000 select
> transactions with MySQL. For reference, that was with the 4BSD
> scheduler and adaptive mutexes. For loopback netperf with TCP and
> UDP, I observed no change in performance (well, 1% better for UDP RR,
> but basically no change). Note that the MySQL benchmark here is
> basically a UNIX domain socket IPC test, and so real world databases
> will give pretty different results since they won't be pure IPC. The
> results appear to be very sensitive to the choice of scheduler, and
> for a variety of reasons I've preferred 4BSD during recent testing
> (not least, better results in terms of throughput).
>
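The two speedup figures above can be turned into a relative gain with a line of arithmetic, sketched here in Python. It assumes the uniprocessor baseline is the same in both runs, which the message does not state, so treat the result as a rough cross-check rather than a measurement.

```python
def relative_gain(before, after):
    """Fractional improvement going from `before` to `after`."""
    return (after - before) / before


# UP->MP speedup with Giant (1.07) versus without Giant (1.42), from the message.
gain = relative_gain(1.07, 1.42)
print(f"{gain:.1%}")  # prints 32.7%
```

Under that assumption the figure falls inside the 30%-40% range quoted earlier for the MySQL UNIX domain socket workload.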
> There are a lot of people who have been working on this for quite
> some time -- I can't thank them all here, but I will point at the
> netperf web page as a place to look for ongoing patches, change logs,
> and some credits:
>
> http://www.watson.org/~robert/freebsd/netperf/
>
> The hard work and contributions of these many developers over several
> years are finally coming to fruition! I try to keep the page up to date
> about once a week or so as I drop new patch sets. There's also an
> RSS feed on the change log, which is fairly technical but might be
> interesting to some readers.
>
> Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
> robert at fledge.watson.org    Principal Research Scientist, McAfee Research