"slow path" in network code || IPv6 panic on interface removal

Fri Jan 24 07:36:24 UTC 2014

Hello guys!

Typically we're mostly interested in making the "fast" paths in our code
run faster. However, it seems it is time to take care of code which
is either called rarely or is quite complex in terms of relative code
size and/or locking.

Some good examples from the current codebase are probably:
* L3->L2 mapping like ARP handling - while doing arpresolve we
discover there is no valid entry, so we start doing complex locking and
request preparing/sending in the same piece of code. This washes out
both i/d caches and makes the sending process _more_ unpredictable.
  Here we can queue the given mbuf for delayed processing and return
  quickly.

* ip_fastfwd() handling corner cases. This is already optimized in terms
of splitting "fast" and "slow" code paths for all cases.

* ipfw(4) (and probably other pfil consumers) generating/sending various 
icmp/icmp6 packets for inbound mbuf
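
To make the "queue and return quickly" idea concrete, here is a rough
user-space sketch. It is not kernel code and is not taken from the
attached patch: struct pkt, struct deferq, fast_output() and
slow_drain() are all made-up names, and the "resolved" array only
stands in for a real L3->L2 cache.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; not the kernel's struct mbuf or LLE. */
struct pkt {
    int dst;            /* destination "L3 address" */
    struct pkt *next;   /* link in the deferred queue */
};

struct deferq {
    struct pkt *head, *tail;
    size_t len;
};

/* Toy L3->L2 cache: dst is resolved iff resolved[dst] is true. */
#define NDST 8
static bool resolved[NDST];

static void deferq_push(struct deferq *q, struct pkt *p)
{
    p->next = NULL;
    if (q->tail != NULL)
        q->tail->next = p;
    else
        q->head = p;
    q->tail = p;
    q->len++;
}

static struct pkt *deferq_pop(struct deferq *q)
{
    struct pkt *p = q->head;

    if (p == NULL)
        return NULL;
    q->head = p->next;
    if (q->head == NULL)
        q->tail = NULL;
    p->next = NULL;
    q->len--;
    return p;
}

/* Fast path: returns true if sent at once, false if deferred. */
static bool fast_output(struct deferq *q, struct pkt *p)
{
    if (p->dst >= 0 && p->dst < NDST && resolved[p->dst])
        return true;     /* cache hit: transmit immediately */
    deferq_push(q, p);   /* miss: no locking or request building here */
    return false;
}

/* Slow path, run later: resolve and "send" everything queued. */
static size_t slow_drain(struct deferq *q)
{
    size_t n = 0;
    struct pkt *p;

    while ((p = deferq_pop(q)) != NULL) {
        if (p->dst >= 0 && p->dst < NDST)
            resolved[p->dst] = true;   /* stands in for ARP/ND resolution */
        n++;
    }
    return n;
}
```

The point is only that the sender's code path stays short on a miss;
all the heavy work moves to slow_drain(), which would run from a
separate context.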

What exactly is proposed:
- One more netisr queue for handling different types of packets
- metainfo is stored in an mbuf_tag attached to the packet
- ifnet departure handler taking care of packets queued from/to a killed ifnet
- API to register/unregister/dispatch a given type of traffic
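
As an illustration of the register/unregister/dispatch part, a minimal
user-space sketch follows. All names here (dly_register(),
dly_unregister(), dly_dispatch(), struct dly_tag) are hypothetical and
are not claimed to match the attached patch; struct dly_tag just plays
the role of the metainfo that would live in an mbuf tag. Queueing and
the ifnet departure handler are left out.

```c
#include <stddef.h>

#define DLY_MAXTYPES 8   /* hypothetical limit on traffic types */

/* Metadata that would travel in an mbuf tag in the real proposal. */
struct dly_tag {
    int type;      /* which registered handler should process the packet */
    int ifindex;   /* interface the packet was queued from/to */
};

struct dly_pkt {
    struct dly_tag tag;
    void *data;
};

typedef int (*dly_handler_t)(struct dly_pkt *);

static dly_handler_t dly_handlers[DLY_MAXTYPES];

/* Returns 0 on success, -1 if the type is invalid or already taken. */
static int dly_register(int type, dly_handler_t h)
{
    if (type < 0 || type >= DLY_MAXTYPES || h == NULL ||
        dly_handlers[type] != NULL)
        return -1;
    dly_handlers[type] = h;
    return 0;
}

/* Returns 0 on success, -1 if nothing was registered for the type. */
static int dly_unregister(int type)
{
    if (type < 0 || type >= DLY_MAXTYPES || dly_handlers[type] == NULL)
        return -1;
    dly_handlers[type] = NULL;
    return 0;
}

/* Dispatch by tag type; returns -1 when no handler is registered. */
static int dly_dispatch(struct dly_pkt *p)
{
    int type = p->tag.type;

    if (type < 0 || type >= DLY_MAXTYPES || dly_handlers[type] == NULL)
        return -1;
    return dly_handlers[type](p);
}

/* Example handler: reports the interface index carried in the tag. */
static int dly_echo_ifindex(struct dly_pkt *p)
{
    return p->tag.ifindex;
}
```

A consumer would register a handler for its type once, tag each packet
before queueing it, and let the netisr side call dly_dispatch() for
every dequeued packet.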

Real problem which is solved by this approach (traced by ae@):

We're using per-LLE IPv6 timers for various purposes; most of them
require LLE modifications, so the timer function starts with the LLE
write lock held.

Some timer events require us to send neighbour solicitation messages,
which involves a) source address selection (requiring the LLE lock to
be held) and b) calling ip6_output(), which requires the LLE lock not
to be held. This is solved exactly as in the IPv4 ARP handling code:
the timer function drops the write lock before calling nd6_ns_output().

Dropping and reacquiring a lock is error-prone. For example, the
following scenario is possible (traced by ae@):

We're calling if_detach(ifp) (thread 1) and nd6_llinfo_timer (thread 2).
Then the following can happen:

#1 T2 releases LLE lock and runs nd6_ns_output().
#2 T1 proceeds with detaching: in6_ifdetach() -> in6_purgeaddr() -> 
nd6_rem_ifa_lle() -> in6_lltable_prefix_free() which removes all LLEs 
for given prefix acquiring each LLE write lock. "Our" LLE is not 
destroyed since it is refcounted by nd6_llinfo_settimer_locked().

#3 T2 proceeds with nd6_ns_output() selecting source address (which 
involves acquiring LLE read lock)

#4 T1 finishes with detaching interface addresses and sets ifp->if_addr 
to NULL

#5 T2 calls nd6_ifptomac() which reads interface MAC from ifp->if_addr

#6 User inspects core generated by previous call

Using the new API, we can avoid #6 by making the following code changes:
* LLE timer does not drop/reacquire LLE lock
* we require nd6_ns_output callers to lock LLE if it is provided
* nd6_ns_output() uses "slow" path instead of sending mbuf to 
ip6_output() immediately if LLE is not NULL.
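
The changes listed above can be modelled roughly as follows. This is a
single-threaded user-space toy, not the patch itself: struct ifnet_s,
struct lle_s, the nsq[] array and every function name are invented for
illustration, the "lock" is just a flag, and a real implementation
would need proper locking around the queue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct ifnet_s { const char *if_addr; };   /* NULL once detached */
struct lle_s   { bool locked; struct ifnet_s *ifp; };

/* One deferred neighbour solicitation request. */
struct ns_req { struct ifnet_s *ifp; bool valid; };

#define NSQ_MAX 16
static struct ns_req nsq[NSQ_MAX];
static size_t nsq_len;

/* Called with the LLE lock held; only queues, never sends. */
static bool ns_output_deferred(struct lle_s *lle)
{
    assert(lle->locked);
    if (nsq_len == NSQ_MAX)
        return false;
    nsq[nsq_len].ifp = lle->ifp;
    nsq[nsq_len].valid = true;
    nsq_len++;
    return true;
}

/* Timer: takes the lock once and holds it until it is done. */
static bool llinfo_timer(struct lle_s *lle)
{
    bool queued;

    lle->locked = true;
    queued = ns_output_deferred(lle);   /* no unlock/relock window */
    lle->locked = false;
    return queued;
}

/* Departure handler: invalidate requests for the dying interface. */
static size_t ifnet_departure(struct ifnet_s *ifp)
{
    size_t dropped = 0;

    for (size_t i = 0; i < nsq_len; i++) {
        if (nsq[i].valid && nsq[i].ifp == ifp) {
            nsq[i].valid = false;
            dropped++;
        }
    }
    ifp->if_addr = NULL;
    return dropped;
}

/* Slow-path worker: sends what is still valid; returns number sent. */
static size_t ns_drain(void)
{
    size_t sent = 0;

    for (size_t i = 0; i < nsq_len; i++) {
        if (!nsq[i].valid)
            continue;
        /* This is where the interface MAC would be read. */
        assert(nsq[i].ifp->if_addr != NULL);
        sent++;
    }
    nsq_len = 0;
    return sent;
}
```

Because the timer never releases the lock midway and the departure
handler purges queued requests, the worker should not reach the MAC
read for an interface that has already been detached.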

What do you think?

-------------- next part --------------
A non-text attachment was scrubbed...
Name: dly_fin2.diff
Type: text/x-patch
Size: 21720 bytes
Desc: not available
URL: <http://lists.freebsd.org/pipermail/freebsd-hackers/attachments/20140124/751b547c/attachment.bin>