Network stack changes

Alexander V. Chernikov melifaro at yandex-team.ru
Wed Aug 28 18:30:46 UTC 2013


Hello list!

There are constantly recurring discussions about networking stack
performance and changes.

I'll try to summarize the current problems and possible solutions from
my point of view.
(Generally this is one problem: the stack is
slooooooooooooooooooooooooooow, but we need to know why and what to do
about it).

Let's start with current IPv4 packet flow on a typical router:
http://static.ipfw.ru/images/freebsd_ipv4_flow.png

(I'm sorry I can't provide this as text since Visio doesn't have any
'ascii-art' exporter).

Note that we are using a process-to-completion model, i.e. we process
any packet in the ISR until it is either
consumed by the L4+ stack, dropped, or put on an egress NIC queue.

(There is also a deferred ISR model implemented inside netisr, but it
does not change much:
it can help to do more fine-grained hashing (for GRE or other similar
traffic), but
1) it uses per-packet mutex locking, which kills performance;
2) it currently does not have _any_ hashing functions (see the absence
of flags in `netstat -Q`).
People using http://static.ipfw.ru/patches/netisr_ip_flowid.diff (or a
modified PPPoE/GRE version)
report some profit, but without fixing (1) it can't help much.
)
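
As an illustration, here is a minimal sketch of the kind of trivial L3
hash netisr could fall back to when the NIC supplies no flowid. This is
not existing kernel code; the function name is mine, and a real version
would also need to parse IPv6, GRE, and short or fragmented mbufs:

#include <sys/param.h>
#include <sys/mbuf.h>
#include <netinet/in.h>
#include <netinet/ip.h>

/*
 * Sketch only: trivial L3 flow hash for packets without an
 * RSS-supplied flowid.  Assumes the IP header is contiguous.
 */
static uint32_t
ip_flow_hash(struct mbuf *m)
{
	struct ip *ip;

	if (m->m_len < sizeof(struct ip))
		return (0);		/* degrade to a single queue */
	ip = mtod(m, struct ip *);
	/* Mix addresses and protocol so distinct flows spread out. */
	return (ntohl(ip->ip_src.s_addr ^ ip->ip_dst.s_addr) ^ ip->ip_p);
}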

So, let's start:

1) Ixgbe uses a mutex to protect each RX ring, which is perfectly fine
since there is nearly no contention
(the only thing that can happen is driver reconfiguration, which is
rare and, more significantly, handled once
per batch of packets received in a given interrupt). However, due to
some (im)possible deadlocks, the current code
does a per-packet ring unlock/lock (see ixgbe_rx_input()).
There was a discussion that ended with nothing:
http://lists.freebsd.org/pipermail/freebsd-net/2012-October/033520.html

1*) Possible BPF users. Here we take one rlock if any readers are
present
(and a mutex for any matching packets, but this is more or less OK.
Additionally, there is WIP to implement multiqueue BPF,
and there is a chance that we can reduce lock contention there). There
is also an "optimize_writers" hack permitting applications
like CDP to use BPF as writers without registering them as receivers
(which would imply the rlock).

2/3) Virtual interfaces (laggs/vlans over lagg and other similar
constructions).
Currently we simply take an rlock to do s/ix0/lagg0/ and, even more
amusing, we use a complex vlan_hash, with another rlock, to
get the vlan interface from the underlying one.

This is definitely not how things should be done, and it can be
changed more or less easily.

There are some useful terms/techniques in the world of software/hardware
routing: they have a clear 'control plane' and 'data plane' separation.
The former deals with control traffic (IGP, MLD, IGMP snooping, lagg
hellos, ARP/NDP, etc.) and some data traffic (packets with TTL=1, with
options, destined to hosts without an ARP/NDP record, and similar). The
latter is done in hardware (or an efficient software implementation).
The control plane is responsible for providing the data needed for
efficient data plane operation. This is the point we are missing nearly
everywhere.

What I want to say is: lagg is pure control-plane stuff, and vlan is
nearly the same. We can't apply this approach to complex cases like
lagg-over-vlans-over-vlans-over-(pppoe_ng0-and_wifi0),
but we definitely can do it for the most common setups, like igb* or
ix* in a lagg, with or without vlans on top of the lagg.

We already have some capabilities like VLANHWFILTER/VLANHWTAG, and we
can add more. We even have per-driver hooks to program HW filtering.

One small step is to deliver the packet directly to the vlan interface
(P1); proof-of-concept (working in production):
http://lists.freebsd.org/pipermail/freebsd-net/2013-April/035270.html

Another is to change lagg packet accounting:
http://lists.freebsd.org/pipermail/svn-src-all/2013-April/067570.html
Again, this is more like what HW boxes do (aggregate all counters,
including errors) (and I can't imagine what real error we could get
from _lagg_).

4) If we are a router, we can either do the slooow ip_input() ->
ip_forward() -> ip_output() cycle, or use the optimized ip_fastfwd(),
which falls back to the 'slow' path for multicast/options/local traffic
(i.e. it works exactly like the 'data plane' part).
(Btw, we can consider turning net.inet.ip.fastforwarding on by default,
at least for non-IPSEC kernels.)
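
For example, to enable it at runtime:

	# sysctl net.inet.ip.fastforwarding=1

and net.inet.ip.fastforwarding=1 in /etc/sysctl.conf makes it
persistent across reboots.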

Here we have to determine whether this is a local packet or not, i.e.
F(dst_ip) returning 1 or 0. Currently we simply use the standard rlock
+ hash of interface addresses.
(And some consumers like ipfw(4) do the same, but without the lock.)
We don't need to do this! We can build a sorted array of IPv4
addresses, or another efficient structure, on every address change and
use it unlocked, with delayed garbage collection (proof-of-concept
attached; a sketch of the sorted-array variant is below).
(There is another thing to discuss: maybe we can do this once somewhere
in ip_input() and mark the mbuf as 'local/non-local'?)
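
The attached proof-of-concept uses a rebuilt hash; the sorted-array
variant would look roughly like this. All names here are hypothetical,
and the snapshot pointer is assumed to be swapped atomically on address
change and reclaimed later by the delayed GC:

#include <netinet/in.h>

/*
 * Sketch: lockless "is this a local address?" check against an
 * immutable sorted snapshot.  The snapshot is rebuilt on every
 * address change, published with a single pointer swap and freed
 * via delayed GC, so readers never take a lock.
 */
struct in_laddr_snap {
	uint32_t	count;
	in_addr_t	addr[0];	/* sorted, host byte order */
};

static struct in_laddr_snap *in_laddr_snap;	/* published snapshot */

static int
in_localip_sorted(struct in_addr in)
{
	struct in_laddr_snap *s;
	uint32_t lo, hi, mid;
	in_addr_t key;

	if ((s = in_laddr_snap) == NULL)	/* single snapshot read */
		return (0);
	key = ntohl(in.s_addr);
	for (lo = 0, hi = s->count; lo < hi; ) {
		mid = (lo + hi) / 2;
		if (s->addr[mid] == key)
			return (1);
		if (s->addr[mid] < key)
			lo = mid + 1;
		else
			hi = mid;
	}
	return (0);
}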

5, 9) Currently we have L3 ingress/egress PFIL hooks protected by 
rmlocks. This is OK.

However, 6) and 7) are not.
A firewall can use the same pfil lock as reader protection without
imposing its own lock; the pfil & ipfw code is currently ready to do
this.

8) The radix/rt* API. This is probably the worst place in the entire
stack. It is too generic, too slow, and buggy (do you use IPv6? then
you definitely know what I'm talking about).
A) It really is too generic, and the assumption that it can be used
(efficiently) for every family is wrong. Two examples:
we don't need to look up all 128 bits of an IPv6 address. Subnets with
masks longer than /64 are not widely used (actually, the only reason to
use them is p2p links, due to potential ND problems).
One common solution is to look up 64 bits and build another trie (or
other structure) for the collision case.
Another example is MPLS, where we can simply do a direct array lookup
based on the ingress label.
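
To illustrate the MPLS case, a minimal sketch ('struct nhop_mpls' and
'mpls_lft' are hypothetical; a real implementation would probably
chunk the 1M-entry array instead of allocating it flat):

/*
 * Sketch: MPLS ingress lookup as a direct array indexed by the
 * 20-bit label.  One dependent load, no locks, no tree walk.
 */
#define	MPLS_LABEL_MAX	(1 << 20)

static struct nhop_mpls	**mpls_lft;	/* label forwarding table */

static inline struct nhop_mpls *
mpls_lookup(uint32_t label)
{
	if (label >= MPLS_LABEL_MAX)
		return (NULL);
	return (mpls_lft[label]);
}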

B) It is terribly slow (AFAIR luigi@ did some performance measurements;
numbers are available in one of the netmap PDFs).
C) It is not multipath-capable. Stateful (and non-working) multipath is
definitely not the right way.

8*) rtentry
We are doing it wrong.
Currently _every_ lookup locks/unlocks a given rte twice.
The first lock is related to an old, old story about trusting IP
redirects (and auto-adding host routes for them). Fortunately, this is
now disabled automatically when you turn forwarding on.
The second one is much more complicated: we assume that rte's with a
non-zero refcount can keep the egress interface from being destroyed.
This is a wrong (but widely relied-upon) assumption.

We can use delayed GC instead of locking for rte's, and this won't
break things more than they are broken now (patch attached).
We can't do the same for ifp structures, since
a) virtual ones can assume some state in the underlying physical NIC;
b) physical ones just _can_ be destroyed (regardless of whether the
user wants this or not, e.g. an SFP being unplugged from the NIC), or
such destruction can simply lead to a kernel crash due to SW/HW
inconsistency.

One possible solution is to implement stable refcounts based on PCPU
counters and apply those counters to ifp, but this seems to be
non-trivial.
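
A rough sketch of what such a PCPU refcount could look like, assuming
a counter(9)-style per-CPU counter API (all names are hypothetical;
the non-trivial part is exactly that the summed value is only
meaningful once no new acquires can start):

#include <sys/counter.h>

/*
 * Sketch: "stable" refcounting on top of per-CPU counters.
 * Acquire/release touch only the local CPU's counter, so there is
 * no cache-line bouncing on the fast path.
 */
struct pcpu_ref {
	counter_u64_t	pr_cnt;		/* per-CPU +1/-1 deltas */
};

static inline void
pcpu_ref_acquire(struct pcpu_ref *pr)
{
	counter_u64_add(pr->pr_cnt, 1);
}

static inline void
pcpu_ref_release(struct pcpu_ref *pr)
{
	counter_u64_add(pr->pr_cnt, -1);
}

/* Valid only after unlink + grace period, e.g. from the delayed GC. */
static inline int
pcpu_ref_drained(struct pcpu_ref *pr)
{
	return (counter_u64_fetch(pr->pr_cnt) == 0);
}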


Another rtalloc(9) problem is the fact that the radix tree is used as
both the 'control plane' and the 'data plane' structure/API. Some users
always want to put more information into the rte, while others want to
make the rte more compact. We simply need _different_ structures for
that:
a feature-rich, lots-of-data control plane one (to store everything we
want to store, including, for example, the PID of the process
originating the route); the current radix can be modified to do this;
and another, address-family-dependent structure (array, trie, or
anything else) which contains _only_ the data necessary to put the
packet on the wire.
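
As an illustration, the data-plane structure could be as small as this
(a sketch; all field names are mine):

#include <net/if.h>

/*
 * Sketch: compact, address-family-specific data-plane result.
 * Everything needed to put the packet on the wire, nothing else.
 */
struct nhop_data {
	struct ifnet	*nh_ifp;	/* egress interface */
	uint16_t	 nh_mtu;	/* path MTU */
	uint8_t		 nh_flags;	/* gateway/host/etc. */
	uint8_t		 nh_prepend_len;
	char		 nh_prepend[64]; /* prebuilt L2 header */
};

A consumer like ip_output() would then just M_PREPEND()
nh_prepend_len bytes, bcopy() the prebuilt header, and hand the mbuf
to nh_ifp's transmit routine.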

11) arpresolve. Currently (this was decoupled in 8.x) we take
a) an ifaddr rlock,
b) an lle rlock.

We don't need those locks.
We need to
a) make the lle layer per-interface instead of global (this can also
solve the issue of multiple FIBs and of L2 mappings being done in
fib 0);
b) use the rtalloc(9)-provided lock instead of separate locking;
c) actually, we need to rewrite this layer, because
d) lle is actually the place to do real multipath:

briefly,
you have an rte pointing to a special nexthop structure pointing to an
lle, which has the following data:
num_of_egress_ifaces: [ifindex1, ifindex2, ifindex3] | L2 data to
prepend to the header.
A separate post will follow.

With this, we can achieve lagg traffic distribution without actually
using lagg_transmit and similar stuff (at least in the most common
scenarios).
(For example, TCP output can definitely benefit from this, since we can
compute the flowid once per TCP session and use it in every mbuf.)
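
A sketch of how the flowid-based selection could look, building on the
nhop_data sketch above (all names hypothetical):

/*
 * Sketch: nexthop group selection by mbuf flowid.  For TCP, the
 * flowid is computed once at connection setup and stamped into
 * every mbuf, so a session always picks the same egress path.
 */
struct nhop_group {
	uint32_t	  ng_count;	/* number of egress paths */
	struct nhop_data *ng_nh[0];	/* one per path */
};

static inline struct nhop_data *
nhgrp_select(const struct nhop_group *ng, const struct mbuf *m)
{
	uint32_t h;

	h = (m->m_flags & M_FLOWID) ? m->m_pkthdr.flowid : 0;
	return (ng->ng_nh[h % ng->ng_count]);
}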


So. Imagine we have done all this. How can we estimate the difference?

There was a thread, started a year ago, describing 'stock' performance
and the difference made by various modifications.
It was done on 8.x; however, I've got similar results on recent 9.x:

http://lists.freebsd.org/pipermail/freebsd-net/2012-July/032680.html

Briefly:

2x E5645 @ Intel 82599 NIC.
Kernel: FreeBSD-8-S r237994, stock drivers, stock routing, no FLOWTABLE,
no firewall.
Ixia XM2 (traffic generator) <> ix0 (FreeBSD). Ixia sends 64-byte IP
packets from vlan10 (10.100.0.64 - 10.100.0.156) to destinations in
vlan11 (10.100.1.128 - 10.100.1.192). Static ARPs are configured for
all destination addresses. The traffic level is slightly above or
slightly below what the system can handle.

We start from 1.4 MPPS (if we use several routes to minimize mutex
contention).

My 'current' result for the same test, on the same HW, with the
following modifications:

* 1) ixgbe per-packet ring unlock removed
* P1) ixgbe modified to do direct vlan input (so 2 and 3 are not used)
* 4) separate lockless in_localip() version
* 6) using the existing pfil lock
* 7) using a lockless version
* 8) radix converted to use rmlock instead of rwlock; delayed GC is
used instead of mutexes
* 10) using the existing pfil lock
* 11) using the radix lock to do arpresolve(); not taking the lle rlock

(so rmlocks are the only locks used on the data path).

Additionally, ipstat counters are converted to PCPU (no real
performance implications).
ixgbe does not do per-packet accounting (as in head).
if_vlan counters are converted to PCPU.
lagg is converted to rmlock, and per-packet accounting is removed
(using stats from the underlying interfaces).
The lle hash size is bumped to 1024 instead of 32 (not relevant here,
but the small default slows things down for large L2 domains).

The result is 5.6 MPPS for a single port (11 cores) and 6.5 MPPS for
lagg (16 cores), and nearly the same with HT on and 22 cores.

...
while Intel DPDK claims 80 MPPS (and 6WINDGate talks about 160 or so)
on the same class of hardware, with _userland_ forwarding.

One of the key features making all such products (DPDK, netmap,
PacketShader, Cisco SW forwarding) possible is the use of batching
instead of the process-to-completion model.
Batching mitigates locking costs, batching does not wash out the CPU
cache, and so on.

So maybe we can consider passing batches from the NIC to at least the
L2 layer with netisr? Or even up to ip_input()?
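
A rough sketch of the driver side of such batching.
ixgbe_next_rx_mbuf() and if_input_batch() do not exist; they stand
for the per-descriptor dequeue and for whatever netisr/ether_input
entry point would accept more than one packet:

/*
 * Sketch: gather packets into an array under a single RX lock
 * acquisition, drop the lock once, hand the whole batch upstream.
 * Each upper layer can then amortize its own locking over the batch.
 */
#define	RX_BATCH	32

static void
ixgbe_rxeof_batched(struct rx_ring *rxr, struct ifnet *ifp)
{
	struct mbuf *batch[RX_BATCH];
	struct mbuf *m;
	int n;

	IXGBE_RX_LOCK(rxr);
	for (n = 0; n < RX_BATCH; n++) {
		if ((m = ixgbe_next_rx_mbuf(rxr)) == NULL)
			break;
		batch[n] = m;
	}
	IXGBE_RX_UNLOCK(rxr);

	if (n > 0)
		if_input_batch(ifp, batch, n);
}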

Another question is about making some sort of reliable GC, like
"passive serialization" (or the other similarly hard-to-pronounce
terms from the Linux world for managing lockless objects).


P.S. The attached patches are 1) for 8.x and 2) mostly 'hacks' showing
roughly how this can be done and what benefit can be achieved.









-------------- next part --------------
commit 20a52503455c80cd149d2232bdc0d37e14381178
Author: Charlie Root <root at test15.yandex.net>
Date:   Tue Oct 23 21:20:13 2012 +0000

    Remove RX ring unlock/lock before calling if_input() from ixgbe drivers.

diff --git a/sys/dev/ixgbe/ixgbe.c b/sys/dev/ixgbe/ixgbe.c
index 5d8752b..fc1491e 100644
--- a/sys/dev/ixgbe/ixgbe.c
+++ b/sys/dev/ixgbe/ixgbe.c
@@ -4171,9 +4171,7 @@ ixgbe_rx_input(struct rx_ring *rxr, struct ifnet *ifp, struct mbuf *m, u32 ptype
                         if (tcp_lro_rx(&rxr->lro, m, 0) == 0)
                                 return;
         }
-	IXGBE_RX_UNLOCK(rxr);
         (*ifp->if_input)(ifp, m);
-	IXGBE_RX_LOCK(rxr);
 }
 
 static __inline void
-------------- next part --------------
Index: sys/dev/ixgbe/ixgbe.c
===================================================================
--- sys/dev/ixgbe/ixgbe.c	(revision 248704)
+++ sys/dev/ixgbe/ixgbe.c	(working copy)
@@ -2880,6 +2880,14 @@ ixgbe_allocate_queues(struct adapter *adapter)
 			error = ENOMEM;
 			goto err_rx_desc;
 		}
+
+		if ((rxr->vlans = malloc(sizeof(struct ifvlans), M_DEVBUF,
+		    M_NOWAIT | M_ZERO)) == NULL) {
+			device_printf(dev,
+			    "Critical Failure setting up vlan index\n");
+			error = ENOMEM;
+			goto err_rx_desc;
+		}
 	}
 
 	/*
@@ -4271,6 +4279,11 @@ ixgbe_free_receive_buffers(struct rx_ring *rxr)
 		rxr->ptag = NULL;
 	}
 
+	if (rxr->vlans != NULL) {
+		free(rxr->vlans, M_DEVBUF);
+		rxr->vlans = NULL;
+	}
+
 	return;
 }
 
@@ -4303,7 +4316,7 @@ ixgbe_rx_input(struct rx_ring *rxr, struct ifnet *
                                 return;
         }
 	IXGBE_RX_UNLOCK(rxr);
-        (*ifp->if_input)(ifp, m);
+        (*ifp->if_input)(m->m_pkthdr.rcvif, m);
 	IXGBE_RX_LOCK(rxr);
 }
 
@@ -4360,6 +4373,7 @@ ixgbe_rxeof(struct ix_queue *que)
 	u16			count = rxr->process_limit;
 	union ixgbe_adv_rx_desc	*cur;
 	struct ixgbe_rx_buf	*rbuf, *nbuf;
+	struct ifnet		*ifp_dst;
 
 	IXGBE_RX_LOCK(rxr);
 
@@ -4522,9 +4536,19 @@ ixgbe_rxeof(struct ix_queue *que)
 			    (staterr & IXGBE_RXD_STAT_VP))
 				vtag = le16toh(cur->wb.upper.vlan);
 			if (vtag) {
-				sendmp->m_pkthdr.ether_vtag = vtag;
-				sendmp->m_flags |= M_VLANTAG;
-			}
+				ifp_dst = rxr->vlans->idx[EVL_VLANOFTAG(vtag)];
+
+				if (ifp_dst != NULL) {
+					ifp_dst->if_ipackets++;
+					sendmp->m_pkthdr.rcvif = ifp_dst;
+				} else {
+					sendmp->m_pkthdr.ether_vtag = vtag;
+					sendmp->m_flags |= M_VLANTAG;
+					sendmp->m_pkthdr.rcvif = ifp;
+				}
+			} else
+				sendmp->m_pkthdr.rcvif = ifp;
+
 			if ((ifp->if_capenable & IFCAP_RXCSUM) != 0)
 				ixgbe_rx_checksum(staterr, sendmp, ptype);
 #if __FreeBSD_version >= 800000
@@ -4625,7 +4649,32 @@ ixgbe_rx_checksum(u32 staterr, struct mbuf * mp, u
 	return;
 }
 
+/*
+ * This routine gets real vlan ifp based on
+ * underlying ifp and vlan tag.
+ */
+static struct ifnet *
+ixgbe_get_vlan(struct ifnet *ifp, uint16_t vtag)
+{
 
+	/* XXX: IFF_MONITOR */
+#if 0
+	struct lagg_port *lp = ifp->if_lagg;
+	struct lagg_softc *sc = lp->lp_softc;
+
+	/* Skip lagg nesting */
+	while (ifp->if_type == IFT_IEEE8023ADLAG) {
+		lp = ifp->if_lagg;
+		sc = lp->lp_softc;
+		ifp = sc->sc_ifp;
+	}
+#endif
+	/* Get vlan interface based on tag */
+	ifp = VLAN_DEVAT(ifp, vtag);
+
+	return (ifp);
+}
+
 /*
 ** This routine is run via an vlan config EVENT,
 ** it enables us to use the HW Filter table since
@@ -4637,7 +4686,9 @@ static void
 ixgbe_register_vlan(void *arg, struct ifnet *ifp, u16 vtag)
 {
 	struct adapter	*adapter = ifp->if_softc;
-	u16		index, bit;
+	u16		index, bit, j;
+	struct rx_ring	*rxr;
+	struct ifnet	*ifv;
 
 	if (ifp->if_softc !=  arg)   /* Not our event */
 		return;
@@ -4645,7 +4696,20 @@ ixgbe_register_vlan(void *arg, struct ifnet *ifp,
 	if ((vtag == 0) || (vtag > 4095))	/* Invalid */
 		return;
 
+	ifv = ixgbe_get_vlan(ifp, vtag);
+
 	IXGBE_CORE_LOCK(adapter);
+
+	if (ifp->if_capenable & IFCAP_VLAN_HWFILTER) {
+		rxr = adapter->rx_rings;
+
+		for (j = 0; j < adapter->num_queues; j++, rxr++) {
+			IXGBE_RX_LOCK(rxr);
+			rxr->vlans->idx[vtag] = ifv;
+			IXGBE_RX_UNLOCK(rxr);
+		}
+	}
+
 	index = (vtag >> 5) & 0x7F;
 	bit = vtag & 0x1F;
 	adapter->shadow_vfta[index] |= (1 << bit);
@@ -4663,7 +4727,8 @@ static void
 ixgbe_unregister_vlan(void *arg, struct ifnet *ifp, u16 vtag)
 {
 	struct adapter	*adapter = ifp->if_softc;
-	u16		index, bit;
+	u16		index, bit, j;
+	struct rx_ring	*rxr;
 
 	if (ifp->if_softc !=  arg)
 		return;
@@ -4672,6 +4737,15 @@ ixgbe_unregister_vlan(void *arg, struct ifnet *ifp
 		return;
 
 	IXGBE_CORE_LOCK(adapter);
+
+	rxr = adapter->rx_rings;
+
+	for (j = 0; j < adapter->num_queues; j++, rxr++) {
+		IXGBE_RX_LOCK(rxr);
+		rxr->vlans->idx[vtag] = NULL;
+		IXGBE_RX_UNLOCK(rxr);
+	}
+
 	index = (vtag >> 5) & 0x7F;
 	bit = vtag & 0x1F;
 	adapter->shadow_vfta[index] &= ~(1 << bit);
@@ -4686,8 +4760,8 @@ ixgbe_setup_vlan_hw_support(struct adapter *adapte
 {
 	struct ifnet 	*ifp = adapter->ifp;
 	struct ixgbe_hw *hw = &adapter->hw;
+	u32		ctrl, j;
 	struct rx_ring	*rxr;
-	u32		ctrl;
 
 
 	/*
@@ -4713,6 +4787,15 @@ ixgbe_setup_vlan_hw_support(struct adapter *adapte
 	if (ifp->if_capenable & IFCAP_VLAN_HWFILTER) {
 		ctrl &= ~IXGBE_VLNCTRL_CFIEN;
 		ctrl |= IXGBE_VLNCTRL_VFE;
+	} else {
+		/* Zero vlan table */
+		rxr = adapter->rx_rings;
+
+		for (j = 0; j < adapter->num_queues; j++, rxr++) {
+			IXGBE_RX_LOCK(rxr);
+			memset(rxr->vlans->idx, 0, sizeof(struct ifvlans));
+			IXGBE_RX_UNLOCK(rxr);
+		}
 	}
 	if (hw->mac.type == ixgbe_mac_82598EB)
 		ctrl |= IXGBE_VLNCTRL_VME;
Index: sys/dev/ixgbe/ixgbe.h
===================================================================
--- sys/dev/ixgbe/ixgbe.h	(revision 248704)
+++ sys/dev/ixgbe/ixgbe.h	(working copy)
@@ -284,6 +284,11 @@ struct ix_queue {
 	u64			irqs;
 };
 
+struct ifvlans {
+	struct ifnet 		*idx[4096];
+};
+
+
 /*
  * The transmit ring, one per queue
  */
@@ -307,7 +312,6 @@ struct tx_ring {
 	}			queue_status;
 	u32			txd_cmd;
 	bus_dma_tag_t		txtag;
-	char			mtx_name[16];
 #ifndef IXGBE_LEGACY_TX
 	struct buf_ring		*br;
 	struct task		txq_task;
@@ -324,6 +328,7 @@ struct tx_ring {
 	unsigned long   	no_tx_dma_setup;
 	u64			no_desc_avail;
 	u64			total_packets;
+	char			mtx_name[16];
 };
 
 
@@ -346,8 +351,8 @@ struct rx_ring {
 	u16			num_desc;
 	u16			mbuf_sz;
 	u16			process_limit;
-	char			mtx_name[16];
 	struct ixgbe_rx_buf	*rx_buffers;
+	struct ifvlans		*vlans;
 	bus_dma_tag_t		ptag;
 
 	u32			bytes; /* Used for AIM calc */
@@ -363,6 +368,7 @@ struct rx_ring {
 #ifdef IXGBE_FDIR
 	u64			flm;
 #endif
+	char			mtx_name[16];
 };
 
 /* Our adapter structure */
-------------- next part --------------
commit 7f1103ac622881182642b2d3ae17b6ff484c1293
Author: Charlie Root <root at test15.yandex.net>
Date:   Sun Apr 7 23:50:26 2013 +0000

    Use lockless in_localip_fast() function.

diff --git a/sys/net/route.h b/sys/net/route.h
index 4d9371b..f588f03 100644
--- a/sys/net/route.h
+++ b/sys/net/route.h
@@ -365,6 +365,7 @@ void 	 rt_maskedcopy(struct sockaddr *, struct sockaddr *, struct sockaddr *);
  */
 #define RTGC_ROUTE	1
 #define RTGC_IF		3
+#define	RTGC_IFADDR	4
 
 
 int	 rtexpunge(struct rtentry *);
diff --git a/sys/netinet/in.c b/sys/netinet/in.c
index 5341918..a83b8a9 100644
--- a/sys/netinet/in.c
+++ b/sys/netinet/in.c
@@ -93,6 +93,20 @@ VNET_DECLARE(struct inpcbinfo, ripcbinfo);
 VNET_DECLARE(struct arpstat, arpstat);  /* ARP statistics, see if_arp.h */
 #define	V_arpstat		VNET(arpstat)
 
+struct in_ifaddrf {
+	struct in_ifaddrf *next;
+	struct in_addr addr;
+};
+
+struct in_ifaddrhashf {
+	uint32_t hmask;
+	uint32_t count;
+	struct in_ifaddrf **hash;
+};
+
+VNET_DEFINE(struct in_ifaddrhashf *, in_ifaddrhashtblf) = NULL; /* inet addr fast hash table */
+#define	V_in_ifaddrhashtblf	VNET(in_ifaddrhashtblf)
+
 /*
  * Return 1 if an internet address is for a ``local'' host
  * (one to which we have a connection).  If subnetsarelocal
@@ -145,6 +159,120 @@ in_localip(struct in_addr in)
 	return (0);
 }
 
+int
+in_localip_fast(struct in_addr in)
+{
+	struct in_ifaddrf *rec;
+	struct in_ifaddrhashf *f;
+
+	if ((f = V_in_ifaddrhashtblf) == NULL)
+		return (0);
+
+	rec = f->hash[INADDR_HASHVAL(in) & f->hmask];
+
+	while (rec != NULL && rec->addr.s_addr != in.s_addr)
+		rec = rec->next;
+
+	if (rec != NULL)
+		return (1);
+
+	return (0);
+}
+
+struct in_ifaddrhashf *
+in_hash_alloc(int additional)
+{
+	int count, hsize, i;
+	struct in_ifaddr *ia;
+	struct in_ifaddrhashf *new;
+
+	count = additional + 1;
+
+	IN_IFADDR_RLOCK();
+	for (i = 0; i < INADDR_NHASH; i++) {
+		LIST_FOREACH(ia, &V_in_ifaddrhashtbl[i], ia_hash)
+			count++;
+	}
+	IN_IFADDR_RUNLOCK();
+
+	/* roundup to the next power of 2 */
+	hsize = (1UL << flsl(count - 1));
+
+	new = malloc(sizeof(struct in_ifaddrhashf) +
+	    sizeof(void *) * hsize +
+	    sizeof(struct in_ifaddrf) * count, M_IFADDR,
+	    M_NOWAIT | M_ZERO);
+
+	if (new == NULL)
+		return (NULL);
+
+	new->count = count;
+	new->hmask = hsize - 1;
+	new->hash = (struct in_ifaddrf **)(new + 1);
+
+	return (new);
+}
+
+int
+in_hash_build(struct in_ifaddrhashf *new)
+{
+	struct in_ifaddr *ia;
+	int i, j, count, hsize, r;
+	struct in_ifaddrhashf *old;
+	struct in_ifaddrf *rec, *tmp;
+
+	count = new->count - 1;
+	hsize = new->hmask + 1;
+	rec = (struct in_ifaddrf *)&new->hash[hsize];
+
+	IN_IFADDR_RLOCK();
+	for (i = 0; i < INADDR_NHASH; i++) {
+		LIST_FOREACH(ia, &V_in_ifaddrhashtbl[i], ia_hash) {
+			rec->addr.s_addr = IA_SIN(ia)->sin_addr.s_addr;
+
+			j = INADDR_HASHVAL(rec->addr) & new->hmask;
+			if ((tmp = new->hash[j]) == NULL)
+				new->hash[j] = rec;
+			else {
+				while (tmp->next)
+					tmp = tmp->next;
+				tmp->next = rec;
+			}
+
+			rec++;
+			count--;
+
+			/* End of memory */
+			if (count < 0)
+				break;
+		}
+
+		/* End of memory */
+		if (count < 0)
+			break;
+	}
+	IN_IFADDR_RUNLOCK();
+
+	/* If count >= 0 then we succeeded in building the hash. Stop the cycle. */
+
+	if (count >= 0) {
+		old = V_in_ifaddrhashtblf;
+		V_in_ifaddrhashtblf = new;
+
+		rtgc_free(RTGC_IFADDR, old, 0);
+
+		return (1);
+	}
+
+	/* Fail. */
+	if (new)
+		free(new, M_IFADDR);
+
+	return (0);
+}
+
+
+
 /*
  * Determine whether an IP address is in a reserved set of addresses
  * that may not be forwarded, or whether datagrams to that destination
@@ -239,6 +367,7 @@ in_control(struct socket *so, u_long cmd, caddr_t data, struct ifnet *ifp,
 	struct sockaddr_in oldaddr;
 	int error, hostIsNew, iaIsNew, maskIsNew;
 	int iaIsFirst;
+	struct in_ifaddrhashf *new_hash;
 
 	ia = NULL;
 	iaIsFirst = 0;
@@ -405,6 +534,11 @@ in_control(struct socket *so, u_long cmd, caddr_t data, struct ifnet *ifp,
 				goto out;
 			}
 
+			if ((new_hash = in_hash_alloc(1)) == NULL) {
+				error = ENOBUFS;
+				goto out;
+			}
+
 			ifa = &ia->ia_ifa;
 			ifa_init(ifa);
 			ifa->ifa_addr = (struct sockaddr *)&ia->ia_addr;
@@ -427,6 +561,8 @@ in_control(struct socket *so, u_long cmd, caddr_t data, struct ifnet *ifp,
 			IN_IFADDR_WLOCK();
 			TAILQ_INSERT_TAIL(&V_in_ifaddrhead, ia, ia_link);
 			IN_IFADDR_WUNLOCK();
+
+			in_hash_build(new_hash);
 			iaIsNew = 1;
 		}
 		break;
@@ -649,6 +785,8 @@ in_control(struct socket *so, u_long cmd, caddr_t data, struct ifnet *ifp,
 			ifa_free(&if_ia->ia_ifa);
 	} else
 		IN_IFADDR_WUNLOCK();
+	if ((new_hash = in_hash_alloc(0)) != NULL)
+		in_hash_build(new_hash);
 	ifa_free(&ia->ia_ifa);				/* in_ifaddrhead */
 out:
 	if (ia != NULL)
@@ -852,6 +990,7 @@ in_ifinit(struct ifnet *ifp, struct in_ifaddr *ia, struct sockaddr_in *sin,
 	register u_long i = ntohl(sin->sin_addr.s_addr);
 	struct sockaddr_in oldaddr;
 	int s = splimp(), flags = RTF_UP, error = 0;
+	struct in_ifaddrhashf *new_hash;
 
 	oldaddr = ia->ia_addr;
 	if (oldaddr.sin_family == AF_INET)
@@ -862,6 +1001,9 @@ in_ifinit(struct ifnet *ifp, struct in_ifaddr *ia, struct sockaddr_in *sin,
 		LIST_INSERT_HEAD(INADDR_HASH(ia->ia_addr.sin_addr.s_addr),
 		    ia, ia_hash);
 		IN_IFADDR_WUNLOCK();
+
+		if ((new_hash = in_hash_alloc(1)) != NULL)
+			in_hash_build(new_hash);
 	}
 	/*
 	 * Give the interface a chance to initialize
@@ -887,6 +1029,8 @@ in_ifinit(struct ifnet *ifp, struct in_ifaddr *ia, struct sockaddr_in *sin,
 				 */
 				LIST_REMOVE(ia, ia_hash);
 			IN_IFADDR_WUNLOCK();
+			if ((new_hash = in_hash_alloc(1)) != NULL)
+				in_hash_build(new_hash);
 			return (error);
 		}
 	}
diff --git a/sys/netinet/in.h b/sys/netinet/in.h
index b03e74c..948938a 100644
--- a/sys/netinet/in.h
+++ b/sys/netinet/in.h
@@ -741,6 +741,7 @@ int	 in_broadcast(struct in_addr, struct ifnet *);
 int	 in_canforward(struct in_addr);
 int	 in_localaddr(struct in_addr);
 int	 in_localip(struct in_addr);
+int	 in_localip_fast(struct in_addr);
 int	 inet_aton(const char *, struct in_addr *); /* in libkern */
 char	*inet_ntoa(struct in_addr); /* in libkern */
 char	*inet_ntoa_r(struct in_addr ina, char *buf); /* in libkern */
diff --git a/sys/netinet/ip_fastfwd.c b/sys/netinet/ip_fastfwd.c
index 692e3e5..f7734a9 100644
--- a/sys/netinet/ip_fastfwd.c
+++ b/sys/netinet/ip_fastfwd.c
@@ -347,7 +347,7 @@ ip_fastforward(struct mbuf *m)
 	/*
 	 * Is it for a local address on this host?
 	 */
-	if (in_localip(ip->ip_dst))
+	if (in_localip_fast(ip->ip_dst))
 		return m;
 
 	//IPSTAT_INC(ips_total);
@@ -390,7 +390,7 @@ ip_fastforward(struct mbuf *m)
 		/*
 		 * Is it now for a local address on this host?
 		 */
-		if (in_localip(dest))
+		if (in_localip_fast(dest))
 			goto forwardlocal;
 		/*
 		 * Go on with new destination address
@@ -479,7 +479,7 @@ passin:
 		/*
 		 * Is it now for a local address on this host?
 		 */
-		if (m->m_flags & M_FASTFWD_OURS || in_localip(dest)) {
+		if (m->m_flags & M_FASTFWD_OURS || in_localip_fast(dest)) {
 forwardlocal:
 			/*
 			 * Return packet for processing by ip_input().
diff --git a/sys/netinet/ipfw/ip_fw2.c b/sys/netinet/ipfw/ip_fw2.c
index b76a638..53f6e97 100644
--- a/sys/netinet/ipfw/ip_fw2.c
+++ b/sys/netinet/ipfw/ip_fw2.c
@@ -1450,10 +1450,7 @@ do {								\
 
 			case O_IP_SRC_ME:
 				if (is_ipv4) {
-					struct ifnet *tif;
-
-					INADDR_TO_IFP(src_ip, tif);
-					match = (tif != NULL);
+					match = in_localip_fast(src_ip);
 					break;
 				}
 #ifdef INET6
@@ -1490,10 +1487,7 @@ do {								\
 
 			case O_IP_DST_ME:
 				if (is_ipv4) {
-					struct ifnet *tif;
-
-					INADDR_TO_IFP(dst_ip, tif);
-					match = (tif != NULL);
+					match = in_localip_fast(dst_ip);
 					break;
 				}
 #ifdef INET6
diff --git a/sys/netinet/ipfw/ip_fw_pfil.c b/sys/netinet/ipfw/ip_fw_pfil.c
index a21f501..bdf8beb 100644
--- a/sys/netinet/ipfw/ip_fw_pfil.c
+++ b/sys/netinet/ipfw/ip_fw_pfil.c
@@ -184,7 +184,7 @@ again:
 		bcopy(args.next_hop, (fwd_tag+1), sizeof(struct sockaddr_in));
 		m_tag_prepend(*m0, fwd_tag);
 
-		if (in_localip(args.next_hop->sin_addr))
+		if (in_localip_fast(args.next_hop->sin_addr))
 			(*m0)->m_flags |= M_FASTFWD_OURS;
 	    }
 #endif /* INET || INET6 */
-------------- next part --------------
commit 67a74d91a7b4a47a83fcfa5e79a6c6f0b4b1122d
Author: Charlie Root <root at test15.yandex.net>
Date:   Fri Oct 26 17:10:52 2012 +0000

    Remove rte locking for IPv4. Remove one of 2 locks from IPv6 rtes

diff --git a/sys/net/if.c b/sys/net/if.c
index a875326..eb6a723 100644
--- a/sys/net/if.c
+++ b/sys/net/if.c
@@ -487,6 +487,13 @@ if_alloc(u_char type)
 	return (ifp);
 }
 
+
+void
+if_free_real(struct ifnet *ifp)
+{
+	free(ifp, M_IFNET);
+}
+
 /*
  * Do the actual work of freeing a struct ifnet, and layer 2 common
  * structure.  This call is made when the last reference to an
@@ -499,6 +506,15 @@ if_free_internal(struct ifnet *ifp)
 	KASSERT((ifp->if_flags & IFF_DYING),
 	    ("if_free_internal: interface not dying"));
 
+	if (rtgc_is_enabled()) {
+		/* 
+		 * FIXME: Sleep some time to permit packets using the
+		 * fastforwarding routine without locking to die
+		 * without side effects.
+		 */
+		pause("if_free_gc", hz / 20); /* Sleep 50 milliseconds */
+	}
+
 	if (if_com_free[ifp->if_alloctype] != NULL)
 		if_com_free[ifp->if_alloctype](ifp->if_l2com,
 		    ifp->if_alloctype);
@@ -511,7 +527,10 @@ if_free_internal(struct ifnet *ifp)
 	IF_AFDATA_DESTROY(ifp);
 	IF_ADDR_LOCK_DESTROY(ifp);
 	ifq_delete(&ifp->if_snd);
-	free(ifp, M_IFNET);
+	if (rtgc_is_enabled())
+		rtgc_free(RTGC_IF, ifp, 0);
+	else
+		if_free_real(ifp);
 }
 
 /*
diff --git a/sys/net/if_var.h b/sys/net/if_var.h
index 39c499f..5ef6264 100644
--- a/sys/net/if_var.h
+++ b/sys/net/if_var.h
@@ -857,6 +857,7 @@ void	if_down(struct ifnet *);
 struct ifmultiaddr *
 	if_findmulti(struct ifnet *, struct sockaddr *);
 void	if_free(struct ifnet *);
+void	if_free_real(struct ifnet *);
 void	if_free_type(struct ifnet *, u_char);
 void	if_initname(struct ifnet *, const char *, int);
 void	if_link_state_change(struct ifnet *, int);
diff --git a/sys/net/route.c b/sys/net/route.c
index 3059f5a..97965b3 100644
--- a/sys/net/route.c
+++ b/sys/net/route.c
@@ -142,6 +142,175 @@ VNET_DEFINE(int, rttrash);		/* routes not in table but not freed */
 static VNET_DEFINE(uma_zone_t, rtzone);		/* Routing table UMA zone. */
 #define	V_rtzone	VNET(rtzone)
 
+SYSCTL_NODE(_net, OID_AUTO, gc, CTLFLAG_RW, 0, "Garbage collector");
+
+MALLOC_DEFINE(M_RTGC, "rtgc", "route GC");
+void rtgc_func(void *_unused);
+void rtfree_real(struct rtentry *rt);
+
+int _rtgc_default_enabled = 1;
+TUNABLE_INT("net.gc.enable", &_rtgc_default_enabled);
+
+#define	RTGC_CALLOUT_DELAY	1
+#define	RTGC_EXPIRE_DELAY	3
+
+VNET_DEFINE(struct mtx, rtgc_mtx);
+#define	V_rtgc_mtx	VNET(rtgc_mtx)
+VNET_DEFINE(struct callout, rtgc_callout);
+#define	V_rtgc_callout	VNET(rtgc_callout)
+VNET_DEFINE(int, rtgc_enabled);
+#define	V_rtgc_enabled	VNET(rtgc_enabled)
+SYSCTL_VNET_INT(_net_gc, OID_AUTO, enable, CTLFLAG_RW,
+	&VNET_NAME(rtgc_enabled), 1,
+	"Enable garbage collector");
+VNET_DEFINE(int, rtgc_expire_delay) = RTGC_EXPIRE_DELAY;
+#define	V_rtgc_expire_delay	VNET(rtgc_expire_delay)
+SYSCTL_VNET_INT(_net_gc, OID_AUTO, expire, CTLFLAG_RW,
+	&VNET_NAME(rtgc_expire_delay), 1,
+	"Object expiration delay");
+VNET_DEFINE(int, rtgc_numfailures);
+#define	V_rtgc_numfailures	VNET(rtgc_numfailures)
+SYSCTL_VNET_INT(_net_gc, OID_AUTO, failures, CTLFLAG_RD,
+	&VNET_NAME(rtgc_numfailures), 0,
+	"Number of objects leaked from route garbage collector");
+VNET_DEFINE(int, rtgc_numqueued);
+#define	V_rtgc_numqueued	VNET(rtgc_numqueued)
+SYSCTL_VNET_INT(_net_gc, OID_AUTO, queued, CTLFLAG_RD,
+	&VNET_NAME(rtgc_numqueued), 0,
+	"Number of objects queued for deletion");
+VNET_DEFINE(int, rtgc_numfreed);
+#define	V_rtgc_numfreed	VNET(rtgc_numfreed)
+SYSCTL_VNET_INT(_net_gc, OID_AUTO, freed, CTLFLAG_RD,
+	&VNET_NAME(rtgc_numfreed), 0,
+	"Number of objects deleted");
+VNET_DEFINE(int, rtgc_numinvoked);
+#define	V_rtgc_numinvoked	VNET(rtgc_numinvoked)
+SYSCTL_VNET_INT(_net_gc, OID_AUTO, invoked, CTLFLAG_RD,
+	&VNET_NAME(rtgc_numinvoked), 0,
+	"Number of times GC was invoked");
+
+struct rtgc_item {
+	time_t	expire;	/* When we can delete this entry */
+	int 	etype;	/* Entry type */
+	void	*data;	/* data to free */
+	TAILQ_ENTRY(rtgc_item)	items;
+};
+
+VNET_DEFINE(TAILQ_HEAD(, rtgc_item), rtgc_queue);
+#define	V_rtgc_queue	VNET(rtgc_queue)
+
+int
+rtgc_is_enabled()
+{
+	return V_rtgc_enabled;
+}
+
+void
+rtgc_func(void *_unused)
+{
+	struct rtgc_item *item, *temp_item;
+	TAILQ_HEAD(, rtgc_item) rtgc_tq;
+	int empty, deleted;
+
+	CTR2(KTR_NET, "%s: started with %d objects", __func__, V_rtgc_numqueued);
+
+	TAILQ_INIT(&rtgc_tq);
+
+	/* Move all contents of current queue to new empty queue */
+	mtx_lock(&V_rtgc_mtx);
+	V_rtgc_numinvoked++;
+	TAILQ_SWAP(&rtgc_queue, &rtgc_tq, rtgc_item, items);
+	mtx_unlock(&V_rtgc_mtx);
+
+	deleted = 0;
+
+	/* Dispatch as much as we can */
+	TAILQ_FOREACH_SAFE(item, &rtgc_tq, items, temp_item) {
+		if (item->expire > time_uptime)
+			break;
+
+		/* We can definitely delete this item */
+		TAILQ_REMOVE(&rtgc_tq, item, items);
+
+		switch (item->etype) {
+		case RTGC_ROUTE:
+			CTR1(KTR_NET, "Freeing route structure %p", item->data);
+			rtfree_real((struct rtentry *)item->data);
+			break;
+		case RTGC_IF:
+			CTR1(KTR_NET, "Freeing iface structure %p", item->data);
+			if_free_real((struct ifnet *)item->data);
+			break;
+		default:
+			CTR2(KTR_NET, "Unknown type: %d %p", item->etype, item->data);
+			break;
+		}
+
+		/* Remove item itself */
+		free(item, M_RTGC);
+		deleted++;
+	}
+
+	/*
+	 * Add remaining data back to the main queue.
+	 * Note items are still sorted by time_uptime after merge.
+	 */
+
+	mtx_lock(&V_rtgc_mtx);
+	/* Add new items to the end of our temporary queue */
+	TAILQ_CONCAT(&rtgc_tq, &rtgc_queue, items);
+	/* Move items back to stable storage */
+	TAILQ_SWAP(&rtgc_queue, &rtgc_tq, rtgc_item, items);
+	/* Check if we need to run callout another time */
+	empty = TAILQ_EMPTY(&rtgc_queue);
+	/* Update counters */
+	V_rtgc_numfreed += deleted;
+	V_rtgc_numqueued -= deleted;
+	mtx_unlock(&V_rtgc_mtx);
+
+	CTR4(KTR_NET, "%s: ended with %d object(s) (%d deleted), callout: %s",
+		__func__, V_rtgc_numqueued, deleted, empty ? "stopped" : "scheduled");
+	/* Schedule ourself iff there are items to delete */
+	if (!empty)
+		callout_reset(&V_rtgc_callout, hz * RTGC_CALLOUT_DELAY, rtgc_func, NULL);
+}
+
+void
+rtgc_free(int etype, void *data, int can_sleep)
+{
+	struct rtgc_item *item;
+
+	item = malloc(sizeof(struct rtgc_item), M_RTGC, (can_sleep ? M_WAITOK : M_NOWAIT) | M_ZERO);
+	if (item == NULL) {
+		V_rtgc_numfailures++; /* XXX: locking */
+		return; /* Skip route freeing. Memory leak is much better than panic */
+	}
+
+	item->expire = time_uptime + V_rtgc_expire_delay;
+	item->etype = etype;
+	item->data = data;
+
+	if ((!can_sleep) && (mtx_trylock(&V_rtgc_mtx) == 0)) {
+		/* Failed to acquire the lock. Add another leak. */
+		free(item, M_RTGC);
+		V_rtgc_numfailures++; /* XXX: locking */
+		return;
+	}
+
+	if (can_sleep)
+		mtx_lock(&V_rtgc_mtx);
+
+	TAILQ_INSERT_TAIL(&rtgc_queue, item, items);
+	V_rtgc_numqueued++;
+
+	mtx_unlock(&V_rtgc_mtx);
+
+	/* Schedule callout if not running */
+	if (!callout_pending(&V_rtgc_callout))
+		callout_reset(&V_rtgc_callout, hz * RTGC_CALLOUT_DELAY, rtgc_func, NULL);
+}
+
+
 /*
  * handler for net.my_fibnum
  */
@@ -241,6 +410,17 @@ vnet_route_init(const void *unused __unused)
 			dom->dom_rtattach((void **)rnh, dom->dom_rtoffset);
 		}
 	}
+
+	/* Init garbage collector */
+	mtx_init(&V_rtgc_mtx, "routeGC", NULL, MTX_DEF);
+	/* Init queue */
+	TAILQ_INIT(&V_rtgc_queue);
+	/* Init garbage callout */
+	memset(&V_rtgc_callout, 0, sizeof(rtgc_callout));
+	callout_init(&V_rtgc_callout, 1);
+	/* Set default from loader tunable */
+	V_rtgc_enabled = _rtgc_default_enabled;
+	//callout_reset(&V_rtgc_callout, 3 * hz, &rtgc_func, NULL);
 }
 VNET_SYSINIT(vnet_route_init, SI_SUB_PROTO_DOMAIN, SI_ORDER_FOURTH,
     vnet_route_init, 0);
@@ -351,6 +531,74 @@ rtalloc1(struct sockaddr *dst, int report, u_long ignflags)
 }
 
 struct rtentry *
+rtalloc1_fib_nolock(struct sockaddr *dst, int report, u_long ignflags,
+		    u_int fibnum)
+{
+	struct radix_node_head *rnh;
+	struct radix_node *rn;
+	struct rtentry *newrt;
+	struct rt_addrinfo info;
+	int err = 0, msgtype = RTM_MISS;
+	int needlock;
+
+	KASSERT((fibnum < rt_numfibs), ("rtalloc1_fib: bad fibnum"));
+	switch (dst->sa_family) {
+	case AF_INET6:
+	case AF_INET:
+		/* We support multiple FIBs. */
+		break;
+	default:
+		fibnum = RT_DEFAULT_FIB;
+		break;
+	}
+	rnh = rt_tables_get_rnh(fibnum, dst->sa_family);
+	newrt = NULL;
+	if (rnh == NULL)
+		goto miss;
+
+	/*
+	 * Look up the address in the table for that Address Family
+	 */
+	needlock = !(ignflags & RTF_RNH_LOCKED);
+	if (needlock)
+		RADIX_NODE_HEAD_RLOCK(rnh);
+#ifdef INVARIANTS	
+	else
+		RADIX_NODE_HEAD_LOCK_ASSERT(rnh);
+#endif
+	rn = rnh->rnh_matchaddr(dst, rnh);
+	if (rn && ((rn->rn_flags & RNF_ROOT) == 0)) {
+		newrt = RNTORT(rn);
+		if (needlock)
+			RADIX_NODE_HEAD_RUNLOCK(rnh);
+		goto done;
+
+	} else if (needlock)
+		RADIX_NODE_HEAD_RUNLOCK(rnh);
+	
+	/*
+	 * Either we hit the root or couldn't find any match,
+	 * Which basically means
+	 * "caint get there frm here"
+	 */
+miss:
+	V_rtstat.rts_unreach++;
+
+	if (report) {
+		/*
+		 * If required, report the failure to the supervising
+		 * Authorities.
+		 * For a delete, this is not an error. (report == 0)
+		 */
+		bzero(&info, sizeof(info));
+		info.rti_info[RTAX_DST] = dst;
+		rt_missmsg_fib(msgtype, &info, 0, err, fibnum);
+	}	
+done:
+	return (newrt);
+}
+
+struct rtentry *
 rtalloc1_fib(struct sockaddr *dst, int report, u_long ignflags,
 		    u_int fibnum)
 {
@@ -422,6 +670,23 @@ done:
 	return (newrt);
 }
 
+
+void
+rtfree_real(struct rtentry *rt)
+{
+	/*
+	 * The key is separately alloc'd so free it (see rt_setgate()).
+	 * This also frees the gateway, as they are always malloc'd
+	 * together.
+	 */
+	Free(rt_key(rt));
+	
+	/*
+	 * and the rtentry itself of course
+	 */
+	uma_zfree(V_rtzone, rt);
+}
+
 /*
  * Remove a reference count from an rtentry.
  * If the count gets low enough, take it out of the routing table
@@ -484,18 +749,13 @@ rtfree(struct rtentry *rt)
 		 */
 		if (rt->rt_ifa)
 			ifa_free(rt->rt_ifa);
-		/*
-		 * The key is separatly alloc'd so free it (see rt_setgate()).
-		 * This also frees the gateway, as they are always malloc'd
-		 * together.
-		 */
-		Free(rt_key(rt));
 
-		/*
-		 * and the rtentry itself of course
-		 */
 		RT_LOCK_DESTROY(rt);
-		uma_zfree(V_rtzone, rt);
+
+		if (V_rtgc_enabled)
+			rtgc_free(RTGC_ROUTE, rt, 0);
+		else
+			rtfree_real(rt);
 		return;
 	}
 done:
diff --git a/sys/net/route.h b/sys/net/route.h
index b26ac44..3aa694d 100644
--- a/sys/net/route.h
+++ b/sys/net/route.h
@@ -363,9 +363,14 @@ void 	 rt_maskedcopy(struct sockaddr *, struct sockaddr *, struct sockaddr *);
  *
  *    RTFREE() uses an unlocked entry.
  */
+#define RTGC_ROUTE	1
+#define RTGC_IF		3
+
 
 int	 rtexpunge(struct rtentry *);
 void	 rtfree(struct rtentry *);
+void	 rtgc_free(int etype, void *data, int can_sleep);
+int	rtgc_is_enabled(void);
 int	 rt_check(struct rtentry **, struct rtentry **, struct sockaddr *);
 
 /* XXX MRT COMPAT VERSIONS THAT SET UNIVERSE to 0 */
@@ -394,6 +399,7 @@ int	 rt_getifa_fib(struct rt_addrinfo *, u_int fibnum);
 void	 rtalloc_ign_fib(struct route *ro, u_long ignflags, u_int fibnum);
 void	 rtalloc_fib(struct route *ro, u_int fibnum);
 struct rtentry *rtalloc1_fib(struct sockaddr *, int, u_long, u_int);
+struct rtentry *rtalloc1_fib_nolock(struct sockaddr *, int, u_long, u_int);
 int	 rtioctl_fib(u_long, caddr_t, u_int);
 void	 rtredirect_fib(struct sockaddr *, struct sockaddr *,
 	    struct sockaddr *, int, struct sockaddr *, u_int);
diff --git a/sys/netinet/in_rmx.c b/sys/netinet/in_rmx.c
index 1389873..1c9d9db 100644
--- a/sys/netinet/in_rmx.c
+++ b/sys/netinet/in_rmx.c
@@ -122,12 +122,12 @@ in_matroute(void *v_arg, struct radix_node_head *head)
 	struct rtentry *rt = (struct rtentry *)rn;
 
 	if (rt) {
-		RT_LOCK(rt);
+//		RT_LOCK(rt);
 		if (rt->rt_flags & RTPRF_OURS) {
 			rt->rt_flags &= ~RTPRF_OURS;
 			rt->rt_rmx.rmx_expire = 0;
 		}
-		RT_UNLOCK(rt);
+//		RT_UNLOCK(rt);
 	}
 	return rn;
 }
@@ -365,7 +365,7 @@ in_inithead(void **head, int off)
 
 	rnh = *head;
 	rnh->rnh_addaddr = in_addroute;
-	rnh->rnh_matchaddr = in_matroute;
+	rnh->rnh_matchaddr = rn_match;
 	rnh->rnh_close = in_clsroute;
 	if (_in_rt_was_here == 0 ) {
 		callout_init(&V_rtq_timer, CALLOUT_MPSAFE);
diff --git a/sys/netinet/ip_fastfwd.c b/sys/netinet/ip_fastfwd.c
index d7fe411..d2b98b3 100644
--- a/sys/netinet/ip_fastfwd.c
+++ b/sys/netinet/ip_fastfwd.c
@@ -112,6 +112,22 @@ static VNET_DEFINE(int, ipfastforward_active);
 SYSCTL_VNET_INT(_net_inet_ip, OID_AUTO, fastforwarding, CTLFLAG_RW,
     &VNET_NAME(ipfastforward_active), 0, "Enable fast IP forwarding");
 
+void
+rtalloc_ign_fib_nolock(struct route *ro, u_long ignore, u_int fibnum);
+
+void
+rtalloc_ign_fib_nolock(struct route *ro, u_long ignore, u_int fibnum)
+{
+	struct rtentry *rt;
+
+	if ((rt = ro->ro_rt) != NULL) {
+		if (rt->rt_ifp != NULL && rt->rt_flags & RTF_UP)
+			return;
+		ro->ro_rt = NULL;
+	}
+	ro->ro_rt = rtalloc1_fib_nolock(&ro->ro_dst, 1, ignore, fibnum);
+}
+
 static struct sockaddr_in *
 ip_findroute(struct route *ro, struct in_addr dest, struct mbuf *m)
 {
@@ -126,7 +142,7 @@ ip_findroute(struct route *ro, struct in_addr dest, struct mbuf *m)
 	dst->sin_family = AF_INET;
 	dst->sin_len = sizeof(*dst);
 	dst->sin_addr.s_addr = dest.s_addr;
-	in_rtalloc_ign(ro, 0, M_GETFIB(m));
+	rtalloc_ign_fib_nolock(ro, 0, M_GETFIB(m));
 
 	/*
 	 * Route there and interface still up?
@@ -140,8 +156,10 @@ ip_findroute(struct route *ro, struct in_addr dest, struct mbuf *m)
 	} else {
 		IPSTAT_INC(ips_noroute);
 		IPSTAT_INC(ips_cantforward);
+#if 0
 		if (rt)
 			RTFREE(rt);
+#endif
 		icmp_error(m, ICMP_UNREACH, ICMP_UNREACH_HOST, 0, 0);
 		return NULL;
 	}
@@ -334,10 +352,11 @@ ip_fastforward(struct mbuf *m)
 	if (in_localip(ip->ip_dst))
 		return m;
 
-	IPSTAT_INC(ips_total);
+	//IPSTAT_INC(ips_total);
 
 	/*
 	 * Step 3: incoming packet firewall processing
+	in_rtalloc_ign(ro, 0, M_GETFIB(m));
 	 */
 
 	/*
@@ -476,8 +495,10 @@ forwardlocal:
 			 * "ours"-label.
 			 */
 			m->m_flags |= M_FASTFWD_OURS;
+/*
 			if (ro.ro_rt)
 				RTFREE(ro.ro_rt);
+*/				
 			return m;
 		}
 		/*
@@ -490,7 +511,7 @@ forwardlocal:
 			m_tag_delete(m, fwd_tag);
 		}
 #endif /* IPFIREWALL_FORWARD */
-		RTFREE(ro.ro_rt);
+//		RTFREE(ro.ro_rt);
 		if ((dst = ip_findroute(&ro, dest, m)) == NULL)
 			return NULL;	/* icmp unreach already sent */
 		ifp = ro.ro_rt->rt_ifp;
@@ -601,17 +622,21 @@ passout:
 	if (error != 0)
 		IPSTAT_INC(ips_odropped);
 	else {
+#if 0
 		ro.ro_rt->rt_rmx.rmx_pksent++;
 		IPSTAT_INC(ips_forward);
 		IPSTAT_INC(ips_fastforward);
+#endif
 	}
 consumed:
-	RTFREE(ro.ro_rt);
+//	RTFREE(ro.ro_rt);
 	return NULL;
 drop:
 	if (m)
 		m_freem(m);
+/*
 	if (ro.ro_rt)
 		RTFREE(ro.ro_rt);
+*/		
 	return NULL;
 }
diff --git a/sys/netinet6/in6_rmx.c b/sys/netinet6/in6_rmx.c
index b526030..9aabe63 100644
--- a/sys/netinet6/in6_rmx.c
+++ b/sys/netinet6/in6_rmx.c
@@ -195,12 +195,12 @@ in6_matroute(void *v_arg, struct radix_node_head *head)
 	struct rtentry *rt = (struct rtentry *)rn;
 
 	if (rt) {
-		RT_LOCK(rt);
+		//RT_LOCK(rt);
 		if (rt->rt_flags & RTPRF_OURS) {
 			rt->rt_flags &= ~RTPRF_OURS;
 			rt->rt_rmx.rmx_expire = 0;
 		}
-		RT_UNLOCK(rt);
+		//RT_UNLOCK(rt);
 	}
 	return rn;
 }
@@ -440,7 +440,7 @@ in6_inithead(void **head, int off)
 
 	rnh = *head;
 	rnh->rnh_addaddr = in6_addroute;
-	rnh->rnh_matchaddr = in6_matroute;
+	rnh->rnh_matchaddr = rn_match;
 
 	if (V__in6_rt_was_here == 0) {
 		callout_init(&V_rtq_timer6, CALLOUT_MPSAFE);
-------------- next part --------------
commit 0e7cebd1753c3b77bdc00d728fbd5910c2d2afec
Author: Charlie Root <root at test15.yandex.net>
Date:   Mon Apr 8 15:35:00 2013 +0000

    Make radix use rmlock.

diff --git a/sys/contrib/ipfilter/netinet/ip_compat.h b/sys/contrib/ipfilter/netinet/ip_compat.h
index 31e5b11..5e74da4 100644
--- a/sys/contrib/ipfilter/netinet/ip_compat.h
+++ b/sys/contrib/ipfilter/netinet/ip_compat.h
@@ -870,6 +870,7 @@ typedef	u_int32_t	u_32_t;
 # if (__FreeBSD_version >= 500043)
 #  include <sys/mutex.h>
 #  if (__FreeBSD_version > 700014)
+#   include <sys/rmlock.h>
 #   include <sys/rwlock.h>
 #    define	KRWLOCK_T		struct rwlock
 #    ifdef _KERNEL
diff --git a/sys/contrib/pf/net/pf_table.c b/sys/contrib/pf/net/pf_table.c
index 40c9f67..b1dd703 100644
--- a/sys/contrib/pf/net/pf_table.c
+++ b/sys/contrib/pf/net/pf_table.c
@@ -44,6 +44,7 @@ __FBSDID("$FreeBSD$");
 #include <sys/mbuf.h>
 #include <sys/kernel.h>
 #include <sys/lock.h>
+#include <sys/rmlock.h>
 #include <sys/rwlock.h>
 #ifdef __FreeBSD__
 #include <sys/malloc.h>
diff --git a/sys/kern/subr_witness.c b/sys/kern/subr_witness.c
index e565d01..f913d27 100644
--- a/sys/kern/subr_witness.c
+++ b/sys/kern/subr_witness.c
@@ -508,7 +508,7 @@ static struct witness_order_list_entry order_lists[] = {
 	 * Routing
 	 */
 	{ "so_rcv", &lock_class_mtx_sleep },
-	{ "radix node head", &lock_class_rw },
+	{ "radix node head", &lock_class_rm },
 	{ "rtentry", &lock_class_mtx_sleep },
 	{ "ifaddr", &lock_class_mtx_sleep },
 	{ NULL, NULL },
diff --git a/sys/kern/sys_socket.c b/sys/kern/sys_socket.c
index 4cbae74..fea12d0 100644
--- a/sys/kern/sys_socket.c
+++ b/sys/kern/sys_socket.c
@@ -50,6 +50,8 @@ __FBSDID("$FreeBSD$");
 #include <sys/ucred.h>
 
 #include <net/if.h>
+#include <sys/lock.h>
+#include <sys/rmlock.h>
 #include <net/route.h>
 #include <net/vnet.h>
 
diff --git a/sys/kern/vfs_export.c b/sys/kern/vfs_export.c
index 4185211..848c232 100644
--- a/sys/kern/vfs_export.c
+++ b/sys/kern/vfs_export.c
@@ -47,7 +47,7 @@ __FBSDID("$FreeBSD$");
 #include <sys/mbuf.h>
 #include <sys/mount.h>
 #include <sys/mutex.h>
-#include <sys/rwlock.h>
+#include <sys/rmlock.h>
 #include <sys/refcount.h>
 #include <sys/socket.h>
 #include <sys/systm.h>
@@ -427,6 +427,7 @@ vfs_export_lookup(struct mount *mp, struct sockaddr *nam)
 	register struct netcred *np;
 	register struct radix_node_head *rnh;
 	struct sockaddr *saddr;
+	RADIX_NODE_HEAD_READER;
 
 	nep = mp->mnt_export;
 	if (nep == NULL)
diff --git a/sys/net/if.c b/sys/net/if.c
index 5ecde8c..351e046 100644
--- a/sys/net/if.c
+++ b/sys/net/if.c
@@ -51,6 +51,7 @@
 #include <sys/lock.h>
 #include <sys/refcount.h>
 #include <sys/module.h>
+#include <sys/rmlock.h>
 #include <sys/rwlock.h>
 #include <sys/sockio.h>
 #include <sys/syslog.h>
diff --git a/sys/net/radix.c b/sys/net/radix.c
index 33fcf82..d8d1e8b 100644
--- a/sys/net/radix.c
+++ b/sys/net/radix.c
@@ -37,7 +37,7 @@
 #ifdef	_KERNEL
 #include <sys/lock.h>
 #include <sys/mutex.h>
-#include <sys/rwlock.h>
+#include <sys/rmlock.h>
 #include <sys/systm.h>
 #include <sys/malloc.h>
 #include <sys/syslog.h>
diff --git a/sys/net/radix.h b/sys/net/radix.h
index 29659b5..2d130f0 100644
--- a/sys/net/radix.h
+++ b/sys/net/radix.h
@@ -36,7 +36,7 @@
 #ifdef _KERNEL
 #include <sys/_lock.h>
 #include <sys/_mutex.h>
-#include <sys/_rwlock.h>
+#include <sys/_rmlock.h>
 #endif
 
 #ifdef MALLOC_DECLARE
@@ -133,7 +133,7 @@ struct radix_node_head {
 	struct	radix_node rnh_nodes[3];	/* empty tree for common case */
 	int	rnh_multipath;			/* multipath capable ? */
 #ifdef _KERNEL
-	struct	rwlock rnh_lock;		/* locks entire radix tree */
+	struct	rmlock rnh_lock;		/* locks entire radix tree */
 #endif
 };
 
@@ -146,18 +146,21 @@ struct radix_node_head {
 #define R_Zalloc(p, t, n) (p = (t) malloc((unsigned long)(n), M_RTABLE, M_NOWAIT | M_ZERO))
 #define Free(p) free((caddr_t)p, M_RTABLE);
 
+#define	RADIX_NODE_HEAD_READER		struct rm_priotracker tracker
 #define	RADIX_NODE_HEAD_LOCK_INIT(rnh)	\
-    rw_init_flags(&(rnh)->rnh_lock, "radix node head", 0)
-#define	RADIX_NODE_HEAD_LOCK(rnh)	rw_wlock(&(rnh)->rnh_lock)
-#define	RADIX_NODE_HEAD_UNLOCK(rnh)	rw_wunlock(&(rnh)->rnh_lock)
-#define	RADIX_NODE_HEAD_RLOCK(rnh)	rw_rlock(&(rnh)->rnh_lock)
-#define	RADIX_NODE_HEAD_RUNLOCK(rnh)	rw_runlock(&(rnh)->rnh_lock)
-#define	RADIX_NODE_HEAD_LOCK_TRY_UPGRADE(rnh)	rw_try_upgrade(&(rnh)->rnh_lock)
-
-
-#define	RADIX_NODE_HEAD_DESTROY(rnh)	rw_destroy(&(rnh)->rnh_lock)
-#define	RADIX_NODE_HEAD_LOCK_ASSERT(rnh) rw_assert(&(rnh)->rnh_lock, RA_LOCKED)
-#define	RADIX_NODE_HEAD_WLOCK_ASSERT(rnh) rw_assert(&(rnh)->rnh_lock, RA_WLOCKED)
+    rm_init(&(rnh)->rnh_lock, "radix node head")
+#define	RADIX_NODE_HEAD_LOCK(rnh)	rm_wlock(&(rnh)->rnh_lock)
+#define	RADIX_NODE_HEAD_UNLOCK(rnh)	rm_wunlock(&(rnh)->rnh_lock)
+#define	RADIX_NODE_HEAD_RLOCK(rnh)	rm_rlock(&(rnh)->rnh_lock, &tracker)
+#define	RADIX_NODE_HEAD_RUNLOCK(rnh)	rm_runlock(&(rnh)->rnh_lock, &tracker)
+//#define	RADIX_NODE_HEAD_LOCK_TRY_UPGRADE(rnh)	rw_try_upgrade(&(rnh)->rnh_lock)
+
+
+#define	RADIX_NODE_HEAD_DESTROY(rnh)	rm_destroy(&(rnh)->rnh_lock)
+#define	RADIX_NODE_HEAD_LOCK_ASSERT(rnh)
+#define	RADIX_NODE_HEAD_WLOCK_ASSERT(rnh)
+//#define	RADIX_NODE_HEAD_LOCK_ASSERT(rnh) rw_assert(&(rnh)->rnh_lock, RA_LOCKED)
+//#define	RADIX_NODE_HEAD_WLOCK_ASSERT(rnh) rw_assert(&(rnh)->rnh_lock, RA_WLOCKED)
 #endif /* _KERNEL */
 
 void	 rn_init(int);
diff --git a/sys/net/radix_mpath.c b/sys/net/radix_mpath.c
index ee7826f..c69888e 100644
--- a/sys/net/radix_mpath.c
+++ b/sys/net/radix_mpath.c
@@ -45,6 +45,8 @@ __FBSDID("$FreeBSD$");
 #include <sys/socket.h>
 #include <sys/domain.h>
 #include <sys/syslog.h>
+#include <sys/lock.h>
+#include <sys/rmlock.h>
 #include <net/radix.h>
 #include <net/radix_mpath.h>
 #include <net/route.h>
diff --git a/sys/net/route.c b/sys/net/route.c
index 5d56688..2cf6ea5 100644
--- a/sys/net/route.c
+++ b/sys/net/route.c
@@ -52,6 +52,8 @@
 #include <sys/proc.h>
 #include <sys/domain.h>
 #include <sys/kernel.h>
+#include <sys/lock.h>
+#include <sys/rmlock.h>
 
 #include <net/if.h>
 #include <net/if_dl.h>
@@ -544,6 +546,7 @@ rtalloc1_fib_nolock(struct sockaddr *dst, int report, u_long ignflags,
 	struct rtentry *newrt;
 	struct rt_addrinfo info;
 	int err = 0, msgtype = RTM_MISS;
+	RADIX_NODE_HEAD_READER;
 	int needlock;
 
 	KASSERT((fibnum < rt_numfibs), ("rtalloc1_fib: bad fibnum"));
@@ -612,6 +615,7 @@ rtalloc1_fib(struct sockaddr *dst, int report, u_long ignflags,
 	struct rtentry *newrt;
 	struct rt_addrinfo info;
 	int err = 0, msgtype = RTM_MISS;
+	RADIX_NODE_HEAD_READER;
 	int needlock;
 
 	KASSERT((fibnum < rt_numfibs), ("rtalloc1_fib: bad fibnum"));
@@ -799,6 +803,7 @@ rtredirect_fib(struct sockaddr *dst,
 	struct rt_addrinfo info;
 	struct ifaddr *ifa;
 	struct radix_node_head *rnh;
+	RADIX_NODE_HEAD_READER;
 
 	ifa = NULL;
 	rnh = rt_tables_get_rnh(fibnum, dst->sa_family);
diff --git a/sys/net/rtsock.c b/sys/net/rtsock.c
index 58c46a6..18d3e06 100644
--- a/sys/net/rtsock.c
+++ b/sys/net/rtsock.c
@@ -45,6 +45,7 @@
 #include <sys/priv.h>
 #include <sys/proc.h>
 #include <sys/protosw.h>
+#include <sys/rmlock.h>
 #include <sys/rwlock.h>
 #include <sys/signalvar.h>
 #include <sys/socket.h>
@@ -577,6 +578,7 @@ route_output(struct mbuf *m, struct socket *so)
 	struct ifnet *ifp = NULL;
 	union sockaddr_union saun;
 	sa_family_t saf = AF_UNSPEC;
+	RADIX_NODE_HEAD_READER;
 
 #define senderr(e) { error = e; goto flush;}
 	if (m == NULL || ((m->m_len < sizeof(long)) &&
@@ -1818,6 +1820,7 @@ sysctl_rtsock(SYSCTL_HANDLER_ARGS)
 	int	i, lim, error = EINVAL;
 	u_char	af;
 	struct	walkarg w;
+	RADIX_NODE_HEAD_READER;
 
 	name ++;
 	namelen--;
diff --git a/sys/netinet/in_rmx.c b/sys/netinet/in_rmx.c
index 1c9d9db..775ba5a 100644
--- a/sys/netinet/in_rmx.c
+++ b/sys/netinet/in_rmx.c
@@ -53,6 +53,8 @@ __FBSDID("$FreeBSD$");
 #include <sys/callout.h>
 
 #include <net/if.h>
+#include <sys/lock.h>
+#include <sys/rmlock.h>
 #include <net/route.h>
 #include <net/vnet.h>
 
diff --git a/sys/netinet6/in6_ifattach.c b/sys/netinet6/in6_ifattach.c
index 80eb022..cbfe1d8 100644
--- a/sys/netinet6/in6_ifattach.c
+++ b/sys/netinet6/in6_ifattach.c
@@ -42,6 +42,8 @@ __FBSDID("$FreeBSD$");
 #include <sys/proc.h>
 #include <sys/syslog.h>
 #include <sys/md5.h>
+#include <sys/lock.h>
+#include <sys/rmlock.h>
 
 #include <net/if.h>
 #include <net/if_dl.h>
diff --git a/sys/netinet6/in6_rmx.c b/sys/netinet6/in6_rmx.c
index 9aabe63..a291db2 100644
--- a/sys/netinet6/in6_rmx.c
+++ b/sys/netinet6/in6_rmx.c
@@ -84,6 +84,7 @@ __FBSDID("$FreeBSD$");
 #include <sys/socket.h>
 #include <sys/socketvar.h>
 #include <sys/mbuf.h>
+#include <sys/rmlock.h>
 #include <sys/rwlock.h>
 #include <sys/syslog.h>
 #include <sys/callout.h>
diff --git a/sys/netinet6/nd6_rtr.c b/sys/netinet6/nd6_rtr.c
index 687d84d..7737d47 100644
--- a/sys/netinet6/nd6_rtr.c
+++ b/sys/netinet6/nd6_rtr.c
@@ -45,6 +45,7 @@ __FBSDID("$FreeBSD: stable/8/sys/netinet6/nd6_rtr.c 233201 2012-03-19 20:49:42Z
 #include <sys/kernel.h>
 #include <sys/lock.h>
 #include <sys/errno.h>
+#include <sys/rmlock.h>
 #include <sys/rwlock.h>
 #include <sys/syslog.h>
 #include <sys/queue.h>
-------------- next part --------------
commit 963196095589c03880ddd13a5c16f9e50cf6d7ce
Author: Charlie Root <root at test15.yandex.net>
Date:   Sun Nov 4 15:52:50 2012 +0000

    Do not require locking arp lle

diff --git a/sys/net/if_llatbl.h b/sys/net/if_llatbl.h
index 9f6531b..c1b2af9 100644
--- a/sys/net/if_llatbl.h
+++ b/sys/net/if_llatbl.h
@@ -169,6 +169,7 @@ MALLOC_DECLARE(M_LLTABLE);
 #define	LLE_PUB		0x0020	/* publish entry ??? */
 #define	LLE_DELETE	0x4000	/* delete on a lookup - match LLE_IFADDR */
 #define	LLE_CREATE	0x8000	/* create on a lookup miss */
+#define	LLE_UNLOCKED	0x1000	/* return lle unlocked */
 #define	LLE_EXCLUSIVE	0x2000	/* return lle xlocked  */
 
 #define LLATBL_HASH(key, mask) \
diff --git a/sys/netinet/if_ether.c b/sys/netinet/if_ether.c
index f61b803..ecb9b8e 100644
--- a/sys/netinet/if_ether.c
+++ b/sys/netinet/if_ether.c
@@ -283,10 +283,10 @@ arpresolve(struct ifnet *ifp, struct rtentry *rt0, struct mbuf *m,
 	struct sockaddr *dst, u_char *desten, struct llentry **lle)
 {
 	struct llentry *la = 0;
-	u_int flags = 0;
+	u_int flags = LLE_UNLOCKED;
 	struct mbuf *curr = NULL;
 	struct mbuf *next = NULL;
-	int error, renew;
+	int error, renew = 0;
 
 	*lle = NULL;
 	if (m != NULL) {
@@ -307,7 +307,41 @@ arpresolve(struct ifnet *ifp, struct rtentry *rt0, struct mbuf *m,
 retry:
 	IF_AFDATA_RLOCK(ifp);	
 	la = lla_lookup(LLTABLE(ifp), flags, dst);
+
+	/*
+	 * Fast path. Do not require rlock on llentry.
+	 */
+	if ((la != NULL) && (flags & LLE_UNLOCKED)) {
+		if ((la->la_flags & LLE_VALID) &&
+		    ((la->la_flags & LLE_STATIC) || la->la_expire > time_uptime)) {
+			bcopy(&la->ll_addr, desten, ifp->if_addrlen);
+			/*
+			 * If entry has an expiry time and it is approaching,
+			 * see if we need to send an ARP request within this
+			 * arpt_down interval.
+			 */
+			if (!(la->la_flags & LLE_STATIC) &&
+			    time_uptime + la->la_preempt > la->la_expire) {
+				renew = 1;
+				la->la_preempt--;
+			}
+
+			IF_AFDATA_RUNLOCK(ifp);
+			if (renew != 0)
+				arprequest(ifp, NULL, &SIN(dst)->sin_addr, NULL);
+
+			return (0);
+		}
+
+		/* Revert to normal path for other cases */
+		*lle = la;
+		LLE_RLOCK(la);
+	}
+
+	flags &= ~LLE_UNLOCKED;
+
 	IF_AFDATA_RUNLOCK(ifp);	
+
 	if ((la == NULL) && ((flags & LLE_EXCLUSIVE) == 0)
 	    && ((ifp->if_flags & (IFF_NOARP | IFF_STATICARP)) == 0)) {		
 		flags |= (LLE_CREATE | LLE_EXCLUSIVE);
@@ -324,27 +358,6 @@ retry:
 		return (EINVAL);
 	} 
 
-	if ((la->la_flags & LLE_VALID) &&
-	    ((la->la_flags & LLE_STATIC) || la->la_expire > time_second)) {
-		bcopy(&la->ll_addr, desten, ifp->if_addrlen);
-		/*
-		 * If entry has an expiry time and it is approaching,
-		 * see if we need to send an ARP request within this
-		 * arpt_down interval.
-		 */
-		if (!(la->la_flags & LLE_STATIC) &&
-		    time_second + la->la_preempt > la->la_expire) {
-			arprequest(ifp, NULL,
-			    &SIN(dst)->sin_addr, IF_LLADDR(ifp));
-
-			la->la_preempt--;
-		}
-		
-		*lle = la;
-		error = 0;
-		goto done;
-	} 
-			    
 	if (la->la_flags & LLE_STATIC) {   /* should not happen! */
 		log(LOG_DEBUG, "arpresolve: ouch, empty static llinfo for %s\n",
 		    inet_ntoa(SIN(dst)->sin_addr));
diff --git a/sys/netinet/in.c b/sys/netinet/in.c
index eaba4e5..5341918 100644
--- a/sys/netinet/in.c
+++ b/sys/netinet/in.c
@@ -1561,7 +1561,7 @@ in_lltable_lookup(struct lltable *llt, u_int flags, const struct sockaddr *l3add
 	if (LLE_IS_VALID(lle)) {
 		if (flags & LLE_EXCLUSIVE)
 			LLE_WLOCK(lle);
-		else
+		else if (!(flags & LLE_UNLOCKED))
 			LLE_RLOCK(lle);
 	}
 done:

