svn commit: r359823 - in head: etc/mtree include lib/libc/gen sys/conf sys/net sys/net/route sys/netinet sys/netinet6 sys/sys usr.bin/netstat

Alexander V. Chernikov melifaro at FreeBSD.org
Sun Apr 12 14:30:03 UTC 2020


Author: melifaro
Date: Sun Apr 12 14:30:00 2020
New Revision: 359823
URL: https://svnweb.freebsd.org/changeset/base/359823

Log:
  Introduce nexthop objects and new routing KPI.
  
  This is the foundational change for the routing subsystem rearchitecture.
   More details and goals are available in https://reviews.freebsd.org/D24141 .
  
  This patch introduces the concept of nexthop objects and a new nexthop-based
   routing KPI.
  
  Nexthops are objects containing all the information necessary for making
   the packet output decision: the output interface, MTU, flags and gateway
   address. In most cases, these objects will serve the same role that
   struct rtentry currently serves.
  Typically there will be low tens of such objects on a router, even with
   multiple BGP full-views, as these objects are shared between routing
   entries. This sharing makes it possible to store more information in the nexthop.
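
The sharing described above can be illustrated with a small userland sketch.
Everything in it (toy_nhop, toy_route, toy_nhop_get(), toy_route_add()) is a
hypothetical stand-in invented for illustration and is not part of this
commit; it only assumes that routes with identical output data end up
pointing at one reference-counted object.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins; NOT the kernel structures. */
struct toy_nhop {
	uint32_t	gw;		/* gateway address */
	uint16_t	ifindex;	/* output interface index */
	uint16_t	mtu;		/* nexthop mtu */
	unsigned	refcnt;		/* number of routes pointing here */
};

struct toy_route {
	uint32_t	 prefix;
	uint8_t		 plen;
	struct toy_nhop	*nh;		/* shared between routes */
};

#define	TOY_MAX_NHOPS	8
static struct toy_nhop toy_nhops[TOY_MAX_NHOPS];
static size_t toy_nhops_used;

/* Return an existing nexthop with identical data, or create a new one. */
static struct toy_nhop *
toy_nhop_get(uint32_t gw, uint16_t ifindex, uint16_t mtu)
{
	struct toy_nhop *nh;
	size_t i;

	assert(toy_nhops_used <= TOY_MAX_NHOPS);
	for (i = 0; i < toy_nhops_used; i++) {
		nh = &toy_nhops[i];
		if (nh->gw == gw && nh->ifindex == ifindex && nh->mtu == mtu) {
			nh->refcnt++;	/* one more route shares it */
			return (nh);
		}
	}
	if (toy_nhops_used == TOY_MAX_NHOPS)
		return (NULL);
	nh = &toy_nhops[toy_nhops_used++];
	*nh = (struct toy_nhop){ .gw = gw, .ifindex = ifindex, .mtu = mtu,
	    .refcnt = 1 };
	return (nh);
}

/* Attach a (possibly shared) nexthop to a route; 0 on success. */
static int
toy_route_add(struct toy_route *rt, uint32_t prefix, uint8_t plen,
    uint32_t gw, uint16_t ifindex, uint16_t mtu)
{
	struct toy_nhop *nh;

	nh = toy_nhop_get(gw, ifindex, mtu);
	if (nh == NULL)
		return (-1);
	rt->prefix = prefix;
	rt->plen = plen;
	rt->nh = nh;
	return (0);
}
```

Adding many prefixes through the same gateway and interface grows only the
reference count, not the number of toy_nhop objects.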
  
  New KPI:
  
  struct nhop_object *fib4_lookup(uint32_t fibnum, struct in_addr dst,
    uint32_t scopeid, uint32_t flags, uint32_t flowid);
  struct nhop_object *fib6_lookup(uint32_t fibnum, const struct in6_addr *dst6,
    uint32_t scopeid, uint32_t flags, uint32_t flowid);
  
  These two functions are intended to replace all flavours of
   <in_|in6_>rtalloc[1]<_ign><_fib>, the mpath functions and the previous
   fib[46]-generation functions.
  
  Upon successful lookup, they return a nexthop object which is guaranteed to
   exist within the current NET_EPOCH. If a longer lifetime is desired, one can
   specify NHR_REF as a flag and get a referenced version of the nexthop.
   The reference semantics closely resemble those of rtentry, allowing sed-style conversion.
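
A minimal userland model of the two lifetimes described above. The names
toy_lookup(), toy_nhop_free() and the TOY_NHR_* constants are hypothetical and
exist only in this sketch; the flag value 0x02 mirrors NHR_REF from route.h,
and the epoch itself is not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	TOY_NHR_NONE	0x00	/* hypothetical: no flags */
#define	TOY_NHR_REF	0x02	/* hypothetical: return a referenced nexthop */

struct toy_nhop {
	unsigned	refcnt;
	uint16_t	mtu;
};

/* The routing table itself holds one reference while the route exists. */
static struct toy_nhop toy_table_nh = { .refcnt = 1, .mtu = 1500 };

/*
 * Without TOY_NHR_REF the caller borrows the pointer (in the kernel it
 * would be valid only inside the epoch section).  With TOY_NHR_REF the
 * caller owns a reference and must release it with toy_nhop_free().
 */
static struct toy_nhop *
toy_lookup(uint32_t flags)
{
	struct toy_nhop *nh = &toy_table_nh;

	if (flags & TOY_NHR_REF)
		nh->refcnt++;
	return (nh);
}

/* Drop one reference previously obtained with TOY_NHR_REF. */
static void
toy_nhop_free(struct toy_nhop *nh)
{

	assert(nh->refcnt > 0);
	nh->refcnt--;
}
```

A borrowed lookup leaves the count untouched, while a TOY_NHR_REF lookup
must be paired with exactly one toy_nhop_free() call.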
  
  Additionally, two more functions are introduced to support uRPF functionality
   inside our various firewalls. Their primary goal is to hide the multipath
   implementation details inside the routing subsystem, greatly simplifying
   the firewall implementations:
  
  int fib4_lookup_urpf(uint32_t fibnum, struct in_addr dst, uint32_t scopeid,
    uint32_t flags, const struct ifnet *src_if);
  int fib6_lookup_urpf(uint32_t fibnum, const struct in6_addr *dst6, uint32_t scopeid,
    uint32_t flags, const struct ifnet *src_if);
  
  All functions have a separate scopeid argument, paving the way for eliminating IPv6 scope
   embedding and for supporting IPv4 link-locals in the future.
  
  Structure changes:
   * rtentry gets a new 'rt_nhop' pointer, slightly growing the overall size.
   * rib_head gets a new 'rnh_preadd' callback pointer, slightly growing the overall size.
  
  Old KPI:
  During the transition period the old and new KPIs will coexist. As there are another 4-5
   decent-sized conversion patches, it will probably take a couple of weeks.
  To support both KPIs, fields not required by the new KPI (most of rtentry) have to be
   kept, resulting in a temporary size increase.
  Once the conversion is finished, rtentry will shrink notably.
  
  More details:
  * architectural overview: https://reviews.freebsd.org/D24141
  * list of the next changes: https://reviews.freebsd.org/D24232
  
  Reviewed by:	ae,glebius(initial version)
  Differential Revision:	https://reviews.freebsd.org/D24232

Added:
  head/sys/net/route/
  head/sys/net/route/nhop.c   (contents, props changed)
  head/sys/net/route/nhop.h   (contents, props changed)
  head/sys/net/route/nhop_ctl.c   (contents, props changed)
  head/sys/net/route/nhop_utils.c   (contents, props changed)
  head/sys/net/route/nhop_utils.h   (contents, props changed)
  head/sys/net/route/nhop_var.h   (contents, props changed)
  head/sys/net/route/route_ctl.c   (contents, props changed)
  head/sys/net/route/route_helpers.c   (contents, props changed)
  head/sys/net/route/shared.h   (contents, props changed)
  head/usr.bin/netstat/common.c   (contents, props changed)
  head/usr.bin/netstat/common.h   (contents, props changed)
  head/usr.bin/netstat/nhops.c   (contents, props changed)
Modified:
  head/etc/mtree/BSD.include.dist
  head/include/Makefile
  head/lib/libc/gen/sysctl.3
  head/sys/conf/files
  head/sys/net/radix_mpath.c
  head/sys/net/radix_mpath.h
  head/sys/net/route.c
  head/sys/net/route.h
  head/sys/net/route_var.h
  head/sys/net/rtsock.c
  head/sys/netinet/in_fib.c
  head/sys/netinet/in_fib.h
  head/sys/netinet/in_rmx.c
  head/sys/netinet6/in6_fib.c
  head/sys/netinet6/in6_fib.h
  head/sys/netinet6/in6_rmx.c
  head/sys/sys/socket.h
  head/usr.bin/netstat/Makefile
  head/usr.bin/netstat/main.c
  head/usr.bin/netstat/netstat.h
  head/usr.bin/netstat/route.c

Modified: head/etc/mtree/BSD.include.dist
==============================================================================
--- head/etc/mtree/BSD.include.dist	Sun Apr 12 09:31:36 2020	(r359822)
+++ head/etc/mtree/BSD.include.dist	Sun Apr 12 14:30:00 2020	(r359823)
@@ -208,6 +208,8 @@
     net
         altq
         ..
+        route
+        ..
     ..
     net80211
     ..

Modified: head/include/Makefile
==============================================================================
--- head/include/Makefile	Sun Apr 12 09:31:36 2020	(r359822)
+++ head/include/Makefile	Sun Apr 12 14:30:00 2020	(r359823)
@@ -53,6 +53,7 @@ LSUBDIRS=	cam/ata cam/mmc cam/nvme cam/scsi \
 	geom/mirror geom/mountver geom/multipath geom/nop \
 	geom/raid geom/raid3 geom/shsec geom/stripe geom/virstor \
 	net/altq \
+	net/route \
 	netgraph/atm netgraph/netflow \
 	netinet/cc \
 	netinet/netdump \

Modified: head/lib/libc/gen/sysctl.3
==============================================================================
--- head/lib/libc/gen/sysctl.3	Sun Apr 12 09:31:36 2020	(r359822)
+++ head/lib/libc/gen/sysctl.3	Sun Apr 12 14:30:00 2020	(r359823)
@@ -563,6 +563,7 @@ The fifth, sixth, and seventh level names are as follo
 .It Dv NET_RT_IFLIST Ta 0 or if_index Ta None
 .It Dv NET_RT_IFMALIST Ta 0 or if_index Ta None
 .It Dv NET_RT_IFLISTL Ta 0 or if_index Ta None
+.It Dv NET_RT_NHOPS Ta None Ta fib number
 .El
 .Pp
 The
@@ -583,6 +584,9 @@ uses 'l' versions of the message header structures:
 .Va struct if_msghdrl
 and
 .Va struct ifa_msghdrl .
+.Pp
+.Dv NET_RT_NHOPS
+returns all nexthops for the specified address family in the given fib.
 .It Li PF_INET
 Get or set various global information about the IPv4
 (Internet Protocol version 4).

Modified: head/sys/conf/files
==============================================================================
--- head/sys/conf/files	Sun Apr 12 09:31:36 2020	(r359822)
+++ head/sys/conf/files	Sun Apr 12 14:30:00 2020	(r359823)
@@ -4091,6 +4091,11 @@ net/raw_cb.c			standard
 net/raw_usrreq.c		standard
 net/route.c			standard
 net/route_temporal.c		standard
+net/route/nhop.c		standard
+net/route/nhop_ctl.c		standard
+net/route/nhop_utils.c		standard
+net/route/route_ctl.c		standard
+net/route/route_helpers.c	standard
 net/rss_config.c		optional inet rss | inet6 rss
 net/rtsock.c			standard
 net/slcompress.c		optional netgraph_vjc | sppp | \

Modified: head/sys/net/radix_mpath.c
==============================================================================
--- head/sys/net/radix_mpath.c	Sun Apr 12 09:31:36 2020	(r359822)
+++ head/sys/net/radix_mpath.c	Sun Apr 12 14:30:00 2020	(r359823)
@@ -211,7 +211,7 @@ rt_mpath_conflict(struct rib_head *rnh, struct rtentry
 	return (0);
 }
 
-static struct rtentry *
+struct rtentry *
 rt_mpath_selectrte(struct rtentry *rte, uint32_t hash)
 {
 	struct radix_node *rn0, *rn;

Modified: head/sys/net/radix_mpath.h
==============================================================================
--- head/sys/net/radix_mpath.h	Sun Apr 12 09:31:36 2020	(r359822)
+++ head/sys/net/radix_mpath.h	Sun Apr 12 14:30:00 2020	(r359823)
@@ -56,9 +56,26 @@ int rt_mpath_conflict(struct rib_head *, struct rtentr
     struct sockaddr *);
 void rtalloc_mpath_fib(struct route *, u_int32_t, u_int);
 struct rtentry *rt_mpath_select(struct rtentry *, uint32_t);
+struct rtentry *rt_mpath_selectrte(struct rtentry *, uint32_t);
 int rt_mpath_deldup(struct rtentry *, struct rtentry *);
 int	rn4_mpath_inithead(void **, int, u_int);
 int	rn6_mpath_inithead(void **, int, u_int);
+
+static inline struct rtentry *
+rt_mpath_next(struct rtentry *rt)
+{
+	struct radix_node *next, *rn;
+
+	rn = (struct radix_node *)rt;
+
+	if (!rn->rn_dupedkey)
+		return (NULL);
+	next = rn->rn_dupedkey;
+	if (rn->rn_mask == next->rn_mask)
+		return (struct rtentry *)next;
+	else
+		return (NULL);
+}
 
 #endif
 

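The rt_mpath_next() helper added above steps along the rn_dupedkey chain and
stops as soon as the mask changes. A userland sketch of that walk follows;
struct toy_node, toy_mpath_next() and toy_mpath_count() are hypothetical
names used only here, with a plain pointer comparison standing in for the
rn_mask check.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the two radix_node fields used by the helper. */
struct toy_node {
	struct toy_node	*dupedkey;	/* next node with the same key */
	const void	*mask;		/* netmask identity */
};

/* Same logic as rt_mpath_next(): follow the chain only while masks match. */
static struct toy_node *
toy_mpath_next(struct toy_node *rn)
{
	struct toy_node *next;

	assert(rn != NULL);
	next = rn->dupedkey;
	if (next == NULL)
		return (NULL);
	if (rn->mask == next->mask)
		return (next);
	return (NULL);
}

/* Count the paths reachable from the first entry of a multipath group. */
static int
toy_mpath_count(struct toy_node *rn)
{
	int n;

	for (n = 0; rn != NULL; rn = toy_mpath_next(rn))
		n++;
	return (n);
}
```

A chain of three nodes where only the first two share a mask yields a count
of two, because the third node belongs to a different prefix length.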
Modified: head/sys/net/route.c
==============================================================================
--- head/sys/net/route.c	Sun Apr 12 09:31:36 2020	(r359822)
+++ head/sys/net/route.c	Sun Apr 12 14:30:00 2020	(r359823)
@@ -62,6 +62,8 @@
 #include <net/if_dl.h>
 #include <net/route.h>
 #include <net/route_var.h>
+#include <net/route/nhop.h>
+#include <net/route/shared.h>
 #include <net/vnet.h>
 
 #ifdef RADIX_MPATH
@@ -108,10 +110,7 @@ VNET_DEFINE(u_int, rt_add_addr_allfibs) = 1;
 SYSCTL_UINT(_net, OID_AUTO, add_addr_allfibs, CTLFLAG_RWTUN | CTLFLAG_VNET,
     &VNET_NAME(rt_add_addr_allfibs), 0, "");
 
-VNET_PCPUSTAT_DEFINE_STATIC(struct rtstat, rtstat);
-#define	RTSTAT_ADD(name, val)	\
-	VNET_PCPUSTAT_ADD(struct rtstat, rtstat, name, (val))
-#define	RTSTAT_INC(name)	RTSTAT_ADD(name, 1)
+VNET_PCPUSTAT_DEFINE(struct rtstat, rtstat);
 
 VNET_PCPUSTAT_SYSINIT(rtstat);
 #ifdef VIMAGE
@@ -240,6 +239,7 @@ route_init(void)
 		rt_numfibs = RT_MAXFIBS;
 	if (rt_numfibs == 0)
 		rt_numfibs = 1;
+	nhops_init();
 }
 SYSINIT(route_init, SI_SUB_PROTO_DOMAIN, SI_ORDER_THIRD, route_init, NULL);
 
@@ -377,6 +377,8 @@ rt_table_init(int offset, int family, u_int fibnum)
 	/* Init locks */
 	RIB_LOCK_INIT(rh);
 
+	nhops_init_rib(rh);
+
 	/* Finally, set base callbacks */
 	rh->rnh_addaddr = rn_addroute;
 	rh->rnh_deladdr = rn_delete;
@@ -408,6 +410,8 @@ rt_table_destroy(struct rib_head *rh)
 
 	rn_walktree(&rh->rmhead.head, rt_freeentry, &rh->rmhead.head);
 
+	nhops_destroy_rib(rh);
+
 	/* Assume table is already empty */
 	RIB_LOCK_DESTROY(rh);
 	free(rh, M_RTABLE);
@@ -586,6 +590,9 @@ rtfree(struct rtentry *rt)
 		 */
 		R_Free(rt_key(rt));
 
+		/* Unreference nexthop */
+		nhop_free(rt->rt_nhop);
+
 		/*
 		 * and the rtentry itself of course
 		 */
@@ -1400,6 +1407,7 @@ rt_updatemtu(struct ifnet *ifp)
 			RIB_WLOCK(rnh);
 			rnh->rnh_walktree(&rnh->head, if_updatemtu_cb, &ifmtu);
 			RIB_WUNLOCK(rnh);
+			nhops_update_ifmtu(rnh, ifp, ifmtu.mtu);
 		}
 	}
 }
@@ -1544,6 +1552,7 @@ int
 rtrequest1_fib(int req, struct rt_addrinfo *info, struct rtentry **ret_nrt,
 				u_int fibnum)
 {
+	struct epoch_tracker et;
 	const struct sockaddr *dst;
 	struct rib_head *rnh;
 	int error;
@@ -1592,9 +1601,11 @@ rtrequest1_fib(int req, struct rt_addrinfo *info, stru
 		error = add_route(rnh, info, ret_nrt);
 		break;
 	case RTM_CHANGE:
+		NET_EPOCH_ENTER(et);
 		RIB_WLOCK(rnh);
 		error = change_route(rnh, info, ret_nrt);
 		RIB_WUNLOCK(rnh);
+		NET_EPOCH_EXIT(et);
 		break;
 	default:
 		error = EOPNOTSUPP;
@@ -1609,9 +1620,11 @@ add_route(struct rib_head *rnh, struct rt_addrinfo *in
 {
 	struct sockaddr *dst, *ndst, *gateway, *netmask;
 	struct rtentry *rt, *rt_old;
+	struct nhop_object *nh;
 	struct radix_node *rn;
 	struct ifaddr *ifa;
 	int error, flags;
+	struct epoch_tracker et;
 
 	dst = info->rti_info[RTAX_DST];
 	gateway = info->rti_info[RTAX_GATEWAY];
@@ -1631,18 +1644,30 @@ add_route(struct rib_head *rnh, struct rt_addrinfo *in
 	} else {
 		ifa_ref(info->rti_ifa);
 	}
+
+	NET_EPOCH_ENTER(et);
+	error = nhop_create_from_info(rnh, info, &nh);
+	NET_EPOCH_EXIT(et);
+	if (error != 0) {
+		ifa_free(info->rti_ifa);
+		return (error);
+	}
+
 	rt = uma_zalloc(V_rtzone, M_NOWAIT);
 	if (rt == NULL) {
 		ifa_free(info->rti_ifa);
+		nhop_free(nh);
 		return (ENOBUFS);
 	}
 	rt->rt_flags = RTF_UP | flags;
 	rt->rt_fibnum = rnh->rib_fibnum;
+	rt->rt_nhop = nh;
 	/*
 	 * Add the gateway. Possibly re-malloc-ing the storage for it.
 	 */
 	if ((error = rt_setgate(rt, dst, gateway)) != 0) {
 		ifa_free(info->rti_ifa);
+		nhop_free(nh);
 		uma_zfree(V_rtzone, rt);
 		return (error);
 	}
@@ -1682,6 +1707,7 @@ add_route(struct rib_head *rnh, struct rt_addrinfo *in
 
 		ifa_free(rt->rt_ifa);
 		R_Free(rt_key(rt));
+		nhop_free(nh);
 		uma_zfree(V_rtzone, rt);
 		return (EEXIST);
 	}
@@ -1723,6 +1749,7 @@ add_route(struct rib_head *rnh, struct rt_addrinfo *in
 	if (rn == NULL) {
 		ifa_free(rt->rt_ifa);
 		R_Free(rt_key(rt));
+		nhop_free(nh);
 		uma_zfree(V_rtzone, rt);
 		return (EEXIST);
 	} 
@@ -1802,6 +1829,7 @@ change_route(struct rib_head *rnh, struct rt_addrinfo 
 	int error = 0;
 	int free_ifa = 0;
 	int family, mtu;
+	struct nhop_object *nh;
 	struct if_mtuinfo ifmtu;
 
 	RIB_WLOCK_ASSERT(rnh);
@@ -1824,6 +1852,7 @@ change_route(struct rib_head *rnh, struct rt_addrinfo 
 	}
 #endif
 
+	nh = NULL;
 	RT_LOCK(rt);
 
 	rt_setmetrics(info, rt);
@@ -1852,6 +1881,10 @@ change_route(struct rib_head *rnh, struct rt_addrinfo 
 			goto bad;
 	}
 
+	error = nhop_create_from_nhop(rnh, rt->rt_nhop, info, &nh);
+	if (error != 0)
+		goto bad;
+
 	/* Check if outgoing interface has changed */
 	if (info->rti_ifa != NULL && info->rti_ifa != rt->rt_ifa &&
 	    rt->rt_ifa != NULL) {
@@ -1897,6 +1930,11 @@ change_route(struct rib_head *rnh, struct rt_addrinfo 
 		}
 	}
 
+	/* Update nexthop */
+	nhop_free(rt->rt_nhop);
+	rt->rt_nhop = nh;
+	nh = NULL;
+
 	/*
 	 * This route change may have modified the route's gateway.  In that
 	 * case, any inpcbs that have cached this route need to invalidate their
@@ -1910,6 +1948,8 @@ change_route(struct rib_head *rnh, struct rt_addrinfo 
 	}
 bad:
 	RT_UNLOCK(rt);
+	if (nh != NULL)
+		nhop_free(nh);
 	if (free_ifa != 0) {
 		ifa_free(info->rti_ifa);
 		info->rti_ifa = NULL;

Modified: head/sys/net/route.h
==============================================================================
--- head/sys/net/route.h	Sun Apr 12 09:31:36 2020	(r359822)
+++ head/sys/net/route.h	Sun Apr 12 14:30:00 2020	(r359823)
@@ -90,7 +90,8 @@ struct rt_metrics {
 	u_long	rmx_rttvar;	/* estimated rtt variance */
 	u_long	rmx_pksent;	/* packets sent using this route */
 	u_long	rmx_weight;	/* route weight */
-	u_long	rmx_filler[3];	/* will be used for T/TCP later */
+	u_long	rmx_nhidx;	/* route nexthop index */
+	u_long	rmx_filler[2];	/* will be used for T/TCP later */
 };
 
 /*
@@ -150,6 +151,7 @@ struct rtentry {
 	struct	sockaddr *rt_gateway;	/* value */
 	struct	ifnet *rt_ifp;		/* the answer: interface to use */
 	struct	ifaddr *rt_ifa;		/* the answer: interface address to use */
+	struct nhop_object	*rt_nhop;	/* nexthop data */
 	int		rt_flags;	/* up/down?, host/net */
 	int		rt_refcnt;	/* # held references */
 	u_int		rt_fibnum;	/* which FIB */
@@ -215,9 +217,13 @@ struct rtentry {
 #define	NHF_HOST		0x0400	/* RTF_HOST */
 
 /* Nexthop request flags */
+#define	NHR_NONE		0x00	/* empty flags field */
 #define	NHR_IFAIF		0x01	/* Return ifa_ifp interface */
 #define	NHR_REF			0x02	/* For future use */
 
+/* uRPF */
+#define	NHR_NODEFAULT		0x04	/* do not consider default route */
+
 /* Control plane route request flags */
 #define	NHR_COPY		0x100	/* Copy rte data */
 
@@ -245,6 +251,8 @@ struct rtstat {
 	uint64_t rts_newgateway;	/* routes modified by redirects */
 	uint64_t rts_unreach;		/* lookups which failed */
 	uint64_t rts_wildcard;		/* lookups satisfied by a wildcard */
+	uint64_t rts_nh_idx_alloc_failure;	/* nexthop index alloc failure */
+	uint64_t rts_nh_alloc_failure;	/* nexthop allocation failure */
 };
 
 /*
@@ -507,6 +515,8 @@ int	rib_add_redirect(u_int fibnum, struct sockaddr *ds
 	   struct sockaddr *gateway, struct sockaddr *author, struct ifnet *ifp,
 	   int flags, int expire_sec);
 
+/* New API */
+void	rib_walk(int af, u_int fibnum, rt_walktree_f_t *wa_f, void *arg);
 #endif
 
 #endif

Added: head/sys/net/route/nhop.c
==============================================================================
--- /dev/null	00:00:00 1970	(empty, because file is newly added)
+++ head/sys/net/route/nhop.c	Sun Apr 12 14:30:00 2020	(r359823)
@@ -0,0 +1,388 @@
+/*-
+ * SPDX-License-Identifier: BSD-2-Clause-FreeBSD
+ *
+ * Copyright (c) 2020 Alexander V. Chernikov
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+#include <sys/cdefs.h>
+__FBSDID("$FreeBSD$");
+#include "opt_inet.h"
+#include "opt_route.h"
+
+#include <sys/param.h>
+#include <sys/systm.h>
+#include <sys/lock.h>
+#include <sys/rwlock.h>
+#include <sys/malloc.h>
+#include <sys/mbuf.h>
+#include <sys/socket.h>
+#include <sys/kernel.h>
+
+#include <net/if.h>
+#include <net/if_var.h>
+#include <net/route.h>
+#include <net/route_var.h>
+#include <net/route/nhop_utils.h>
+#include <net/route/nhop.h>
+#include <net/route/nhop_var.h>
+#include <net/route/shared.h>
+#include <net/vnet.h>
+
+/*
+ * This file contains data structures management logic for the nexthop ("nhop")
+ *   route subsystem.
+ *
+ * Nexthops in the original sense are the objects containing all the necessary
+ * information to forward the packet to the selected destination.
+ * In particular, nexthop is defined by a combination of
+ *  ifp, ifa, aifp, mtu, gw addr(if set), nh_type, nh_family, mask of rt_flags and
+ *    NHF_DEFAULT
+ *
+ * All nexthops are stored in the resizable hash table.
+ * Additionally, each nexthop gets assigned its unique index (nexthop index)
+ * so userland programs can interact with the nexthops easier. Index allocation
+ * is backed by the bitmask array.
+ */
+
+static MALLOC_DEFINE(M_NHOP, "nhops", "nexthops data");
+
+
+/* Hash management functions */
+
+int
+nhops_init_rib(struct rib_head *rh)
+{
+	struct nh_control *ctl;
+	size_t alloc_size;
+	uint32_t num_buckets, num_items;
+	void *ptr;
+
+	ctl = malloc(sizeof(struct nh_control), M_NHOP, M_WAITOK | M_ZERO);
+
+	/*
+	 * Allocate nexthop hash. Start with 16 items by default (128 bytes).
+	 * This will be enough for most of the cases.
+	 */
+	num_buckets = 16;
+	alloc_size = CHT_SLIST_GET_RESIZE_SIZE(num_buckets);
+	ptr = malloc(alloc_size, M_NHOP, M_WAITOK | M_ZERO);
+	CHT_SLIST_INIT(&ctl->nh_head, ptr, num_buckets);
+
+	/*
+	 * Allocate nexthop index bitmask.
+	 */
+	num_items = 128 * 8; /* 128 bytes */
+	ptr = malloc(bitmask_get_size(num_items), M_NHOP, M_WAITOK | M_ZERO);
+	bitmask_init(&ctl->nh_idx_head, ptr, num_items);
+
+	NHOPS_LOCK_INIT(ctl);
+
+	rh->nh_control = ctl;
+	ctl->ctl_rh = rh;
+
+	DPRINTF("NHOPS init for fib %u af %u: ctl %p rh %p", rh->rib_fibnum,
+	    rh->rib_family, ctl, rh);
+
+	return (0);
+}
+
+static void
+destroy_ctl(struct nh_control *ctl)
+{
+
+	NHOPS_LOCK_DESTROY(ctl);
+	free(ctl->nh_head.ptr, M_NHOP);
+	free(ctl->nh_idx_head.idx, M_NHOP);
+	free(ctl, M_NHOP);
+}
+
+/*
+ * Epoch callback indicating ctl is safe to destroy
+ */
+static void
+destroy_ctl_epoch(epoch_context_t ctx)
+{
+	struct nh_control *ctl;
+
+	ctl = __containerof(ctx, struct nh_control, ctl_epoch_ctx);
+
+	destroy_ctl(ctl);
+}
+
+void
+nhops_destroy_rib(struct rib_head *rh)
+{
+	struct nh_control *ctl;
+	struct nhop_priv *nh_priv;
+
+	ctl = rh->nh_control;
+
+	/*
+	 * All routes should have been deleted in rt_table_destroy().
+	 * However, TCP stack or other consumers may store referenced
+	 *  nexthop pointers. When these references go to zero,
+	 *  nhop_free() will try to unlink these records from the
+	 *  datastructures, most likely leading to panic.
+	 *
+	 * Avoid that by explicitly marking all of the remaining
+	 *  nexthops as unlinked by removing a reference from a special
+	 *  counter. Please see nhop_free() comments for more
+	 *  details.
+	 */
+
+	NHOPS_WLOCK(ctl);
+	CHT_SLIST_FOREACH(&ctl->nh_head, nhops, nh_priv) {
+		DPRINTF("Marking nhop %u unlinked", nh_priv->nh_idx);
+		refcount_release(&nh_priv->nh_linked);
+	} CHT_SLIST_FOREACH_END;
+	NHOPS_WUNLOCK(ctl);
+
+	/*
+	 * Postpone destruction till the end of current epoch
+	 * so nhop_free() can safely use nh_control pointer.
+	 */
+	epoch_call(net_epoch_preempt, destroy_ctl_epoch,
+	    &ctl->ctl_epoch_ctx);
+}
+
+/*
+ * Nexthop hash calculation:
+ *
+ * Nexthops distribution:
+ * 2 "mandatory" nexthops per interface ("interface route", "loopback").
+ * For direct peering: 1 nexthop for the peering router per ifp/af.
+ * For IX-like peering: tens to hundreds of neighbor nexthops per ifp/af.
+ * IGP control plane & broadcast segment: tens of nexthops per ifp/af.
+ *
+ * Each fib/af combination has its own hash table.
+ * With that in mind, hash nexthops by the combination of the interface
+ *  and GW IP address.
+ *
+ * To optimize hash calculation, ignore higher bytes of ifindex, as they
+ *  give very little entropy.
+ * Similarly, use lower 4 bytes of IPv6 address to distinguish between the
+ *  neighbors.
+ */
+struct _hash_data {
+	uint16_t	ifindex;
+	uint8_t		family;
+	uint8_t		nh_type;
+	uint32_t	gw_addr;
+};
+
+static unsigned
+djb_hash(const unsigned char *h, const int len)
+{
+	unsigned int result = 0;
+	int i;
+
+	for (i = 0; i < len; i++)
+		result = 33 * result ^ h[i];
+
+	return (result);
+}
+
+static uint32_t
+hash_priv(const struct nhop_priv *priv)
+{
+	struct nhop_object *nh;
+	uint16_t ifindex;
+	struct _hash_data key;
+
+	nh = priv->nh;
+	ifindex = nh->nh_ifp->if_index & 0xFFFF;
+	memset(&key, 0, sizeof(key));
+
+	key.ifindex = ifindex;
+	key.family = nh->gw_sa.sa_family;
+	key.nh_type = priv->nh_type & 0xFF;
+	if (nh->gw_sa.sa_family == AF_INET6)
+		memcpy(&key.gw_addr, &nh->gw6_sa.sin6_addr.s6_addr32[3], 4);
+	else if (nh->gw_sa.sa_family == AF_INET)
+		memcpy(&key.gw_addr, &nh->gw4_sa.sin_addr, 4);
+
+	return (uint32_t)(djb_hash((const unsigned char *)&key, sizeof(key)));
+}
+
+/*
+ * Checks if hash needs resizing and performs this resize if necessary
+ *
+ */
+static void
+consider_resize(struct nh_control *ctl, uint32_t new_nh_buckets, uint32_t new_idx_items)
+{
+	void *nh_ptr, *nh_idx_ptr;
+	void *old_idx_ptr;
+	size_t alloc_size;
+
+	nh_ptr = NULL;
+	if (new_nh_buckets != 0) {
+		alloc_size = CHT_SLIST_GET_RESIZE_SIZE(new_nh_buckets);
+		nh_ptr = malloc(alloc_size, M_NHOP, M_NOWAIT | M_ZERO);
+	}
+
+	nh_idx_ptr = NULL;
+	if (new_idx_items != 0) {
+		alloc_size = bitmask_get_size(new_idx_items);
+		nh_idx_ptr = malloc(alloc_size, M_NHOP, M_NOWAIT | M_ZERO);
+	}
+
+	if (nh_ptr == NULL && nh_idx_ptr == NULL) {
+		/* Either resize is not required or allocations have failed. */
+		return;
+	}
+
+	DPRINTF("going to resize: nh:[ptr:%p sz:%u] idx:[ptr:%p sz:%u]", nh_ptr,
+	    new_nh_buckets, nh_idx_ptr, new_idx_items);
+
+	old_idx_ptr = NULL;
+
+	NHOPS_WLOCK(ctl);
+	if (nh_ptr != NULL) {
+		CHT_SLIST_RESIZE(&ctl->nh_head, nhops, nh_ptr, new_nh_buckets);
+	}
+	if (nh_idx_ptr != NULL) {
+		if (bitmask_copy(&ctl->nh_idx_head, nh_idx_ptr, new_idx_items) == 0)
+			bitmask_swap(&ctl->nh_idx_head, nh_idx_ptr, new_idx_items, &old_idx_ptr);
+	}
+	NHOPS_WUNLOCK(ctl);
+
+	if (nh_ptr != NULL)
+		free(nh_ptr, M_NHOP);
+	if (old_idx_ptr != NULL)
+		free(old_idx_ptr, M_NHOP);
+}
+
+/*
+ * Links nexthop @nh_priv to the nexthop hash table and allocates
+ *  nexthop index.
+ * Returns allocated index or 0 on failure.
+ */
+int
+link_nhop(struct nh_control *ctl, struct nhop_priv *nh_priv)
+{
+	uint16_t idx;
+	uint32_t num_buckets_new, num_items_new;
+
+	KASSERT((nh_priv->nh_idx == 0), ("nhop index is already allocated"));
+	NHOPS_WLOCK(ctl);
+
+	/*
+	 * Check if we need to resize hash and index.
+	 * The following 2 functions return either the new size or 0
+	 *  if resize is not required.
+	 */
+	num_buckets_new = CHT_SLIST_GET_RESIZE_BUCKETS(&ctl->nh_head);
+	num_items_new = bitmask_get_resize_items(&ctl->nh_idx_head);
+
+	if (bitmask_alloc_idx(&ctl->nh_idx_head, &idx) != 0) {
+		NHOPS_WUNLOCK(ctl);
+		DPRINTF("Unable to allocate nhop index");
+		RTSTAT_INC(rts_nh_idx_alloc_failure);
+		consider_resize(ctl, num_buckets_new, num_items_new);
+		return (0);
+	}
+
+	nh_priv->nh_idx = idx;
+	nh_priv->nh_control = ctl;
+
+	CHT_SLIST_INSERT_HEAD(&ctl->nh_head, nhops, nh_priv);
+
+	NHOPS_WUNLOCK(ctl);
+
+	DPRINTF("Linked nhop priv %p to %d, hash %u, ctl %p", nh_priv, idx,
+	    hash_priv(nh_priv), ctl);
+	consider_resize(ctl, num_buckets_new, num_items_new);
+
+	return (idx);
+}
+
+/*
+ * Unlinks nexthop specified by @nh_priv data from the hash.
+ *
+ * Returns found nexthop or NULL.
+ */
+struct nhop_priv *
+unlink_nhop(struct nh_control *ctl, struct nhop_priv *nh_priv_del)
+{
+	struct nhop_priv *priv_ret;
+	int idx;
+	uint32_t num_buckets_new, num_items_new;
+
+	idx = 0;
+
+	NHOPS_WLOCK(ctl);
+	CHT_SLIST_REMOVE_BYOBJ(&ctl->nh_head, nhops, nh_priv_del, priv_ret);
+
+	if (priv_ret != NULL) {
+		idx = priv_ret->nh_idx;
+		priv_ret->nh_idx = 0;
+
+		KASSERT((idx != 0), ("bogus nhop index 0"));
+		if ((bitmask_free_idx(&ctl->nh_idx_head, idx)) != 0) {
+			DPRINTF("Unable to remove index %d from fib %u af %d",
+			    idx, ctl->ctl_rh->rib_fibnum,
+			    ctl->ctl_rh->rib_family);
+		}
+	}
+
+	/* Check if hash or index needs to be resized */
+	num_buckets_new = CHT_SLIST_GET_RESIZE_BUCKETS(&ctl->nh_head);
+	num_items_new = bitmask_get_resize_items(&ctl->nh_idx_head);
+
+	NHOPS_WUNLOCK(ctl);
+
+	if (priv_ret == NULL)
+		DPRINTF("Unable to unlink nhop priv %p from hash, hash %u ctl %p",
+		    nh_priv_del, hash_priv(nh_priv_del), ctl);
+	else
+		DPRINTF("Unlinked nhop %p priv idx %d", priv_ret, idx);
+
+	consider_resize(ctl, num_buckets_new, num_items_new);
+
+	return (priv_ret);
+}
+
+/*
+ * Searches for the nexthop by data specified in @nh_priv.
+ * Returns referenced nexthop or NULL.
+ */
+struct nhop_priv *
+find_nhop(struct nh_control *ctl, const struct nhop_priv *nh_priv)
+{
+	struct nhop_priv *nh_priv_ret;
+
+	NHOPS_RLOCK(ctl);
+	CHT_SLIST_FIND_BYOBJ(&ctl->nh_head, nhops, nh_priv, nh_priv_ret);
+	if (nh_priv_ret != NULL) {
+		if (refcount_acquire_if_not_zero(&nh_priv_ret->nh_refcnt) == 0){
+			/* refcount was 0 -> nhop is being deleted */
+			nh_priv_ret = NULL;
+		}
+	}
+	NHOPS_RUNLOCK(ctl);
+
+	return (nh_priv_ret);
+}
+
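
The hash key construction in hash_priv() above can be tried out in userland.
The sketch below copies the djb_hash() loop and the key layout from nhop.c;
the toy_ prefixed names and the toy_hash_key() wrapper are hypothetical and
exist only in this example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Same layout as struct _hash_data in nhop.c: 8 bytes, no padding. */
struct toy_hash_data {
	uint16_t	ifindex;
	uint8_t		family;
	uint8_t		nh_type;
	uint32_t	gw_addr;
};

/* Same loop as djb_hash() in nhop.c: multiply by 33, then XOR the byte. */
static unsigned
toy_djb_hash(const unsigned char *h, const int len)
{
	unsigned int result = 0;
	int i;

	for (i = 0; i < len; i++)
		result = 33 * result ^ h[i];

	return (result);
}

/* Build a zeroed key from the selected fields and hash its raw bytes. */
static uint32_t
toy_hash_key(uint16_t ifindex, uint8_t family, uint8_t nh_type,
    uint32_t gw_addr)
{
	struct toy_hash_data key;

	memset(&key, 0, sizeof(key));
	key.ifindex = ifindex;
	key.family = family;
	key.nh_type = nh_type;
	key.gw_addr = gw_addr;

	return ((uint32_t)toy_djb_hash((const unsigned char *)&key,
	    (int)sizeof(key)));
}
```

Zeroing the key before filling it in keeps the hashed bytes deterministic,
so identical nexthop data always lands in the same bucket.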

Added: head/sys/net/route/nhop.h
==============================================================================
--- /dev/null	00:00:00 1970	(empty, because file is newly added)
+++ head/sys/net/route/nhop.h	Sun Apr 12 14:30:00 2020	(r359823)
@@ -0,0 +1,229 @@
+/*-
+ * SPDX-License-Identifier: BSD-2-Clause-FreeBSD
+ *
+ * Copyright (c) 2020 Alexander V. Chernikov
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ *
+ * $FreeBSD$
+ */
+
+/*
+ * This header file contains public definitions for the nexthop routing subsystem.
+ */
+
+#ifndef	_NET_ROUTE_NHOP_H_
+#define	_NET_ROUTE_NHOP_H_
+
+#include <netinet/in.h>			/* sockaddr_in && sockaddr_in6 */
+
+#include <sys/counter.h>
+
+enum nhop_type {
+	NH_TYPE_IPV4_ETHER_RSLV = 1,	/* IPv4 ethernet without GW */
+	NH_TYPE_IPV4_ETHER_NHOP = 2,	/* IPv4 with pre-calculated ethernet encap */
+	NH_TYPE_IPV6_ETHER_RSLV = 3,	/* IPv6 ethernet, without GW */
+	NH_TYPE_IPV6_ETHER_NHOP = 4	/* IPv6 with pre-calculated ethernet encap*/
+};
+
+#ifdef _KERNEL
+
+/*
+ * Define shorter version of AF_LINK sockaddr.
+ *
+ * Currently the only use case of AF_LINK gateway is storing
+ * interface index of the interface of the source IPv6 address.
+ * This is used by the IPv6 code for the connections over loopback
+ * interface.
+ *
+ * The structure below copies 'struct sockaddr_dl', reducing the
+ * size of sdl_data buffer, as it is not used. This change
+ * allows to store the AF_LINK gateways in the nhop gateway itself,
+ * simplifying control plane handling.
+ */
+struct sockaddr_dl_short {
+	u_char	sdl_len;	/* Total length of sockaddr */
+	u_char	sdl_family;	/* AF_LINK */
+	u_short	sdl_index;	/* if != 0, system given index for interface */
+	u_char	sdl_type;	/* interface type */
+	u_char	sdl_nlen;	/* interface name length, no trailing 0 reqd. */
+	u_char	sdl_alen;	/* link level address length */
+	u_char	sdl_slen;	/* link layer selector length */
+	char	sdl_data[8];	/* unused */
+};
+
+#define	NHOP_RELATED_FLAGS	\
+	(RTF_GATEWAY | RTF_HOST | RTF_REJECT | RTF_BLACKHOLE | \
+	 RTF_FIXEDMTU | RTF_LOCAL | RTF_BROADCAST | RTF_MULTICAST)
+
+struct nh_control;
+struct nhop_priv;
+
+/*
+ * Struct 'nhop_object' field description:
+ *
+ * nh_flags: NHF_ flags used in the dataplane code. NHF_GATEWAY or NHF_BLACKHOLE
+ *   can be examples of such flags.
+ * nh_mtu: ready-to-use nexthop mtu. Already accounts for the link-level header,
+ *   interface MTU and protocol-specific limitations.
+ * nh_prepend_len: link-level prepend length. Currently unused.
+ * nh_ifp: logical transmit interface. The one from which if_transmit() will be
+ *   called. Guaranteed to be non-NULL.
+ * nh_aifp: ifnet of the source address. Same as nh_ifp except IPv6 loopback
+ *   routes. See the example below.
+ * nh_ifa: interface address to use. Guaranteed to be non-NULL. 
+ * nh_pksent: counter(9) reflecting the number of packets transmitted.
+ *
+ * gw_: storage suitable to hold AF_INET, AF_INET6 or AF_LINK gateway. More
+ *   details are available in the examples below.
+ *
+ * Examples:
+ *
+ * Direct routes (routes w/o gateway):
+ *  NHF_GATEWAY is NOT set.
+ *  nh_ifp denotes the logical transmit interface ().
+ *  nh_aifp is the same as nh_ifp
+ *  gw_sa contains AF_LINK sa with nh_aifp ifindex (compat)
+ * Loopback routes:
+ *  NHF_GATEWAY is NOT set.
+ *  nh_ifp points to the loopback interface (lo0).
+ *  nh_aifp points to the interface where the destination address belongs to.
+ *    This is useful in IPv6 link-local-over-loopback communications.
+ *  gw_sa contains AF_LINK sa with nh_aifp ifindex (compat)
+ * GW routes:
+ *  NHF_GATEWAY is set.
+ *  nh_ifp denotes the logical transmit interface.
+ *  nh_aifp is the same as nh_ifp
+ *  gw_sa contains L3 address (either AF_INET or AF_INET6).
+ *
+ *
+ * Note: struct nhop_object fields are ordered in a way that
+ *  supports memcmp-based comparisons.
+ *
+ */
+#define	NHOP_END_CMP	(__offsetof(struct nhop_object, nh_pksent))
+
+struct nhop_object {
+	uint16_t		nh_flags;	/* nhop flags */
+	uint16_t		nh_mtu;		/* nexthop mtu */
+	union {
+		struct sockaddr_in		gw4_sa;	/* GW accessor as IPv4 */
+		struct sockaddr_in6		gw6_sa; /* GW accessor as IPv6 */
+		struct sockaddr			gw_sa;
+		struct sockaddr_dl_short	gwl_sa; /* AF_LINK gw (compat) */
+		char				gw_buf[28];
+	};
+	struct ifnet		*nh_ifp;	/* Logical egress interface. Always != NULL */
+	struct ifaddr		*nh_ifa;	/* interface address to use. Always != NULL */
+	struct ifnet		*nh_aifp;	/* ifnet of the source address. Always != NULL */
+	counter_u64_t		nh_pksent;	/* packets sent using this nhop */
+	/* 32 bytes + 4xPTR == 64(amd64) / 48(i386)  */
+	uint8_t			nh_prepend_len;	/* length of prepend data */
+	uint8_t			spare[3];
+	uint32_t		spare1;		/* alignment */
+	char			nh_prepend[48];	/* L2 prepend */
+	struct nhop_priv	*nh_priv;	/* control plane data */
+	/* -- 128 bytes -- */
+};
+
+/*
+ * Nhop validity.
+ *
+ * Currently we verify whether the link is up on every packet, which can be
+ *   quite costly.
+ * TODO: subscribe to the interface notifications and update the nexthops
+ *  with NHF_INVALID flag.
+ */
+
+#define	NH_IS_VALID(_nh)	RT_LINK_IS_UP((_nh)->nh_ifp)
+#define	NH_IS_MULTIPATH(_nh)	((_nh)->nh_flags & NHF_MULTIPATH)
+
+#define	RT_GATEWAY(_rt)		((struct sockaddr *)&(_rt)->rt_nhop->gw4_sa)
+#define	RT_GATEWAY_CONST(_rt)	((const struct sockaddr *)&(_rt)->rt_nhop->gw4_sa)
+
+#define	NH_FREE(_nh) do {					\
+	nhop_free(_nh);	\
+	/* guard against invalid refs */			\
+	_nh = NULL;						\
+} while (0)
+
+
+void nhop_free(struct nhop_object *nh);
+
+struct sysctl_req;
+struct sockaddr_dl;
+struct rib_head;
+
+uint32_t nhop_get_idx(const struct nhop_object *nh);
+enum nhop_type nhop_get_type(const struct nhop_object *nh);
+int nhop_get_rtflags(const struct nhop_object *nh);
+
+int nhops_dump_sysctl(struct rib_head *rh, struct sysctl_req *w);
+
+#endif /* _KERNEL */
+
+/* Kernel <> userland structures */
+
+/* Structure usage and layout are described in dump_nhop_entry() */
+struct nhop_external {
+	uint32_t	nh_len;		/* length of the datastructure */
+	uint32_t	nh_idx;		/* Nexthop index */
+	uint32_t	nh_fib;		/* Fib nexthop is attached to */
+	uint32_t	ifindex;	/* transmit interface ifindex */
+	uint32_t	aifindex;	/* address ifindex */
+	uint8_t		prepend_len;	/* length of the prepend */
+	uint8_t		nh_family;	/* address family */
+	uint16_t	nh_type;	/* nexthop type */
+	uint16_t	nh_mtu;		/* nexthop mtu */
+
+	uint16_t	nh_flags;	/* nhop flags */
+	struct in_addr	nh_addr;	/* GW/DST IPv4 address */
+	struct in_addr	nh_src;		/* default source IPv4 address */
+	uint64_t	nh_pksent;
+	/* control plane */
+	/* lookup key: address, family, type */
+	char		nh_prepend[64];	/* L2 prepend */
+	uint64_t	nh_refcount;	/* number of references */
+};
+
+struct nhop_addrs {
+	uint32_t	na_len;		/* length of the datastructure */
+	uint16_t	gw_sa_off;	/* offset of gateway SA */
+	uint16_t	src_sa_off;	/* offset of src address SA */
+};
+
+struct mpath_nhop_external {
+	uint32_t	nh_idx;
+	uint32_t	nh_weight;
+};
+
+struct mpath_external {
+	uint32_t	mp_idx;
+	uint32_t	mp_refcount;
+	uint32_t	mp_nh_count;
+	uint32_t	mp_group_size;
+};
+
+
+#endif
+
+

Added: head/sys/net/route/nhop_ctl.c
==============================================================================
--- /dev/null	00:00:00 1970	(empty, because file is newly added)
+++ head/sys/net/route/nhop_ctl.c	Sun Apr 12 14:30:00 2020	(r359823)
@@ -0,0 +1,827 @@
+/*-
+ * SPDX-License-Identifier: BSD-2-Clause-FreeBSD
+ *
+ * Copyright (c) 2020 Alexander V. Chernikov
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+#include <sys/cdefs.h>
+__FBSDID("$FreeBSD$");
+#include "opt_inet.h"
+#include "opt_route.h"
+
+#include <sys/param.h>

*** DIFF OUTPUT TRUNCATED AT 1000 LINES ***

