resend: multiple routing table roadmap (format fix)
gnn at freebsd.org
Fri Dec 28 06:56:35 PST 2007
At Wed, 26 Dec 2007 16:26:11 -0800,
> Resending as my mailer made a dog's breakfast of the first one
> with all sorts of weird line breaks... hopefully this will be better.
> (I haven't sent it yet so I'm hoping)..
> One thing where FreeBSD has been falling behind, and which by chance
> I have some time to work on is "policy based routing", which allows
> different packet streams to be routed by more than just the
> destination address.
> I want to make some form of this available in the 6.x tree
> (and by extension 7.x), but FreeBSD in general needs it so I might as
> well do it in -current and back port the portions I need.
> One of the ways that this can be done is to have the ability to
> instantiate multiple kernel routing tables (which I will now
> refer to as "Forwarding Information Bases" or "FIBs" for political
> correctness reasons). Which FIB a particular packet uses to make
> the next hop decision can be decided by a number of mechanisms.
> The policies these mechanisms implement are the "Policies" referred
> to in "Policy based routing".
> One of the constraints I have if I try to back port this work to
> 6.x is that it must be implemented as an EXTENSION to the existing
> ABIs in 6.x so that third party applications do not need to be
> recompiled in the timespan of the branch.
> Implementation method, (part 1)
> For this reason I have implemented a "sufficient subset" of a
> multiple routing table solution in Perforce, and back-ported it
> to 6.x. (also in Perforce though not yet caught up with what I
> have done in -current/P4). The subset allows a number of FIBs
> to be defined at compile time (sufficient for my purposes in 6.x) and
> implements the changes needed to allow IPV4 to use them. I have not done
> the changes for ipv6 simply because I do not need it, and I do not
> have enough knowledge of ipv6 (e.g. neighbor discovery) needed to do it.
> Other protocol families are left untouched and should there be
> users with proprietary protocol families, they should continue to work
> and be oblivious to the existence of the extra FIBs.
> To understand how this is done, one must know that the current FIB
> code starts everything off with a single dimensional array of
> pointers to FIB head structures (One per protocol family), each of
> which in turn points to the trie of routes available to that family.
> The basic change in the ABI compatible version of the change is to
> extend that array to be a 2 dimensional array, so that
> instead of protocol family X looking at rt_tables[X] for the
> table it needs, it looks at rt_tables[Y][X], where for all
> protocol families except ipv4 Y is always 0.
> Code that is unaware of the change always just sees the first row
> of the table, which of course looks just like the one dimensional
> array that existed before.
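The layout change described above can be sketched roughly as follows. This is only an illustration under assumptions: the SKETCH_* sizes, struct fib_head, and the two helper functions are hypothetical names, not taken from the Perforce branch; only rt_tables comes from the text.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sizes for illustration; the real kernel has its own limits. */
#define SKETCH_AF_MAX   4
#define SKETCH_NUMFIBS  3
#define SKETCH_AF_INET  2

/* Stand-in for the per-family FIB head that points at the route trie. */
struct fib_head { int id; };

/* Before the change this would have been rt_tables[SKETCH_AF_MAX]. */
static struct fib_head *rt_tables[SKETCH_NUMFIBS][SKETCH_AF_MAX];

/* Code unaware of the change only ever sees row 0, which has the
 * same shape as the old one dimensional array. */
static struct fib_head **legacy_rt_tables(void)
{
    return rt_tables[0];
}

/* FIB-aware lookup: only IPv4 honours the row, everything else uses row 0. */
static struct fib_head *fib_head_for(unsigned fib, int af)
{
    assert(fib < SKETCH_NUMFIBS && af >= 0 && af < SKETCH_AF_MAX);
    if (af != SKETCH_AF_INET)
        fib = 0;
    return rt_tables[fib][af];
}
```

The point of the sketch is that row 0 doubles as the old table, so unmodified callers need no changes.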
> The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign()
> are all maintained, but refer only to the first row of the array,
> so that existing callers in proprietary protocols can continue to
> do the "right thing".
> Some new entry points are added, for the exclusive use of ipv4 code
> called in_rtrequest(), in_rtalloc(), in_rtalloc1() and in_rtalloc_ign(),
> which have an extra argument which refers the code to the correct row.
> In addition, there are some new entry points (currently called
> dom_rtalloc() and friends) that check the address family being
> looked up and either call rtalloc() (and friends) if the protocol
> is not IPv4, forcing the action to row 0, or go to the appropriate row
> if it IS IPv4 (and that info is available). These are for calling
> from code that is not specific to any particular protocol. The way
> these are implemented would change in the non ABI preserving code
> to be added later.
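A rough sketch of how the three groups of entry points relate. The sketch_ prefixed names and their signatures are hypothetical stand-ins (the real rtalloc() family takes route structures, not plain integers), and lookup_row() merely reports which row a lookup would be directed to.

```c
#include <assert.h>

#define SKETCH_NUMFIBS 4
#define SKETCH_AF_INET 2

/* Stand-in for a real route lookup: just returns the row that was used. */
static int lookup_row(int af, unsigned fib)
{
    (void)af;
    assert(fib < SKETCH_NUMFIBS);
    return (int)fib;
}

/* Existing entry point: signature unchanged, always refers to row 0. */
static int sketch_rtalloc(int af)
{
    return lookup_row(af, 0);
}

/* New IPv4-only entry point: the extra argument selects the row. */
static int sketch_in_rtalloc(unsigned fib)
{
    return lookup_row(SKETCH_AF_INET, fib);
}

/* Protocol-independent entry point: dispatches on the address family. */
static int sketch_dom_rtalloc(int af, unsigned fib)
{
    if (af == SKETCH_AF_INET)
        return sketch_in_rtalloc(fib);
    return sketch_rtalloc(af);  /* non-IPv4 is forced to row 0 */
}
```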
> One feature of the first version of the code is that for ipv4,
> the interface routes show up automatically on all the FIBs, so
> that no matter what FIB you select you always have the basic
> direct attached hosts available to you. (rtinit() does this.)
> You CAN delete an interface route from one FIB should you want
> to but by default it's there. ARP information is also available
> in each FIB. It's assumed that the same machine would have the
> same MAC address, regardless of which FIB you are using to get
> to it.
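The interface route behaviour described above could be modelled like this. It is a toy model under assumptions: the boolean table and the sketch_ functions are hypothetical and only show the "add to every FIB, delete from one" semantics, not what rtinit() actually does internally.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_NUMFIBS 3
#define SKETCH_MAXIF   8

/* ifroute[fib][ifidx] is true when that FIB holds the interface route. */
static bool ifroute[SKETCH_NUMFIBS][SKETCH_MAXIF];

/* Bringing up an interface installs its route in every FIB by default. */
static void sketch_add_ifroute(unsigned ifidx)
{
    assert(ifidx < SKETCH_MAXIF);
    for (unsigned fib = 0; fib < SKETCH_NUMFIBS; fib++)
        ifroute[fib][ifidx] = true;
}

/* An administrator can still delete the route from one particular FIB. */
static void sketch_del_ifroute(unsigned fib, unsigned ifidx)
{
    assert(fib < SKETCH_NUMFIBS && ifidx < SKETCH_MAXIF);
    ifroute[fib][ifidx] = false;
}

static bool sketch_has_ifroute(unsigned fib, unsigned ifidx)
{
    assert(fib < SKETCH_NUMFIBS && ifidx < SKETCH_MAXIF);
    return ifroute[fib][ifidx];
}
```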
> This brings us to how the correct FIB is selected for an outgoing
> IPV4 packet.
> Packets fall into one of a number of classes.
> 1/ locally generated packets, coming from a socket/PCB.
> Such packets select a FIB from a number associated with the
> socket/PCB. This in turn is inherited from the process,
> but can be changed by a socket option. The process in turn
> inherits it on fork. I have written a utility called setfib
> that acts a bit like nice..
> setfib -n 3 ping target.example.com # will use fib 3 for ping.
> 2/ packets received on an interface for forwarding.
> By default these packets would use table 0
> (or possibly a number settable in a sysctl, not yet implemented),
> but prior to routing the firewall can inspect them (see below).
> 3/ packets inspected by a packet classifier, which can arbitrarily
> associate a fib with it on a packet by packet basis.
> A fib assigned to a packet by a packet classifier
> (such as ipfw) would over-ride a fib associated by
> a more default source. (such as cases 1 or 2).
> Routing messages would be associated with their
> process, and thus select one FIB or another.
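The three packet classes and the inheritance chain above can be put together in a small sketch. All struct and function names here are hypothetical, and SKETCH_FIB_UNSET is an assumed marker for "no classifier has tagged this packet"; the precedence order is the only part taken from the description.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_NUMFIBS   4
#define SKETCH_FIB_UNSET (-1)   /* assumed: classifier left the packet alone */

struct sketch_proc   { int fib; };
struct sketch_socket { int fib; };
struct sketch_packet {
    const struct sketch_socket *so;  /* NULL for packets being forwarded */
    int tag_fib;                     /* set by a classifier such as ipfw */
};

/* The child process inherits the FIB number on fork. */
static struct sketch_proc sketch_fork(const struct sketch_proc *parent)
{
    return *parent;
}

/* A new socket inherits the FIB number from its process. */
static struct sketch_socket sketch_socket_create(const struct sketch_proc *p)
{
    struct sketch_socket so = { p->fib };
    return so;
}

/* Stand-in for the socket option that changes the socket's FIB. */
static void sketch_so_setfib(struct sketch_socket *so, int fib)
{
    assert(fib >= 0 && fib < SKETCH_NUMFIBS);
    so->fib = fib;
}

/* Precedence: classifier tag, then socket/PCB, then the default table 0. */
static int sketch_select_fib(const struct sketch_packet *pkt)
{
    if (pkt->tag_fib != SKETCH_FIB_UNSET)
        return pkt->tag_fib;         /* class 3 overrides 1 and 2 */
    if (pkt->so != NULL)
        return pkt->so->fib;         /* class 1: locally generated */
    return 0;                        /* class 2: received for forwarding */
}
```

In this model, something like setfib would amount to setting the process FIB number before exec, so that every socket the command opens starts out in that FIB.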
> In addition netstat has been edited to be able to cope with the
> fact that the array is now 2 dimensional. (It looks in system
> memory using libkvm (!)).
> In addition two sysctls are added to give:
> a) the number of FIBs compiled in (active)
> b) the default FIB of the calling process.
> Early testing experience:
> Basically our (IronPort's) appliance does this functionality already
> using ipfw fwd but that method has some drawbacks.
> For example,
> It can't fully simulate a routing table because it can't influence the
> socket's choice of local address when a connect() is done.
> Testing during the generating of these changes has been
> remarkably smooth so far. Multiple tables have co-existed
> with no notable side effects, and packets have been routed.
> I have not yet added the changes to ipfw.
> pf has some similar changes already but they seem to rely on
> the various FIBs having symbolic names, which I do not plan to support
> in the first version of these changes.
> SCTP has interestingly enough built in support for this, called VRFs
> in Cisco parlance. It will be interesting to see how that handles it
> when it suddenly actually does something.
> I have not redone my testing since my last edits, but will be
> retesting with the current code asap.
> Where to next:
> After committing the ABI compatible version and MFCing it, I'd
> like to proceed in a forward direction in -current. This will
> result in some roto-tilling in the routing code.
> Firstly: the current code's idea of having a separate tree per
> protocol family, all of the same format, and pointed to by the
> 1 dimensional array is a bit silly. Especially when one considers that
> there is code that makes assumptions about every protocol having the same
> internal structures there. Some protocols don't WANT that
> sort of structure. (for example the whole idea of a netmask is foreign
> to appletalk). This needs to be made opaque to the external code.
> My suggested first change is to add routing method pointers to the
> 'domain' structure, along with information pointing to the data.
> Instead of having an array of pointers to uniform structures,
> there would be an array pointing to the 'domain' structures
> for each protocol address domain (protocol family),
> and the methods reached this way would be called. The methods would have
> an argument that gives FIB number, but the protocol would be free
> to ignore it.
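One possible shape for the proposal above, as a sketch only. The sketch_domain layout, the dom_rtdata and dom_rtlookup field names, and the lookup signature are assumptions made for illustration; nothing here is from the actual 'domain' structure beyond the idea of hanging routing methods off it.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_AF_MAX  4
#define SKETCH_NUMFIBS 2

/* Hypothetical 'domain' with opaque routing data and a lookup method. */
struct sketch_domain {
    int   dom_family;
    void *dom_rtdata;   /* layout known only to the protocol itself */
    int (*dom_rtlookup)(void *rtdata, unsigned fib, unsigned dst);
};

/* An IPv4-like protocol that keeps one next hop per FIB. */
struct sketch_inet_rt { int nexthop[SKETCH_NUMFIBS]; };

static int sketch_inet_lookup(void *rtdata, unsigned fib, unsigned dst)
{
    struct sketch_inet_rt *rt = rtdata;
    (void)dst;
    assert(fib < SKETCH_NUMFIBS);
    return rt->nexthop[fib];
}

/* A protocol with its own structure that is free to ignore the FIB. */
struct sketch_flat_rt { int nexthop; };

static int sketch_flat_lookup(void *rtdata, unsigned fib, unsigned dst)
{
    struct sketch_flat_rt *rt = rtdata;
    (void)fib;
    (void)dst;
    return rt->nexthop;
}

/* The array now points at domains rather than at uniform route tries. */
static struct sketch_domain *sketch_domains[SKETCH_AF_MAX];

/* External code calls the method and never looks inside dom_rtdata. */
static int sketch_route(int af, unsigned fib, unsigned dst)
{
    assert(af >= 0 && af < SKETCH_AF_MAX);
    struct sketch_domain *d = sketch_domains[af];
    if (d == NULL || d->dom_rtlookup == NULL)
        return -1;      /* no routing support for this family */
    return d->dom_rtlookup(d->dom_rtdata, fib, dst);
}
```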
> Interaction with the ARP layer/ LL layer would need to be
> revisited as well. Qing Li has been working on this already.
> for those with p4 access:
> p4 diff2 -du //depot/vendor/freebsd/src/sys/...@131121
> for those with the makediff perl script:
> perl ~/makediff.pl //depot/vendor/freebsd/src/sys/...@131121
> for those with neither:
> I just put the userland utility in usr.sbin/setfib/ in p4.
> and changes to netstat in usr.bin/netstat/
> I'd like to get comments on this (compat) version, so that I can
> commit it, get general testing under way to start the clock for MFC,
> and then get moving on the fuller implementation (that breaks ABIs)
> and other routing issues.
How does this work with Marko Zec's virtual stack system?