resend: multiple routing table roadmap (format fix)

Sat Jan 5 16:30:33 PST 2008

Vadim Goncharov wrote:
> 04.01.08 @ 00:52 Julian Elischer wrote:
> 
>>>> By the way, I might add that in the 6.x compat. version I may end up
>>>> limiting the feature to 8 tables. This is because I need to store some
>>>> stuff in an efficient way in the mbuf, and in a compatible manner 
>>>> this is easiest done by stealing the top 4 bits in the mbuf flags word
>>>> and defining them as:
>>>>
>>>>   #define M_HAVEFIB    0x10000000   /* a FIB number is stored */
>>>>   #define M_FIBMASK    0x07         /* 3 bits, so 8 FIBs */
>>>>   #define M_FIBNUM    0xe0000000    /* where the FIB number lives */
>>>>   #define M_FIBSHIFT    29
>>>>   #define m_getfib(_m, _default) (((_m)->m_flags & M_HAVEFIB) ? \
>>>>     (((_m)->m_flags >> M_FIBSHIFT) & M_FIBMASK) : (_default))
>>>>   #define M_SETFIB(_m, _fib) do { \
>>>>     (_m)->m_flags &= ~M_FIBNUM; \
>>>>     (_m)->m_flags |= (M_HAVEFIB | (((_fib) & M_FIBMASK) << M_FIBSHIFT)); \
>>>>   } while (0)
>>>>
>>>> This then becomes very easy to change to use a tag or
>>>> whatever is needed in later versions, and the number can
>>>> be expanded past 8 predefined FIBs at that time.
>>>  If you want it to be a tag, why spend bits in m_flags and not just
>>> do it as a tag at once? Or is the plan to completely throw away the
>>> 6.x (possibly 7.x too) implementation in favor of the right thing in 8.0?
>>
>> basically yes..
>>
>> I'm looking at just doing tags to start with, but haven't done it 
>> yet.. I'm looking for a good bit of tag code to copy :-)
> 
> Look at ipfw's O_ALTQ/O_TAG/O_TAGGED (and some other parts), ng_tag.c,
> ng_ipfw.c, ng_ksocket.c and some other stuff :-) Tags are simple: if 16
> bits are enough for you then you do not even have to allocate data, just use
> the tag_id member. Also they are easy to manipulate within netgraph with
> ng_tag, etc. But as a drawback, you have to allocate memory for them, and
> as it is M_NOWAIT, malloc() can return NULL in interrupt threads... So a
> new field in mbuf (or flags) would be better in terms of performance,
> but it will break ABI :(

so that may happen later.. this code is specifically to not break
ABIs.

The tag method worries me as overhead for potentially every packet
might be too much.  An in-mbuf field is the deluxe solution.
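
For anyone who wants to play with the flag-stealing scheme quoted above,
here is a minimal userland sketch. It is only an illustration under some
assumptions: `struct fake_mbuf` is a hypothetical stand-in for the real
mbuf, the flags word is taken to be 32 bits wide, and the accessors are
written as inline functions instead of the macros in the quote.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel mbuf; only the flags word matters. */
struct fake_mbuf {
	uint32_t m_flags;
};

#define M_HAVEFIB   0x10000000u  /* bit 28: a FIB number has been stored */
#define M_FIBNUM    0xe0000000u  /* bits 29-31: the FIB number itself */
#define M_FIBMASK   0x07u        /* three bits, so at most 8 FIBs */
#define M_FIBSHIFT  29

/* Return the stored FIB, or the caller's default if none was stored. */
static inline unsigned
m_getfib(const struct fake_mbuf *m, unsigned def)
{
	if (m->m_flags & M_HAVEFIB)
		return (m->m_flags >> M_FIBSHIFT) & M_FIBMASK;
	return def;
}

/* Store a FIB number in the top bits without touching the other flags. */
static inline void
m_setfib(struct fake_mbuf *m, unsigned fib)
{
	assert(fib <= M_FIBMASK);
	m->m_flags &= ~M_FIBNUM;
	m->m_flags |= M_HAVEFIB | ((uint32_t)fib << M_FIBSHIFT);
}
```

Reading costs one test and one shift per packet and needs no allocation,
which is the attraction compared with a tag; the price is the hard limit
of 8 tables.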

> 
> I don't have m_tag_alloc() measurements, though. Doing 'ipfw add 1 tag 1 
> ip from any to any' on a 15 kpps 6.2 router didn't cause any noticeable 
> slowdown while looking for half a minute at 'systat -vm 1'...

that already has ipfw overhead.
it may be noticeable if you are comparing adding and reading tags in a
data path with no ipfw overhead.

> 
>>   setfib 3 /bin/sh
>>
>> now by default everything you do uses table 3.
>> or even
>>
>> setfib 3 jail {blah}
>>
>> and all the procs in the jail use table 3. You also need to do
>> setfib 3 jexec xxx
>> for extra processes you add to the jail afterwards.
> 
> Maybe introduce a field in struct prison to make it possible without
> additional commands?

yes it's in my original description email that that may be an option.

> 
>>>>>> 2/ packets received on an interface for forwarding.
>>>>>>     By default these packets would use table 0,
>>>>>>     (or possibly a number settable in a sysctl(not yet)).
>>>>>>     but prior to routing the firewall can inspect them (see below).
>>>>>>
>>>>>> 3/ packets inspected by a packet classifier, which can arbitrarily
>>>>>>     associate a fib with it on a packet by packet basis.
>>>>>>     A fib assigned to a packet by a packet classifier
>>>>>>     (such as ipfw) would over-ride a fib associated by
>>>>>>     a more default source. (such as cases 1 or 2).
>>>  Sounds good. I like the idea of doing routing decisions in the firewall, to not
>>> double kernel code and userspace utilities, like in Linux's iproute2
>>> (which, however, still has a few parameters and relies on firewall
>>> marks for others). However, there are some cases, I think, where it
>>> could be done outside the firewall. For example, make an ifconfig option
>>> to use a specific FIB as a default for all packets outgoing from this
>>> interface's address. But here arises another related question - Linux
>>> allows selecting a specific src IP based on a routing table entry -
>>> destination address (thoughts about pf reply-to/route-to, huh).
>>
>> that is default here too if I understand what you are talking about.
>> the src address is selected from the routing table's exit interface.
>> In the code I'm showing in perforce, that address would depend on 
>> which table your process was associated with. (or just the socket if 
>> you have used the socket option on it before doing the bind/connect)
> 
> What I'm talking about is adding possibility for future MPLS/VRF/etc. 
> For example, if we make an interface option to use a specific FIB on 
> that interface, for every incoming packet (put a tag on early input?), 
> then ARP replies, ICMP redirects (yes, make stack to process them to 
> particular FIB if specified, not to main) and so on will affect only 
> this table. Then, it will be possible, say, to have 192.168.0.0/24 on 
> em0 and also have 192.168.0.0/24 on em1, but those networks are
> completely independent of each other on both L2 and L3 (different 
> customers) - after that, a change allowing to have the same IP address 
> on different interfaces will lead to complete virtual independence. 
> Without any vimages - why do we need separate TCP stacks etc. copies on 
> a router without any jails, under a single administrator's control?
> 
> Yes, this may be difficult with planned L2/L3 separation (currently ARP 
> table is in fact part of FIB), but it is solvable - say, by binding an 
> ARP table to one or several FIBs. Moreover, I think that complete stack 
> virtualization in each jail/vimage is a waste of resources - instead one or
> several FIBs/interfaces/ARP tables can be bound to each vimage/jail, 
> possibly with write permissions.

I'm a great believer in vimage. I don't want to duplicate that
functionality.

> 
> And even if all of the above is considered far future and/or will be done
> a different way, FIB binding to interface is still useful for (both
> incoming and) outgoing packets to make a firewall ruleset simpler.

"maybe"

> 
>>> In relation to this I can remember multipath routing (different 
>>> metrics?), addresses from one subnet on different ifaces (mask wider 
>>> /32) and so on.
>>> Also it is interesting, how multiple FIBs would interact with 
>>> host-wide events, such as ICMP redirects (which table should be 
>>> updated?), storing of TCP stack metrics (MTU, etc.) and hostcache, 
>>> and so on. How these and above will be solved?..
>>
>> I'm not really too knowledgeable about multicast..

typo .. I meant multipath.

> 
> Is multicast and multipath routing the same?
> 
>>> per ifconfig (>1 host per subnet)/icmp redirects/src to prefer, 
>>> multipath/metrics, tcp stack parameters interaction, iproute2
>>
>> I'm not trying to solve problems that need vimage to solve them..
> 
> Umm, what vimage?.. :) I forgot to clear these keywords written for 
> myself when writing the draft and explaining them in detail, sorry :)

Marko's vimage code solves much of this in a much cleaner manner.
I'm hoping that we will eventually have multiple routing tables
in multiple vimages.

> 
>>>>>> Routing messages would be associated with their
>>>>>> process, and thus select one FIB or another.
>>>  This is not clear. How should the 'route' command work with 
>>> different FIBs, if they are supposed by admin to be used for 
>>> forwarding, and not the straight per-process? I think a setfib option 
>>> is more consistent than running route under setfib command. Also, 
>>> routing sockets and routing daemons - should they work with only one 
>>> table?..
>>
>> if you do
>> setfib 3 route get 1.1.1.1
>>
>> you may get a different result from
>>
>> setfib 2 route get 1.1.1.1
>>
>> I will add a fibnum argument to route itself as well but it's not 
>> needed immediately as long as I have the setfib command.
> 
> OK, but we should think about it in the future. In theory, routing 
> socket's messages are easily extendable with FIB number in uint16_t, as 
> message keeps its length...

I will do that with the advice of people who know that protocol better 
than I do.
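
To picture why the two `setfib N route get 1.1.1.1` commands quoted above
can give different answers, here is a small userland sketch. Everything in
it is an assumption made for illustration: the `struct route` and
`struct fib` layouts, the `NFIBS` count and the linear longest-prefix
search are invented here and are not how the kernel routing tables are
stored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NFIBS 4  /* hypothetical number of tables for this sketch */

struct route {
	uint32_t    prefix;  /* network address, host byte order */
	uint8_t     plen;    /* prefix length, 0..32 */
	const char *gw;      /* next hop, kept as a label for the demo */
};

struct fib {
	const struct route *routes;
	size_t              nroutes;
};

/* Netmask for a prefix length; plen 0 is special-cased to avoid a 32-bit shift. */
static inline uint32_t
plen_mask(uint8_t plen)
{
	return plen == 0 ? 0 : 0xffffffffu << (32 - plen);
}

/* Longest-prefix match, consulting only the table named by fibnum. */
static inline const char *
fib_lookup(const struct fib *fibs, unsigned fibnum, uint32_t dst)
{
	const struct route *best = NULL;

	assert(fibnum < NFIBS);
	for (size_t i = 0; i < fibs[fibnum].nroutes; i++) {
		const struct route *r = &fibs[fibnum].routes[i];

		if ((dst & plen_mask(r->plen)) != r->prefix)
			continue;
		if (best == NULL || r->plen > best->plen)
			best = r;
	}
	return best != NULL ? best->gw : NULL;
}
```

The same destination passed with a different `fibnum` walks a different
table, so it can come back with a different next hop or with none at all.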

> 
>>>>>> I have not yet added the changes to ipfw.
>>>  Action modifier, like 'ipfw add count setfib 3 ip from any to any' ? 
>>> There were thoughts (I heard, as a hack before multiple FIBs) about
>>> making an additional, say, 'nexthop' ipfw action, which acts like 
>>> fwd, but does not accept packet, allowing to continue it through 
>>> firewall ruleset - thus making it more comfortable to separate 
>>> routing (imagine 'nexthop tablearg') and filtering. There are 
>>> questions with both fwd and new supposed option: will fwd still 
>>> survive? Will it change the output interface, like as complete 
>>> rerouting before calling pfil(9) hooks, so that *oif will be changed 
>>> to be matched in rules below? pf route-to/reply-to is hanging around...
>>
>> The 'nexthop' call you suggest is problematic because it needs to
>> return information immediately, which is why it is terminal.
> 
> Um, why? Why it can't continue through ruleset? I don't know 
> implementation details of routing and 'ipfw fwd', alas,

the way the nexthop/fwd command is implemented, the rule needs to
return to the caller immediately.

> 
>> As for the setfib ipfw action, I have now done this in p4.
>>
>> ipfw add 200 setfib 3 ip from any to any in receive em0
>>
>> now works.
>> This lessens the need for associating a fib with an interface as the 
>> firewall can do that too..
>>
>> the setfib rule is not terminal. (hmm need to check I did that right.)
> 
> Oh, if it works, that's cool.
> 
>> you can also do
>> ipfw add 200 skipto 300 ip from any to any hasfib
>>   # to select on a packet that has a fib associated with it already.
>> ipfw add 200 skipto 300 ip from any to any fib 4
>>   # to select packets that are associated with fib 4
>> ipfw add 200 clrfib ip from any to any
>>   # to remove a fib association from the packet.
> 
> Do we need a separate keyword 'clrfib' while it could be 'setfib 0' ? Or 
> at least save one opcode in kernel's ipfw. Also, it would be nice to 
> have 'setfib tablearg' together with reserving 16 bits for FIB number - 
> some systems with hundreds of vlans will want to have more than 256 
> tables, I think...

having an override fib is different from having a fib of 0.
I'm not sure about tablearg yet.. I've considered it but not in the
first version..
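
The distinction can be made concrete with a small sketch. It assumes,
hypothetically, that each packet carries a "have override" flag next to
the FIB number, much as the M_HAVEFIB bit quoted earlier suggests. The
names `pkt_fib`, `pkt_setfib`, `pkt_clrfib` and `pkt_resolve_fib` are
invented for illustration and are not the code in p4.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-packet state: an optional override plus its value. */
struct pkt_fib {
	bool    have_fib;  /* set by a classifier rule such as "setfib N" */
	uint8_t fib;       /* meaningful only while have_fib is true */
};

/* "setfib N": pin the packet to table N, even when N is 0. */
static inline void
pkt_setfib(struct pkt_fib *p, uint8_t fib)
{
	assert(p != NULL);
	p->have_fib = true;
	p->fib = fib;
}

/* "clrfib": drop the override so the default source applies again. */
static inline void
pkt_clrfib(struct pkt_fib *p)
{
	assert(p != NULL);
	p->have_fib = false;
	p->fib = 0;
}

/* A classifier override wins; otherwise use the socket/process default. */
static inline unsigned
pkt_resolve_fib(const struct pkt_fib *p, unsigned default_fib)
{
	assert(p != NULL);
	return p->have_fib ? p->fib : default_fib;
}
```

With a process default of 3, a packet marked `setfib 0` resolves to
table 0, while the same packet after `clrfib` resolves to table 3 again.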

> 
>>>>>> Interaction with the ARP layer/ LL layer would need to be
>>>>>> revisited as well. Qing Li has been working on this already.
>>>  Oh yes, L2 interaction is interesting. How it should work in case of 
>>> planned separation of routing and ARP tables?..
> 
> I've explained my views about it above...
>