[patch] interface routes

Alexander V. Chernikov melifaro at FreeBSD.org
Thu Mar 7 21:26:37 UTC 2013


On 08.03.2013 00:53, Andre Oppermann wrote:
> On 07.03.2013 16:34, Alexander V. Chernikov wrote:
>> On 07.03.2013 17:51, Andre Oppermann wrote:
>>> On 07.03.2013 14:38, Ermal Luçi wrote:
>>>> Isn't it better to teach the routing code about metrics?
>>>> Routing daemons cope better this way and can handle this themselves,
>>>> so the policy of this behaviour can be controlled by the administrator
>>>> rather than by code!
>>>> With metrics you can add routes with a bigger metric for interfaces
>>>> and a lower one from routing daemons.
>>>> This could also somewhat mitigate the case of multiple interfaces
>>>> configured with the same subnet.
>>>
>>> Generally I agree with you that this would be the ideal outcome.
>>> However we're still quite a bit away from reaching that goal.
>>> To make this really work we have to make mpath plus metrics first-class
>>> citizens in the routing code and also update the routing daemons'
>>> kernel interfaces to know about this. I hope we get there
>>> in the not too distant future.
>>
>> Radix is already over-bloated. Typically in performance-oriented
>> solutions (hardware/software routers from vendors) there is a clear
>> separation between the RIB (where route protocol attributes, best
>> candidate routes, and routes with different priorities exist) and the
>> FIB, which is typically some kind of radix with the minimum needed
>> info, e.g.: prefix, nexthops, their interfaces, optional L2 data to prepend.
>
> ACK. Though the bloat in itself is not the main problem other than kernel
> memory consumption. If you think of it in cache line misses everything
> more than 128 bytes away is potentially a cache miss. The additional
> distance due to a large or small structure makes no difference. What
> makes an important difference is the internal layout of the structure
> and whether the relevant variables are within the same cache line.
> This can be a problem in a large structure when some data is at the
> beginning and other data at the end on a different cache line. Here
> potentially twice the cache miss latency per trie element hurts.
Yup. I'm talking in cache line terms only.
>
> If we can manage to put everything for a trie search into the first
> cache line we're quite good already. The additional win for tighter
> packing isn't that large anymore.
>
>> Our radix stands somewhere between RIB and FIB (since we have to support
>> route(8) and upper layer protocols): it serves badly as RIB (little
>> functionality) and as FIB: too much overhead and inefficient/too general
>> code.
>
> ACK. There is a big philosophical question on the model. Make it a
> RIB so that independent but complementary routing daemons can add
> routes concurrently and the kernel knows which have higher priority
> or are equal cost for traffic balancing (as in bgpd+ospfd). Or strip
> it to a FIB and have an external program do the RIB and coordination
> across routing daemons (as in Quagga suite).
>
>> For example, sizeof(rt_nodes[2]) (first element of rte) is 96 bytes on
>> amd64.
>
> That is a problem if the trie traversal function accesses fields beyond
> this cache line. The main problem is that key and mask are pointers
> and thus external to the radix_node, adding even more cache misses.
Yes.
>
>> Additionally, rte refcount approach is totally broken.
>
> ACK. Copy and out. No references or external pointers into the table.
>
>> I'm currently thinking of adding some kind of hooks to current
>> route/radix code to permit building efficient trie (or other structure)
>> for given address family and to use it for forwarding purposes only.
>
> AFAIK Marko Zec and/or Luigi have done some work in this area as well.
>
>> For example, I don't need trie while doing MPLS label switching:
>> assuming control plane allocates contiguous label space, I can use label
>> array for efficient lookup.
>
> Nobody's forcing you to use a radix trie for MPLS. In theory each
> protocol can choose its own best method.
Well, actually this is not quite true, and that is the problem.

Userland has to manage kernel MPLS entries somehow, and the route socket 
is bound to radix pretty heavily. Additionally, our route(8) abuses the 
kvm(3) interface and simply walks through the in-kernel radix tree to 
print routes and additional information like refcounts/use counts. There 
is very old (but still working) code there that prints more or less the 
same via the sysctl API, but the additional info is not propagated.
>



More information about the freebsd-net mailing list