[patch] interface routes

Thu Mar 7 20:53:40 UTC 2013

On 07.03.2013 16:34, Alexander V. Chernikov wrote:
> On 07.03.2013 17:51, Andre Oppermann wrote:
>> On 07.03.2013 14:38, Ermal Luçi wrote:
>>> Isn't it better to teach the routing code about metrics.
>>> Routing daemons cope better this way and they can handle this.
>>> So the policy of this behaviour can be controlled by administrator
>>> rather than by code!
>>> With metrics you can add routes with bigger metric for interfaces and
>>> lower from routing daemons.
>>> This also can mitigate somehow on interfaces with the same subnet
>>> configured possibly.
>>
>> Generally I agree with you that this would be the ideal outcome.
>> However we're still quite a bit away from reaching that goal.
>> To make this really work we have to make mpath plus metrics a first
>> class citizen in the routing code and also update the routing
>> daemons' kernel interfaces to know about this.  I hope we get there
>> in the not too distant future.
>
> Radix is already over-bloated. Typically in performance-oriented
> solutions (hardware/software routers from vendors) there is clear
> separation between RIB (where route protocol attributes, best candidate
> routes, routes with different priority exists) and FIB, which is
> typically some kind of radix with minimum needed info, e.g:
> prefix, nexthops, their interfaces, optional L2 data to prepend.

ACK.  Though the bloat in itself is not the main problem other than kernel
memory consumption.  If you think of it in cache line misses everything
more than 128 bytes away is potentially a cache miss.  The additional
distance due to a large or small structure makes no difference.  What
makes an important difference is the internal layout of the structure
and whether the relevant variables are within the same cache line.
This can be a problem in a large structure when some data is at the
beginning and other data at the end on a different cache line.  Here
potentially twice the cache miss latency per trie element hurts.

If we can manage to put everything for a trie search into the first
cache line we're quite good already.  The additional win for tighter
packing isn't that large anymore.
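
To illustrate the layout point, here is a minimal sketch, not the
actual radix code.  The struct, its field names and the 64-byte line
size are assumptions made up for this example.  The fields a lookup
touches are grouped at the start, and a compile-time check fails the
build if one of them drifts past the first cache line:

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* offsetof */
#include <stdint.h>

#define CACHE_LINE 64   /* assumed line size, typical on amd64 */

/* Hypothetical trie node: fields read during a lookup come first. */
struct trie_node {
    /* hot: touched on every step of the descent */
    struct trie_node *child[2];  /* left/right subtrees */
    uint32_t key;                /* IPv4 prefix, host byte order */
    uint32_t mask;               /* netmask for the prefix */
    uint16_t bit;                /* bit index tested at this node */
    uint16_t flags;
    uint32_t nexthop_idx;        /* index into a separate nexthop table */
    /* cold: only needed by the control plane */
    uint64_t packets;
    uint64_t bytes;
    uint64_t expire;
    char     descr[64];
};

/* Fail the build if a hot field ends outside the first cache line. */
#define HOT(field)                                                  \
    static_assert(offsetof(struct trie_node, field) +               \
        sizeof(((struct trie_node *)0)->field) <= CACHE_LINE,       \
        #field ": hot field outside first cache line")

HOT(child);
HOT(key);
HOT(mask);
HOT(bit);
HOT(nexthop_idx);
```

The whole struct is larger than one line here, but that does not cost
the lookup anything as long as the descent never reads the cold part.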

> Our radix stands somewhere between RIB and FIB (since we have to support
> route(8) and upper layer protocols): it serves badly as RIB (little
> functionality) and as FIB: too much overhead and inefficient/too general
> code.

ACK.  There is a big philosophical question on the model.  Make it a
RIB so that independent but complementary routing daemons can add
routes concurrently and the kernel knows which have higher priority
or are equal cost for traffic balancing (as in bgpd+ospfd).  Or strip
it to a FIB and have an external program do the RIB and coordination
across routing daemons (as in Quagga suite).
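
As a rough sketch of what the RIB side has to decide in either model:
given several candidate routes for the same prefix, pick the ones with
the best priority and hand only those nexthops to the FIB.  All names
and the "lower value wins" convention are assumptions for
illustration, not an existing kernel or daemon interface:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical RIB candidate as announced by one routing daemon. */
struct rib_route {
    uint32_t prefix;    /* IPv4 prefix, host byte order */
    uint8_t  plen;      /* prefix length */
    uint8_t  prio;      /* assumed convention: lower value wins */
    uint32_t nexthop;   /* opaque nexthop id */
};

/*
 * Collect the nexthops of the best-priority candidates for prefix/plen.
 * Equal-priority candidates are all returned (equal cost), up to max.
 * Returns the number of nexthops written to nh[].
 */
size_t
rib_select(const struct rib_route *rib, size_t n, uint32_t prefix,
    uint8_t plen, uint32_t *nh, size_t max)
{
    uint8_t best = UINT8_MAX;
    size_t cnt = 0;

    assert(nh != NULL || max == 0);
    for (size_t i = 0; i < n; i++) {
        if (rib[i].prefix != prefix || rib[i].plen != plen)
            continue;
        if (rib[i].prio < best) {   /* strictly better: restart the set */
            best = rib[i].prio;
            cnt = 0;
        }
        if (rib[i].prio == best && cnt < max)
            nh[cnt++] = rib[i].nexthop;
    }
    return (cnt);
}
```

Whether this selection lives in the kernel or in an external program
is exactly the model question; the logic itself is the same.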

> For example, sizeof(rt_nodes[2]) (first element of rte) is 96 bytes on
> amd64.

That is a problem if the trie traversal function accesses fields beyond
the first cache line.  The main problem is that key and mask are pointers
and thus external to the radix_node adding even more cache misses.
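
A small sketch of the difference, simplified to IPv4 and with made-up
struct names (this is not the real radix_node).  With pointers the
match has to dereference two more locations that may each be a cache
miss; with inline key and mask everything comes with the node itself:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pointer-based layout: key and mask live somewhere else in memory. */
struct node_ptr {
    const uint32_t *key;
    const uint32_t *mask;
};

/* Inline layout: key and mask are part of the node. */
struct node_inl {
    uint32_t key;
    uint32_t mask;
};

/* Two extra dereferences, each potentially a separate cache miss. */
int
match_ptr(const struct node_ptr *n, uint32_t dst)
{
    assert(n != NULL && n->key != NULL && n->mask != NULL);
    return ((dst & *n->mask) == *n->key);
}

/* Everything needed is already in the node's own cache line. */
int
match_inl(const struct node_inl *n, uint32_t dst)
{
    assert(n != NULL);
    return ((dst & n->mask) == n->key);
}
```

The inline variant only works directly for fixed-size keys, so a
generic trie would need a per-address-family node type.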

> Additionally, rte refcount approach is totally broken.

ACK.  Copy and out.  No references or external pointers into the table.

> I'm currently thinking of adding some kind of hooks to current
> route/radix code to permit building efficient trie (or other structure)
> for given address family and to use it for forwarding purposes only.

AFAIK Marko Zec and/or Luigi have done some work in this area as well.

> For example, I don't need trie while doing MPLS label switching:
> assuming control plane allocates contiguous label space, I can use label
> array for efficient lookup.

Nobody's forcing you to use a radix trie for MPLS.  In theory each
protocol can choose its own best method.
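
For the label case a direct-indexed table could look roughly like the
following.  This is only a sketch: the struct names, fields and the
copy-out return convention are assumptions, and it presumes the
control plane really does hand out one contiguous label range:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical label forwarding entry. */
struct lfib_entry {
    uint32_t out_label;   /* label to swap in */
    uint16_t out_ifidx;   /* outgoing interface index */
    uint8_t  valid;       /* slot in use */
};

/* Table covering the contiguous label range [base, base + count). */
struct lfib {
    uint32_t base;
    uint32_t count;
    const struct lfib_entry *tbl;
};

/*
 * O(1) lookup by array index.  The entry is copied out so the caller
 * never holds a pointer into the table.  Returns 0 on hit, -1 on miss.
 */
int
lfib_lookup(const struct lfib *l, uint32_t label, struct lfib_entry *out)
{
    uint32_t idx;

    assert(l != NULL && out != NULL);
    idx = label - l->base;   /* wraps to a huge value if label < base */
    if (idx >= l->count || !l->tbl[idx].valid)
        return (-1);
    *out = l->tbl[idx];
    return (0);
}
```

The unsigned subtraction folds the lower and upper bounds check into a
single comparison.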

-- 
Andre