MPLS

Andre Oppermann andre at freebsd.org
Mon Mar 18 13:41:17 UTC 2013


On 18.03.2013 13:20, Alexander V. Chernikov wrote:
> On 17.03.2013, at 23:54, Andre Oppermann <andre at freebsd.org> wrote:
>
>> On 17.03.2013 19:57, Alexander V. Chernikov wrote:
>>> On 17.03.2013 13:20, Sami Halabi wrote:
>>>>> OTOH OpenBSD has a complete implementation of MPLS out of the box, maybe
>>> Their control plane code is mostly useless due to its design approach (routing daemons talk via the kernel).
>>
>> What's your approach?
> It is actually not mine. We have discussed this a bit in the radix-related thread. Generally quagga/bird (and other high-performance hardware-accelerated and software routers) have a feature-rich RIB from which the best routes (possibly multipath) are installed into the kernel/FIB. The kernel's main task should be to do efficient lookups, while every other advanced feature should be implemented in userland.

Yes, we have started discussing it but haven't reached a conclusion between the
two philosophies.  We have also agreed that the current radix code is horrible
in terms of cache misses per lookup.  That however doesn't preclude an agnostic
FIB+RIB approach.  It's mostly a matter of structure layout to keep it efficient.
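
For illustration, a cache-conscious FIB could use fixed-stride nodes sized to a
single 64-byte cache line, so each step of a longest-prefix match costs one line
fetch instead of the pointer chase we have today.  The layout and names below
are made up for the sake of the example, not existing code:

#include <stdint.h>

#define FIB_STRIDE	4	/* key bits consumed per level */

/* Hypothetical node: header plus 16 child indices, padded to 64 bytes. */
struct fib_node {
	uint32_t	nhop_idx;		/* next hop if a prefix ends here, 0 = none */
	uint16_t	child[1 << FIB_STRIDE];	/* node-pool indices, 0 = no child */
	uint16_t	pad[14];		/* pad to one cache line */
} __attribute__((aligned(64)));

/* Walk the trie, consuming FIB_STRIDE key bits per node. */
static uint32_t
fib_lookup(const struct fib_node *pool, uint16_t root, uint32_t key)
{
	uint32_t best = 0;
	uint16_t idx = root;

	while (idx != 0) {
		const struct fib_node *n = &pool[idx];

		if (n->nhop_idx != 0)
			best = n->nhop_idx;	/* longest match so far */
		idx = n->child[key >> (32 - FIB_STRIDE)];
		key <<= FIB_STRIDE;
	}
	return (best);
}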

>>> Their data plane code, well.. Yes, we can use some defines from their headers, but that's all :)
>>>>> porting it would be short and more straightforward than porting the Linux LDP
>>>>> implementation of BIRD.
>>>
>>> It is not a 'linux' implementation. LDP itself is cross-platform.
>>> The most tricky place here is control plane.
>>> However, making _fast_ MPLS switching is tricky too, since it requires changes in our netisr/ethernet
>>> handling code.
>>
>> Can you explain what changes you think are necessary and why?
>
> We definitely need the ability to dispatch chains of mbufs - this was already discussed in the Intel RX ring lock thread on -net.

Actually I'm not so convinced of that.  Packet handling is a tradeoff between
doing process-to-completion on each packet and doing context switches on batches
of packets.

Every few years the balance tilts back and forth between process-to-completion
and batch processing.  DragonFly went with a batch-lite token-passing approach
throughout their kernel.  It seems it didn't work out to the extent they expected.
Now many parts are moving back to the more traditional locking approach.

> Currently a significant number of drivers support interrupt moderation, permitting several/tens/hundreds of packets to be received per interrupt.

But they've also started to provide multiple queues.

> For each packet we have to run some basic checks, PFIL hooks, netisr code, and L3 code, resulting in many locks being acquired/released per packet.

Right; on the other hand, you'll likely run into serious interlock and latency
issues when large batches of packets monopolize certain locks, preventing other
interfaces from sending their batches up.

> Typically we rely on the NIC to put a packet in a given queue (direct isr), which works badly for non-hashable types of traffic like GRE, PPPoE, and MPLS. Additionally, the hashing function is either standard (from M$ NDIS) or documented, permitting someone malicious to generate 'special' traffic matching a single queue.

Malicious traffic is always a problem, no matter how many queues you have.

> Currently, even if we can add an m2flowid/m2cpu function able to hash, say, GRE or MPLS, it is inefficient since we have to lock/unlock the netisr queues for every packet.

Yes, however I'm arguing that our locking strategy may be broken or sub-optimal.
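
For reference, hashing MPLS in software is not the hard part; an m2flowid-style
helper could be as small as the sketch below (the function and macros are made
up for illustration, nothing like this exists in the tree).  The real cost is
the per-packet queue locking around it:

#include <stddef.h>
#include <stdint.h>

#define MPLS_SHIM_LABEL(x)	(((x) >> 12) & 0xfffff)	/* 20-bit label */
#define MPLS_SHIM_BOS(x)	(((x) >> 8) & 0x1)	/* bottom-of-stack bit */

/* Derive a flow id from the MPLS label stack starting at 'p'. */
static uint32_t
mpls_flowid(const uint8_t *p, size_t len, uint32_t seed)
{
	uint32_t h = seed, shim;

	while (len >= 4) {
		shim = (uint32_t)p[0] << 24 | (uint32_t)p[1] << 16 |
		    (uint32_t)p[2] << 8 | p[3];
		h = h * 33 + MPLS_SHIM_LABEL(shim);	/* simple multiplicative mix */
		if (MPLS_SHIM_BOS(shim))
			break;			/* last label in the stack */
		p += 4;
		len -= 4;
	}
	return (h);
}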

> I'm thinking of
> * utilizing the m_nextpkt field in the mbuf header

OK.  That's what it is there for.

> * adding some nh_chain flag to netisr
> If a given netisr does not support the flag and nextpkt is not NULL, we simply call that netisr in a loop.
> * the netisr hash function accepts an mbuf 'chain' and a pointer to an array (sizeof N * ptr), sorts the mbufs into N netisr queues, saving the list heads to the supplied array. After that we put the given lists on the appropriate queues.
> * teach the ethersubr RX code to deal with mbuf chains (not an easy one)
> * add some partial support for handling chains to the fastfwd code

I really don't think this is going to help much.  You're just adding a lot of
latency and context switches to the whole packet path.  Also you're making it
much more complicated.
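
For clarity, the sorting step in the quoted proposal amounts to roughly the
following.  This is only a sketch: 'struct pkt' and its nextpkt field stand in
for struct mbuf and m_nextpkt, and the flow-to-queue mapping is reduced to a
modulo; none of these names are existing kernel interfaces.

#include <stddef.h>
#include <stdint.h>

struct pkt {
	struct pkt	*nextpkt;	/* stands in for m_nextpkt */
	uint32_t	flowid;		/* whatever the hash function produced */
};

/*
 * Bucket a chain of packets into nqueues per-netisr lists in one pass,
 * so each queue lock is taken once per batch instead of once per packet.
 */
static void
sort_chain(struct pkt *chain, struct pkt **heads, struct pkt **tails,
    unsigned nqueues)
{
	struct pkt *m, *next;
	unsigned q;

	for (q = 0; q < nqueues; q++)
		heads[q] = tails[q] = NULL;

	for (m = chain; m != NULL; m = next) {
		next = m->nextpkt;
		m->nextpkt = NULL;
		q = m->flowid % nqueues;
		if (tails[q] == NULL)
			heads[q] = m;
		else
			tails[q]->nextpkt = m;
		tails[q] = m;
	}
	/* The caller then appends heads[q]..tails[q] to queue q under one lock. */
}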

The interface drivers and how they manage the boundary between the RX ring and
the stack are not optimal yet.  I think there's a lot of potential there.  In
my tcp_workqueue branch I started to experiment with a couple of approaches.
It's not complete yet though.

The big advantage of having the interface RX thread pushing the packets is
that it provides a natural feedback loop regarding system load.  Once you
have more packets coming in than you can process, the RX DMA ring gets
naturally starved and the load is stabilized on the input side, preventing
the live-lock that can easily happen in batch mode.  Only a well-adjusted
driver works properly in that regard, though, and we don't have any yet.
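
As a sketch of that feedback loop, with made-up helpers standing in for whatever
a real driver does (hw_rx_next(), stack_input() and hw_rx_refill() are
placeholders, not any in-tree API):

#include <stddef.h>

struct rx_ring;				/* opaque, hypothetical */
struct pkt;				/* opaque, hypothetical */

struct pkt	*hw_rx_next(struct rx_ring *);		/* pull one received packet */
void		 stack_input(struct pkt *);		/* process-to-completion */
void		 hw_rx_refill(struct rx_ring *, int);	/* return descriptors to the NIC */

/*
 * One pass of the RX thread: process at most 'budget' packets to
 * completion.  Anything beyond the budget stays on the DMA ring, so
 * under overload the ring fills up and the NIC drops in hardware,
 * which is the natural back-pressure that avoids a live-lock.
 */
static void
rx_pass(struct rx_ring *ring, int budget)
{
	struct pkt *m;
	int done = 0;

	while (done < budget && (m = hw_rx_next(ring)) != NULL) {
		stack_input(m);
		done++;
	}
	if (done > 0)
		hw_rx_refill(ring, done);
}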

Before we start inventing complicated mbuf batching methods, let's make sure
that the single packet path is at its maximal possible efficiency.  Only then
should we evaluate more complicated approaches on whether they deliver
additional gains.

From that it follows that we should:

  1. fix longest prefix match radix to minimize cache misses.

  2. fix drivers to optimize RX dequeuing and TX enqueuing.

  3. have a critical look at other parts of the packet path to avoid
     or optimize costly operations (in_local() for example).

-- 
Andre


