misc/164130: broken netisr initialization

Mon Jan 30 10:28:28 UTC 2012

On 17 Jan 2012, at 17:41, Коньков Евгений wrote:

> Loads only netisr3.
> and question: ip works over ethernet. How you can distinguish ip and ether???

netstat -Q is showing you per-protocol (layer) processing statistics. An IP packet arriving via ethernet will typically be counted twice: once for ethernet input/decapsulation, and once for IP-layer processing. Netisr dispatch serves a number of purposes, not least preventing excessive stack depth/recursion and load balancing.

There has been a historic tension between deferred (queued) dispatch to a separate worker and direct dispatch ("process to completion"). The former offers more opportunities for parallelism and reduces latency during interrupt-layer processing. However, the latter reduces overhead and overall packet latency for higher-level parallelism by avoiding queueing/scheduling overheads, as well as avoiding packet migration between caches, reducing cache coherency traffic. Our general experience is that many common configurations, especially lower-end systems *and* systems with multi-queue 10Gbps cards, prefer direct dispatch. However, there are forwarding scenarios, or ones in which CPU count significantly outnumbers NIC input queue count, where queueing to additional workers can markedly improve performance.
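
To make the distinction concrete, here is a minimal user-space sketch of the two dispatch styles. It is not the netisr implementation: every name in it (sketch_queue, sketch_dispatch_direct, sketch_dispatch_deferred, sketch_worker_drain, sketch_count_handler) is hypothetical, packets are reduced to plain integers, and locking and scheduling are left out entirely.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_QLEN 8	/* hypothetical queue depth */

/* Hypothetical handler type: processes one "packet" (here just an int). */
typedef void (*sketch_handler_t)(int pkt, void *arg);

/* Fixed-size FIFO standing in for a per-worker input queue. */
struct sketch_queue {
	int	pkts[SKETCH_QLEN];
	size_t	head, tail, count;
};

/* Example handler: counts the packets it has processed. */
void
sketch_count_handler(int pkt, void *arg)
{
	(void)pkt;
	(*(int *)arg)++;
}

/* Direct dispatch: run the handler to completion in the caller's context. */
void
sketch_dispatch_direct(sketch_handler_t h, void *arg, int pkt)
{
	assert(h != NULL);
	h(pkt, arg);
}

/* Deferred dispatch: only enqueue; returns -1 (drop) when the queue is full. */
int
sketch_dispatch_deferred(struct sketch_queue *q, int pkt)
{
	if (q->count == SKETCH_QLEN)
		return (-1);
	q->pkts[q->tail] = pkt;
	q->tail = (q->tail + 1) % SKETCH_QLEN;
	q->count++;
	return (0);
}

/* Worker side: drain the queue in FIFO order; returns packets processed. */
size_t
sketch_worker_drain(struct sketch_queue *q, sketch_handler_t h, void *arg)
{
	size_t n = 0;

	while (q->count > 0) {
		h(q->pkts[q->head], arg);
		q->head = (q->head + 1) % SKETCH_QLEN;
		q->count--;
		n++;
	}
	return (n);
}
```

With direct dispatch the handler has already run by the time the call returns. With deferred dispatch nothing is processed until a worker drains the queue, which is where the queueing and scheduling costs mentioned above come from.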

In FreeBSD 9.0 we've attempted to improve the vocabulary of expressible policies in netisr so that we can explore which work best in various scenarios, giving users more flexibility but also attempting to determine a better longer-term model. Ideally, as with the VM system, these features would be to some extent self-tuning, but we don't have enough information and experience to decide how best to do that yet.

>     NETISR_POLICY_FLOW    netisr should maintain flow ordering as defined by
>                           the mbuf header flow ID field.  If the protocol
>                           implements nh_m2flow, then netisr will query the
>                           protocol in the event that the mbuf doesn't have a
>                           flow ID, falling back on source ordering.
> 
>     NETISR_POLICY_CPU     netisr will entirely delegate all work placement
>                           decisions to the protocol, querying nh_m2cpuid for
>                           each packet.
> 
> _FLOW: description says that cpuid discovered by flow.
> _CPU: here decision to choose CPU is deligated to protocol. maybe it
> will be clear to name it as: NETISR_POLICY_PROTO ???

The name has to do with the nature of the information returned by the netisr protocol handler -- in the former case, the protocol returns a flow identifier, which is used by netisr to calculate an affinity. In the latter case, the protocol returns a CPU affinity directly.
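
A small sketch may help show the difference. It assumes a simple modulo mapping from flow ID to worker, which is only a stand-in for whatever netisr actually computes, and all identifiers (sketch_policy, sketch_packet, sketch_select_cpu) are hypothetical rather than kernel definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the two policies; not the kernel's constants. */
enum sketch_policy { SKETCH_POLICY_FLOW, SKETCH_POLICY_CPU };

struct sketch_packet {
	uint32_t flowid;	/* flow identifier returned by the protocol */
	unsigned cpu_hint;	/* CPU returned directly by the protocol */
};

/* Return the index of the worker that a packet would be handed to. */
unsigned
sketch_select_cpu(enum sketch_policy policy, const struct sketch_packet *p,
    unsigned ncpus)
{
	assert(ncpus > 0);
	if (policy == SKETCH_POLICY_FLOW) {
		/* The framework derives the affinity from the flow ID. */
		return (p->flowid % ncpus);
	}
	/* The protocol's answer is used as given (clamped to a valid index). */
	return (p->cpu_hint % ncpus);
}
```

For a packet with flowid 10 and cpu_hint 3 on a four-worker system, the flow policy picks worker 2 (10 modulo 4), while the CPU policy simply uses worker 3.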

> and BIG QUESTION: why you allow to somebody (flow, proto) to make any
> decisions??? That is wrong: because of bad their
> implementation/decision may cause to schedule packets only to some CPU.
> So one CPU will overloaded (0%idle) other will be free. (100%idle)

I think you're confusing policy and mechanism. The above KPIs are about providing the mechanism to implement a variety of policies. Many of the policies we are interested in are not yet implemented, or available only as patches. Keep in mind that workloads and systems are highly variable, with variable costs for work dispatch, etc. We run on high-end Intel servers, where individual CPUs tend to be very powerful but not all that plentiful, but also on embedded multi-threaded MIPS devices with many threads, each individually quite weak. Deferred dispatch is a better choice for the latter, where there are optimised handoff primitives to help avoid queueing overhead, whereas in the former case you really want NIC-backed work dispatch, which will generally mean you want direct dispatch with multiple ithreads (one per queue) rather than multiple netisr threads. Using deferred dispatch in Intel-style environments is generally unproductive, since high-end configurations will support multi-queue input already, and CPUs are quite powerful.

>> * Enforcing ordering limits the opportunity for concurrency, but maintains
>> * the strong ordering requirements found in some protocols, such as TCP.
> TCP do not require strong ordering requiremets!!! Maybe you mean UDP?

I think most people would disagree with this. Reordering TCP segments leads to extremely poor TCP behaviour -- there is an extensive research literature on this, and maintaining ordering for TCP flows is a critical network stack design goal.

> To get full concurency you must put new flowid to free CPU and
> remember cpuid for that flow.

Stateful assignment of flows to CPUs is of significant interest to us, although currently we only support hash-based assignment without state. In large part, that decision is a good one, as multi-queue network cards are highly variable in terms of the size of their state tables for offloading flow-specific affinity policies. For example, lower-end 10Gbps cards may support state tables with 32 entries. High-end cards may support state tables with tens of thousands of entries.

> Just hash packetflow to then number of thrreads: net.isr.numthreads
> nws_array[flowid]= hash( flowid, sourceid, ifp->if_index, source )
> if( cpuload( nws_array[flowid] )>99 )
> nws_array[flowid]++;  //queue packet to other CPU
> 
> that will be just ten lines of conde instead of 50 in your case.

We support a more complex KPI because we need to support future policies that are more complex. For example, there are out-of-tree changes that align TCP-level and netisr-level per-CPU data structures and affinity with NIC RSS support. The algorithm you've suggested above explicitly introduces reordering, which would significantly damage network performance, even though it appears to balance CPU load better.
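
The reordering risk can be seen in a small simulation. This is a sketch under stated assumptions: four workers, a modulo hash, and a load figure sampled once per packet. The function names (sketch_place_stateless, sketch_place_load_aware, sketch_workers_used) are hypothetical, and the load-aware variant is only my reading of the quoted pseudocode.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_NWORKERS 4	/* hypothetical worker count */

/* Stateless placement: the worker depends only on the flow ID. */
unsigned
sketch_place_stateless(uint32_t flowid)
{
	return (flowid % SKETCH_NWORKERS);
}

/* Load-aware placement: spill to the next worker when the home one is busy. */
unsigned
sketch_place_load_aware(uint32_t flowid, const unsigned load[SKETCH_NWORKERS])
{
	unsigned w = flowid % SKETCH_NWORKERS;

	if (load[w] > 99)
		w = (w + 1) % SKETCH_NWORKERS;
	return (w);
}

/*
 * Count the distinct workers that successive packets of one flow land on,
 * given one load sample per packet.  A result above 1 means packets of the
 * same flow sit on different queues, so their relative order is no longer
 * guaranteed.
 */
unsigned
sketch_workers_used(uint32_t flowid,
    const unsigned load_samples[][SKETCH_NWORKERS], unsigned nsamples,
    int load_aware)
{
	unsigned mask = 0, used = 0, i;

	assert(nsamples > 0);
	for (i = 0; i < nsamples; i++) {
		unsigned w = load_aware ?
		    sketch_place_load_aware(flowid, load_samples[i]) :
		    sketch_place_stateless(flowid);
		mask |= 1u << w;
	}
	for (i = 0; i < SKETCH_NWORKERS; i++)
		used += (mask >> i) & 1u;
	return (used);
}
```

With the stateless hash, every packet of a flow lands on a single FIFO queue, so ordering follows from the queue itself. Once placement depends on instantaneous load, two back-to-back segments of the same TCP connection can end up on different workers and complete in either order.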

> Also nitice you have:
> /*
> * Utility routines for protocols that implement their own mapping of flows
> * to CPUs.
> */
> u_int
> netisr_get_cpucount(void)
> {
> 
>        return (nws_count);
> }
> 
> but you do not use it! that break incapsulation.

This is a public symbol for use outside of the netisr framework -- for example, in the uncommitted RSS code.

> Also I want to ask you: help me please where I can find documention
> about scheduling netisr and full packetflow through kernel:
> packetinput->kernel->packetoutput
> but more description what is going on with packet while it is passing
> router.

Unfortunately, this code is currently largely self-documenting. The Stevens books are getting quite outdated, as is McKusick/Neville-Neil -- however, they at least offer structural guides which may be of use to you. Refreshes of these books would be extremely helpful.

Robert