Advice on a multithreaded netisr patch?
julian at elischer.org
Tue Apr 7 09:47:42 PDT 2009
Barney Cordoba wrote:
> --- On Mon, 4/6/09, Robert Watson <rwatson at FreeBSD.org> wrote:
>> From: Robert Watson <rwatson at FreeBSD.org>
>> Subject: Re: Advice on a multithreaded netisr patch?
>> To: "Ivan Voras" <ivoras at freebsd.org>
>> Cc: freebsd-net at freebsd.org
>> Date: Monday, April 6, 2009, 7:59 AM
>> On Mon, 6 Apr 2009, Ivan Voras wrote:
>>>>> I'd like to understand more. If (in netisr) I have a mbuf with
>>>>> headers, is this data already transferred from the card or is it
>>>>> magically "not here yet"?
>>>> A lot depends on the details of the card and driver. The driver
>>>> will take cache misses on the descriptor ring entry, if it's not
>>>> already in cache, and the link layer will take a cache miss on the
>>>> front of the ethernet frame in the cluster pointed to by the mbuf
>>>> header as part of its demux. What happens next depends on your
>>>> dispatch model and cache line size. Let's make a few simplifying
>>>> assumptions that are mostly true:
>>> So, a mbuf can reference data not yet copied from the NIC hardware?
>>> I'm specifically trying to understand what m_pullup() does.
>> I think we're talking slightly at cross purposes.
>> There are two transfers of interest:
>> (1) DMA of the packet data to main memory from the NIC
>> (2) Servicing of CPU cache misses to access data in main memory
>> By the time you receive an interrupt, the DMA is complete,
>> so once you believe a packet referenced by the descriptor
>> ring is done, you don't have to wait for DMA. However,
>> the packet data is in main memory rather than your CPU
>> cache, so you'll need to take a cache miss in order to
>> retrieve it. You don't want to prefetch before you know
>> the packet data is there, or you may prefetch stale data
>> from the previous packet sent or received from the cluster.
>> m_pullup() has to do with mbuf chain memory contiguity
>> during packet processing. The usual usage is something
>> along the following lines:
>> 	struct whatever *w;
>>
>> 	m = m_pullup(m, sizeof(*w));
>> 	if (m == NULL)
>> 		return;
>> 	w = mtod(m, struct whatever *);
>> m_pullup() here ensures that the first sizeof(*w) bytes of mbuf data
>> are contiguously stored, so that w, cast from m's data pointer, will
>> point at a complete structure we can use to interpret packet data. In
>> the common case in the receive path, m_pullup() should be a no-op,
>> since almost all drivers receive data in a single cluster.
>> However, there are cases where it might not happen, such as loopback
>> traffic where unusual encapsulation is used, leading to a call to
>> M_PREPEND() that inserts a new mbuf on the front of the chain, which
>> is later m_defrag()'d, leading to a higher-level header crossing a
>> boundary or the like.
>> This issue is almost entirely independent of things like the cache
>> line miss issue, unless you hit the uncommon case of having to do
>> work in m_pullup(), in which case life gets more expensive.
>> It would be useful to use DTrace to profile a number of the
>> workfull m_foo() functions to make sure we're not
>> hitting them in normal workloads, btw.
>>>>> As the card and the OS can already process many packets per
>>>>> second for something fairly complex as routing
>>>>> (http://www.tancsa.com/blast.html), and TCP chokes swi:net at
>>>>> 100% of a core, isn't this an indication there's certainly more
>>>>> space for improvement even with single-queue old-fashioned NICs?
>>>> Maybe. It depends on the relative costs of local processing vs
>>>> redistributing the work, which involves schedulers, IPIs,
>>>> additional cache misses, lock contention, and so on. This means
>>>> there's a period where it can't possibly be a win, and then at
>>>> some point it's a win as long as the stack scales. This is
>>>> essentially the usual trade-off in using threads and parallelism:
>>>> does the benefit of multiple parallel execution units make up for
>>>> the overheads of synchronization and data migration?
>>> Do you have any idea at all why I'm seeing the weird difference
>>> between netstat packets per second (250,000) and my application's
>>> TCP performance (< 1,000 pps)? Summary: each packet is guaranteed
>>> to be a whole message causing a transaction in the application -
>>> without the changes I see pps almost identical to tps. Even if the
>>> source of netstat statistics somehow manages to count packets
>>> multiple times (I don't see how that can happen), no relation can
>>> describe differences this huge. It almost looks like something in
>>> the upper layers is discarding packets (also not likely: TCP
>>> timeouts would occur and the application wouldn't be able to push
>>> 250,000 pps) - but what? Where to look?
>> Is this for the loopback workload? If so, remember that there may be
>> some other things going on:
>>
>> - Every packet is processed at least two times: once when sent, and
>>   then again when it's received.
>>
>> - A TCP segment will need to be ACK'd, so if you're sending data in
>>   chunks in one direction, the ACKs will not be piggy-backed on
>>   existing data transfers, and instead be sent independently,
>>   hitting the network stack two more times.
>>
>> - Remember that TCP works to expand its window, and then maintains
>>   the highest performance it can by bumping up against the top of
>>   available bandwidth continuously. This involves detecting buffer
>>   limits by generating packets that can't be sent, adding to the
>>   packet count. With loopback traffic, the drop point occurs when
>>   you exceed the size of the netisr's queue for IP, so you might try
>>   bumping that from the default to something much larger.
>> And nothing beats using tcpdump -- have you tried
>> tcpdumping the loopback to see what is actually being sent?
>> If not, that's always educational -- perhaps something
>> weird is going on with delayed ACKs, etc.
>>> You mean for the general code? I purposely don't lock my
>>> statistics variables because I'm not that interested in exact
>>> numbers (orders of magnitude are relevant). As far as I understand,
>>> unlocked "x++" should be trivially fast in this case?
>> No. x++ is massively slow if executed in parallel across
>> many cores on a variable in a single cache line. See my
>> recent commit to kern_tc.c for an example: the updating of
>> trivial statistics for the kernel time calls reduced 30m
>> syscalls/second to 3m syscalls/second due to heavy
>> contention on the cache line holding the statistic. One of
>> my goals for 8.0 is to fix this problem for IP and TCP
>> layers, and ideally also ifnet but we'll see. We should
>> be maintaining those stats per-CPU and then aggregating to
>> report them to userspace. This is what we already do for a
>> number of system stats -- UMA and kernel malloc, syscall and
>> trap counters, etc.
>>>> - Use cpuset to pin ithreads, the netisr, and whatever else, to
>>>>   specific CPUs so that they don't migrate, and if your system
>>>>   uses HTT, experiment with pinning the ithread and the netisr on
>>>>   different threads on the same core, or at least, different cores
>>>>   on the same die.
>>> I'm using em hardware; I still think there's a possibility I'm
>>> fighting the driver in some cases but this has priority #2.
>> Have you tried LOCK_PROFILING? It would quickly tell you
>> if driver locks were a source of significant contention. It
>> works quite well...
> When I enabled LOCK_PROFILING my side modules, such as if_igb,
> stopped working. It seems that the ifnet structure or something
> changed with that option enabled. Is there a way to sync this without
> having to integrate everything into a specific kernel build?
No, I don't think there is any other way.
Last time I checked, the mutex structure changed size, which meant that
almost everything else that included a mutex changed size.
That may not be true now, but I haven't checked.