svn commit: r248196 - head/sys/nfs

Gleb Smirnoff glebius at FreeBSD.org
Tue Mar 12 19:49:47 UTC 2013


  Andre,

On Tue, Mar 12, 2013 at 05:33:04PM +0100, Andre Oppermann wrote:
A> > We have very common case when we allocate either mbuf or mbuf + cluster,
A> > depending on size. Everywhere this is made by hand, but can be substituted
A> > with m_get2(len, ..., M_NOJUMBO);
A> 
A> I guess what I'm trying to say is that not wanting jumbo > PAGE_SIZE is
A> normal and shouldn't be specified all the time.
A> 
A> This makes the API look like this:
A> 
A>   m_get2(len, ..., 0);	/* w/o flags I get at most MJUMPAGESIZE */
A> 
A> If someone really, really, really knows what he is doing he can say
A> he wants jumbo > PAGE_SIZE returned with M_JUMBOOK or such.  However
A> IMHO even that shouldn't be offered and m_getm2() should be used for
A> a chain.

Once again: there are cases where a chain is not anticipated, but a single
contiguous mbuf is.
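
For the record, the pattern I mean looks roughly like this (a sketch only;
the helper names are made up, the mbuf calls are the real ones):

  #include <sys/param.h>
  #include <sys/mbuf.h>

  /* The "by hand" pattern: the caller picks mbuf vs. mbuf + cluster by size. */
  static struct mbuf *
  alloc_pkt_by_hand(int len)
  {

  	/* Quietly assumes len <= MCLBYTES; otherwise a chain is built by hand. */
  	if (len > MHLEN)
  		return (m_getcl(M_NOWAIT, MT_DATA, M_PKTHDR));
  	return (m_gethdr(M_NOWAIT, MT_DATA));
  }

  /*
   * The same thing via m_get2(): one call, and the allocator picks the
   * smallest buffer that fits.  How far up the cluster sizes it is
   * allowed to go is exactly what we are arguing about.
   */
  static struct mbuf *
  alloc_pkt_via_m_get2(int len)
  {

  	return (m_get2(len, M_NOWAIT, MT_DATA, M_PKTHDR));
  }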

A> > A> However I think that m_get2() should never ever even try to attempt to
A> > A> allocate mbuf clusters larger than PAGE_SIZE.  Not even with flags.
A> > A>
A> > A> All mbufs > PAGE_SIZE should be exclusively and only ever be used by drivers
A> > A> for NIC's with "challenged" DMA engines.  Possibly only available through a
A> > A> dedicated API to prevent all other uses of it.
A> >
A> > Have you done any benchmarking that proves that scatter-gather on the level of
A> > busdma is any worse than chaining on mbuf level?
A> 
A> The problem is different.  With our current jumbo mbufs > PAGE_SIZE there
A> isn't any scatter-gather at busdma level because they are contiguous at
A> physical *and* KVA level.  Allocating such jumbo mbufs shifts the burden
A> of mbuf chains to the VM and pmap layer in trying to come up with such
A> contiguous patches of physical memory.  This fails quickly after some
A> activity and memory fragmentation on the machine as we've seen in recent
A> days even with 96GB of RAM available.  It gets worse the more load the
A> machine has.  Which is exactly what we *don't* want.

I somehow missed r174247. How did this work from the time jumbo clusters
were introduced until r174247 (3 years)?

AFAIK, it worked fine back when jumbos were not physically contiguous. Most
modern NICs can use several descriptors per packet, and that is what we do
when using PAGE_SIZE clusters, but we have to set that up manually. With a
virtually contiguous single mbuf we can just call bus_dmamap_load_mbuf_sg()
and not care about physical contiguity in the stack or in the upper part of
a driver.

I should probably consult gallatin@ and yongari@ on this.
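
To illustrate what I mean by setting it up manually versus letting busdma do
the work, here is a condensed, hypothetical transmit-encap path. The softc
and descriptor layout are invented for the example; bus_dmamap_load_mbuf_sg()
is the real busdma entry point:

  #include <sys/param.h>
  #include <sys/bus.h>
  #include <sys/mbuf.h>
  #include <machine/bus.h>

  #define	MY_MAXSEGS	32	/* whatever the NIC's DMA engine accepts */

  struct my_txdesc {		/* invented descriptor layout */
  	uint64_t	td_addr;
  	uint32_t	td_len;
  	uint32_t	td_flags;
  };

  struct my_softc {		/* invented softc fields */
  	bus_dma_tag_t	  sc_tx_tag;
  	bus_dmamap_t	  sc_tx_map;
  	struct my_txdesc *sc_txring;
  	int		  sc_txprod;
  };

  static int
  my_encap(struct my_softc *sc, struct mbuf *m)
  {
  	bus_dma_segment_t segs[MY_MAXSEGS];
  	int error, i, nsegs;

  	/*
  	 * busdma walks the mbuf (or chain) and produces one segment per
  	 * physically contiguous piece, so the driver does not care whether
  	 * the buffer itself was physically contiguous.
  	 */
  	error = bus_dmamap_load_mbuf_sg(sc->sc_tx_tag, sc->sc_tx_map, m,
  	    segs, &nsegs, BUS_DMA_NOWAIT);
  	if (error != 0)
  		return (error);

  	/* One hardware descriptor per segment. */
  	for (i = 0; i < nsegs; i++) {
  		sc->sc_txring[sc->sc_txprod].td_addr = segs[i].ds_addr;
  		sc->sc_txring[sc->sc_txprod].td_len = segs[i].ds_len;
  		sc->sc_txprod++;
  	}
  	return (0);
  }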

A> > Dealing with contiguous in virtual memory mbuf is handy, for protocols that
A> > look through entire payload, for example pfsync. I guess NFS may also benefit
A> > from that.
A> 
A> Of course it is handy.  However that carries other tradeoffs, some significant,
A> in other parts of the system.  And then for incoming packets it depends on the
A> MTU size.  For NFS, as far as I've read through the code today, the control
A> messages tend to be rather small.  The vast bulk of the data is transported
A> between mbuf and VFS/filesystem.
A> 
A> > P.S. Ok about the patch?
A> 
A> No.  m_getm2() doesn't need the flag at all.  PAGE_SIZE mbufs are always
A> good.

Let me repeat: there is a lot of code that does handmade allocation of an
mbuf chain of arbitrary length using mbufs and regular clusters. This code
could be trimmed down to use m_getm2(), if we could restrict it to avoid
page-sized clusters. I don't have time to dig deeper into that code and to
analyze and test whether it can handle page-sized clusters in its chains.

Example: netipsec/key.c:key_alloc_mbuf().
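
For those who haven't looked, the pattern there is roughly the following (a
paraphrase of the idea, not a literal copy of key_alloc_mbuf()), and the
second function is what such code could collapse to:

  #include <sys/param.h>
  #include <sys/mbuf.h>

  /* Rough paraphrase of the handmade-chain pattern. */
  static struct mbuf *
  chain_by_hand(int len)
  {
  	struct mbuf *m, *n, *tail;

  	m = tail = NULL;
  	while (len > 0) {
  		if (len > MLEN)
  			n = m_getcl(M_NOWAIT, MT_DATA, 0);
  		else
  			n = m_get(M_NOWAIT, MT_DATA);
  		if (n == NULL) {
  			m_freem(m);
  			return (NULL);
  		}
  		n->m_len = (n->m_flags & M_EXT) ? MCLBYTES : MLEN;
  		if (n->m_len > len)
  			n->m_len = len;
  		len -= n->m_len;
  		if (tail == NULL)
  			m = n;
  		else
  			tail->m_next = n;
  		tail = n;
  	}
  	return (m);
  }

  /*
   * What it could collapse to.  The commented-out flag is the restriction
   * proposed in this thread (keep the chain to plain 2k clusters); it is
   * not an existing mbuf flag.
   */
  static struct mbuf *
  chain_via_m_getm2(int len)
  {

  	return (m_getm2(NULL, len, M_NOWAIT, MT_DATA, 0 /* | M_NOJUMBO */));
  }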

A> Calling m_get2() without flag should return at most a PAGE_SIZE
A> mbuf.

This is not logical. A default cap that sits in the middle of the range of
possible cluster sizes is arbitrary.

A> The (ab)use of M_PROTO1|2 flags is icky and may conflict with
A> protocol specific uses.

Please review the patch. The flag isn't stored anywhere; it is just an
extension of the m_getm2() and m_get2() API.
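
To be clear about what the flag is meant to do, the size selection would
look something like this. Illustrative only, not the literal patch:
M_NOJUMBO is a placeholder definition here, while the cluster-size
constants are the real ones.

  #include <sys/param.h>
  #include <sys/mbuf.h>

  #define	M_NOJUMBO	0x80000000	/* placeholder, not a real mbuf flag */

  /*
   * Pick the cluster size to use for a given length, or return 0 if the
   * caller must fall back to building a chain.
   */
  static int
  pick_cluster_size(int size, int flags)
  {

  	if (size <= MCLBYTES)
  		return (MCLBYTES);
  	if (flags & M_NOJUMBO)
  		return (0);		/* stick to plain 2k clusters */
  	if (size <= MJUMPAGESIZE)
  		return (MJUMPAGESIZE);
  	if (size <= MJUM9BYTES)
  		return (MJUM9BYTES);
  	if (size <= MJUM16BYTES)
  		return (MJUM16BYTES);
  	return (0);
  }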

-- 
Totus tuus, Glebius.

