svn commit: r248196 - head/sys/nfs

Andre Oppermann andre at freebsd.org
Tue Mar 12 21:19:10 UTC 2013


On 12.03.2013 20:49, Gleb Smirnoff wrote:
>    Andre,
>
> On Tue, Mar 12, 2013 at 05:33:04PM +0100, Andre Oppermann wrote:
> A> > We have a very common case where we allocate either an mbuf or an mbuf +
> A> > cluster, depending on size. Everywhere this is done by hand, but it can be
> A> > substituted with m_get2(len, ..., M_NOJUMBO);
> A>
> A> I guess what I'm trying to say is that not wanting jumbo > PAGE_SIZE is
> A> normal and shouldn't be specified all the time.
> A>
> A> This makes the API look like this:
> A>
> A>   m_get2(len, ..., 0);	/* w/o flags I get at most MJUMPAGESIZE */
> A>
> A> If someone really, really, really knows what he is doing he can say
> A> he wants jumbo > PAGE_SIZE returned with M_JUMBOOK or such.  However
> A> IMHO even that shouldn't be offered and m_getm2() should be used for
> A> a chain.
>
> Once again: there are cases when a chain is not anticipated, but a single
> contiguous mbuf is.

My statement was at the design level.  So far only 2K-sized contiguous mbufs
(as in classic clusters) were available.  This can be relaxed to 4K page-sized
mbuf clusters without any trouble.  Not that we have any consumers today that
require more than 2K clusters.

What I'm saying is that everything that requires clusters larger than 4K is
broken by design.  It should be designed differently, to use chains of mbufs
of no more than 4K each.
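
For illustration, a minimal sketch of what "designed for chains" looks like
on the consumer side (the function and buffer size below are hypothetical,
not taken from any existing driver or protocol).  Code that pulls data out
with m_copydata() never has to care whether the chain is backed by 2K
clusters, PAGE_SIZE jumbos, or anything else:

#include <sys/param.h>
#include <sys/mbuf.h>

/*
 * Hypothetical consumer: process 'len' payload bytes starting at byte
 * offset 'off' of a packet, regardless of how the chain is segmented.
 */
static void
consume_payload(struct mbuf *m, int off, int len)
{
	char buf[64];
	int n;

	while (len > 0) {
		n = MIN(len, (int)sizeof(buf));
		/* m_copydata() copies across mbuf boundaries as needed. */
		m_copydata(m, off, n, buf);
		/* ... operate on buf[0..n) ... */
		off += n;
		len -= n;
	}
}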

There are architectures whose page size is larger than 4K, but the majority
are at 4K.

> A> > A> However I think that m_get2() should never ever even attempt to
> A> > A> allocate mbuf clusters larger than PAGE_SIZE.  Not even with flags.
> A> > A>
> A> > A> All mbufs > PAGE_SIZE should only ever be used by drivers
> A> > A> for NICs with "challenged" DMA engines.  Possibly only available through a
> A> > A> dedicated API to prevent all other uses.
> A> >
> A> > Have you done any benchmarking that proves that scatter-gather on the level of
> A> > busdma is any worse than chaining on mbuf level?
> A>
> A> The problem is different.  With our current jumbo mbufs > PAGE_SIZE there
> A> isn't any scatter-gather at busdma level because they are contiguous at
> A> physical *and* KVA level.  Allocating such jumbo mbufs shifts the burden
> A> of mbuf chains to the VM and pmap layer in trying to come up with such
> A> contiguous patches of physical memory.  This fails quickly after some
> A> activity and memory fragmentation on the machine as we've seen in recent
> A> days even with 96GB of RAM available.  It gets worse the more load the
> A> machine has.  Which is exactly what we *don't* want.
>
> I somehow missed r174247.  How did that work from when jumbo clusters were
> introduced up until r174247 (3 years)?

jumbo != jumbo.  They are simply called jumbo because they are larger than
the traditional 2K cluster.  The PAGE_SIZE jumbo is special in that it is
the largest mbuf that doesn't require any special VM or physical trickery
to be allocated.  Thus it is available as long as you have a single free
page in the system.  A page can also be divided into two classic 2K clusters.
The only difference compared to a single 4K PAGE_SIZE jumbo is that covering
the page that way requires two 256B mbufs as descriptors instead of one.
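
To make the bookkeeping difference concrete, here is a hedged sketch (the
helper names are made up, and lengths are left unset since this only
illustrates the allocation) of the two ways to back 4K of data: two classic
2K clusters, each needing its own mbuf as descriptor, versus one PAGE_SIZE
jumbo needing a single mbuf:

#include <sys/param.h>
#include <sys/mbuf.h>

/* Variant A: 4K of space as two classic 2K clusters -> two mbuf descriptors. */
static struct mbuf *
four_k_as_two_clusters(void)
{
	struct mbuf *m1, *m2;

	m1 = m_getcl(M_NOWAIT, MT_DATA, M_PKTHDR);	/* mbuf + 2K cluster */
	m2 = m_getcl(M_NOWAIT, MT_DATA, 0);		/* mbuf + 2K cluster */
	if (m1 == NULL || m2 == NULL) {
		m_freem(m1);
		m_freem(m2);
		return (NULL);
	}
	m1->m_next = m2;	/* chain the two 2K buffers together */
	return (m1);
}

/* Variant B: the same 4K as one PAGE_SIZE jumbo -> a single mbuf descriptor. */
static struct mbuf *
four_k_as_page_jumbo(void)
{
	return (m_getjcl(M_NOWAIT, MT_DATA, M_PKTHDR, MJUMPAGESIZE));
}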

> AFAIK, it worked fine before, when jumbos were not physically contiguous.  Most
> modern NICs can use several descriptors per packet, and that's what
> we do when using PAGE_SIZE clusters.  But we set that up manually.  With a
> virtually contiguous single mbuf we can just do bus_dmamap_load_mbuf_sg() and
> not care about physical contiguity in the stack and the upper part of a driver.

Jumbos have always been physically contiguous.  That was the very reason for
their existence.  They were invented to allow early NICs with limited
DMA capabilities to handle jumbo Ethernet frames.  Today they should only
be used for DMA-challenged NICs and nothing else.

For PAGE_SIZE jumbos only one descriptor is used.  A page is the
largest physical unit on the CPU architecture that is always contiguous.

What you apparently want is a new type of jumbo mbuf that is contiguous only
in virtual address space, not physically.  In theory that could be done,
but it's totally different from the jumbo mbufs we currently have.
While it gets rid of the severe physical memory fragmentation problem, the
KVA fragmentation problem remains lurking in the background.

Actually you could probably even do arbitrarily sized, KVM-contiguous-only
mbufs on demand.  The overhead is then simply shifted to the VM from the
mbuf user, who otherwise has to work with mbuf chains.

> I should probably consult gallatin@ and youngari@ on this.
>
> A> > Dealing with an mbuf that is contiguous in virtual memory is handy for
> A> > protocols that look through the entire payload, for example pfsync.  I guess
> A> > NFS may also benefit from that.
> A>
> A> Of course it is handy.  However that carries other tradeoffs, some significant,
> A> in other parts of the system.  And then for incoming packets it depends on the
> A> MTU size.  For NFS, as far as I've read through the code today, the control
> A> messages tend to be rather small.  The vast bulk of the data is transported
> A> between mbuf and VFS/filesystem.
> A>
> A> > P.S. Ok about the patch?
> A>
> A> No.  m_getm2() doesn't need the flag at all.  PAGE_SIZE mbufs are always
> A> good.
>
> Let me repeat: there is a lot of code that does hand-made allocation of
> an mbuf chain of arbitrary length using mbufs and common clusters.  This
> code can be cut down to use m_getm2(), if we can restrict it to avoid
> page-size clusters.  I don't have time to dig deeper into the code and
> analyze and test whether it can support page-sized clusters in chains.

m_getm2() can be used in any such case and doesn't have to avoid PAGE_SIZE
clusters.  PAGE_SIZE jumbo clusters are fine; anything larger than PAGE_SIZE
is not.

When the code is able to work with mbuf chains, the exact size of each cluster
isn't important anymore.  We can use the optimal size.
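
As a rough sketch of that point (the before/after below is illustrative and
not taken from any particular file), the usual hand-rolled mbuf-or-cluster
loop collapses into a single m_getm2() call, which picks the cluster sizes
itself:

#include <sys/param.h>
#include <sys/mbuf.h>

/*
 * Hand-rolled pattern often found in older code: cover 'len' bytes with
 * a chain of plain mbufs or 2K clusters built by hand.
 */
static struct mbuf *
chain_by_hand(int len)
{
	struct mbuf *top = NULL, *m, **mp = &top;

	while (len > 0) {
		if (len >= MINCLSIZE)
			m = m_getcl(M_NOWAIT, MT_DATA,
			    top == NULL ? M_PKTHDR : 0);
		else if (top == NULL)
			m = m_gethdr(M_NOWAIT, MT_DATA);
		else
			m = m_get(M_NOWAIT, MT_DATA);
		if (m == NULL) {
			m_freem(top);
			return (NULL);
		}
		len -= (m->m_flags & M_EXT) ? MCLBYTES :
		    ((m->m_flags & M_PKTHDR) ? MHLEN : MLEN);
		*mp = m;
		mp = &m->m_next;
	}
	return (top);
}

/* The same thing via m_getm2(); cluster sizes are chosen internally. */
static struct mbuf *
chain_by_api(int len)
{
	return (m_getm2(NULL, len, M_NOWAIT, MT_DATA, M_PKTHDR));
}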

> Example: netipsec/key.c:key_alloc_mbuf().
>
> A> Calling m_get2() without flag should return at most a PAGE_SIZE
> A> mbuf.
>
> This is not logical. A default limit in the middle of possible values
> isn't logical.

It is very logical, based on the fact that a page-sized cluster can always
be allocated as long as there is a free page available in the system.
This isn't true for jumbo clusters larger than page size due to the
additional constraints.  The largest "natural" size of an mbuf (jumbo)
cluster is equal to the page size, typically 4K.
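
A hedged sketch of the default being argued for here (this illustrates the
idea, it is not the committed m_get2() code): without any special flag the
size selection walks up from a plain mbuf to a 2K cluster to a PAGE_SIZE
jumbo, and stops there.

#include <sys/param.h>
#include <sys/mbuf.h>

/*
 * Illustrative size selection for a single contiguous buffer:
 * plain mbuf -> 2K cluster -> PAGE_SIZE jumbo, and nothing beyond that.
 * Anything bigger should be handled as a chain, e.g. via m_getm2().
 */
static struct mbuf *
get_contig(int size, int how)
{
	if (size <= MHLEN)
		return (m_gethdr(how, MT_DATA));
	if (size <= MCLBYTES)
		return (m_getcl(how, MT_DATA, M_PKTHDR));
	if (size <= MJUMPAGESIZE)
		return (m_getjcl(how, MT_DATA, M_PKTHDR, MJUMPAGESIZE));
	return (NULL);	/* too large for a single "natural" buffer */
}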

> A> The (ab)use of M_PROTO1|2 flags is icky and may conflict with
> A> protocol specific uses.
>
> Please review the patch.  The flag isn't stored anywhere; it is just an
> extension of the m_getm2() and m_get2() API.

OK.  However it's the wrong way around.  Not allocating jumbos larger than
page size should be the default when no flags are set.

-- 
Andre


