MQ Patch.

Andre Oppermann andre at freebsd.org
Tue Oct 29 20:03:55 UTC 2013


On 29.10.2013 20:35, Randall Stewart wrote:
>
> On Oct 29, 2013, at 2:30 PM, Andre Oppermann wrote:
>
>> On 29.10.2013 11:50, Randall Stewart wrote:
>>> Hi:
>>>
>>> As discussed at vBSDcon with andre/emaste and gnn, I am sending
>>> this patch out to all of you ;-)
>>
>> I wasn't at vBSDcon but it's good that you're sending it (again). ;)
>>
>>> I have previously sent it to gnn, andre, jhb, rwatson, and several other
>>> of the usual suspects (as gnn put it) and received dead silence.
>>
>> Sorry 'bout that.  Too many things going on recently.
>>
>>> What does this patch do?
>>>
>>> Well, it adds the ability to do multi-queue at the driver level. Basically
>>> any driver that uses the new interface gets under it N queues (default
>>> is 8) for each physical transmit ring it has. The driver picks up
>>> its queue 0 first, then queue 1 .. up to the max.
>>
>> To make sure I understand this correctly: there are 8 soft-queues for each real
>> transmit ring, correct?  And the driver will dequeue the lowest numbered
>> queue for as long as there are packets in it.  Like a hierarchical strict
>> queuing discipline.
>>
>> This is prone to head of line blocking and starvation by higher priority
>> queues.  May become a big problem under adverse traffic patterns.
>
> That's the whole idea of QOS.. you take and prioritize your traffic
> if you don't have enough b/w.

That is understood.  In most cases it's done on a WFQ basis though and
strict priority is limited to realtime (VoIP) traffic and also bound
overall not to monopolize the entire link if something goes wrong.
Almost all documentation from C and J recommends against unbounded
strict priority scheduling for that reason.
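
A minimal sketch of the difference, not taken from the patch: three
soft-queues, each backlogged with 100 packets, served with a budget of 100
transmit slots. Weighted round-robin stands in here for WFQ (it is a simpler
relative, not WFQ itself), and the function names and the 4:2:1 weights are
assumptions chosen for illustration.

```python
from collections import deque


def strict_priority(queues, budget):
    """Always serve the lowest-numbered non-empty queue first."""
    sent = [0] * len(queues)
    for _ in range(budget):
        for i, q in enumerate(queues):
            if q:
                q.popleft()
                sent[i] += 1
                break
    return sent


def weighted_round_robin(queues, weights, budget):
    """Serve up to weights[i] packets from queue i in each round."""
    sent = [0] * len(queues)
    remaining = budget
    while remaining > 0 and any(queues):
        for i, q in enumerate(queues):
            for _ in range(weights[i]):
                if remaining == 0 or not q:
                    break
                q.popleft()
                sent[i] += 1
                remaining -= 1
    return sent


def make_queues(n, depth):
    """Build n queues that each hold `depth` dummy packets."""
    return [deque(range(depth)) for _ in range(n)]


sp = strict_priority(make_queues(3, 100), budget=100)
wrr = weighted_round_robin(make_queues(3, 100), weights=[4, 2, 1], budget=100)
print("strict priority:", sp)       # queues 1 and 2 get nothing
print("weighted round-robin:", wrr)  # every queue makes some progress
```

With a persistent backlog in queue 0, strict priority never reaches the
lower queues, while the weighted scheme still gives them a share.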

> The guys at the bottom get none..

I wonder how useful an 8 level strict priority actually can be under
load for everything below level 1.  Normally strategic packet loss
as in RED or its more efficient variants together with some WFQ scheme
signals the senders not to increase pace, or actually to slow down a
bit if the link is at capacity.
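
For reference, the core of classic RED can be sketched as follows. This
covers only the basic linear drop curve over a smoothed queue length; it
leaves out the refinement that spaces drops by the count of packets since
the last drop, and the thresholds and weight used below are illustrative
assumptions, not recommended values.

```python
def ewma(avg, sample, weight):
    """Exponentially weighted moving average of the queue length."""
    return (1.0 - weight) * avg + weight * sample


def red_drop_probability(avg_qlen, min_th, max_th, max_p):
    """Drop probability as a function of the averaged queue length."""
    if avg_qlen < min_th:
        return 0.0            # short queue: never drop
    if avg_qlen >= max_th:
        return 1.0            # long queue: always drop
    # In between, the probability rises linearly from 0 to max_p.
    return max_p * (avg_qlen - min_th) / (max_th - min_th)


# Assumed parameters: thresholds in packets, max_p as a fraction.
avg = 0.0
for qlen in (40, 40, 40, 40):
    avg = ewma(avg, qlen, weight=0.5)
p = red_drop_probability(avg, min_th=10, max_th=30, max_p=0.1)
```

Dropping a small fraction of packets early is what tells the senders to
back off before the queue is completely full.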

In practice I've never seen a case where full starvation of lower classes
made any sense.  You'd want at least some packets to go through every now
and then even in scavenger class.

> If you don't want it, you can either turn QOS off.. i.e. let
> everything fall to the bottom bucket. Or even set the number
> of queues to 1, and then nothing changes 1:1 queues to transmit-ring

The default setting probably should be the lowest priority available
and then only have the more important stuff get a higher level rather
than the other way around.

>>> This allows you to prioritize packets. Also in here is the start of some
>>> work I will be doing for AQM.. think either Pi or Codel ;-)
>>>
>>> Right now that's pretty simple and just (in a few drivers) has the ability
>>> to limit the amount of data on the ring… which can help reduce buffer
>>> bloat. That needs to be refined into a lot more.
>>
>> We actually have two queues, the soft-queue and the hardware ring which
>> both can be rather large leading to various issues as you mention.
>
>
> Which is why I first of all set the soft-queue default at 64.. That in
> some ways is still big.

If it's MTU sized packets it should be manageable.  If it's TSO chains
though...

> In order to get rid of the hard-queue you really just have to limit
> how much you put in. I have some hooks in for igb here (and em) that
> do this but its just a first step. The right thing (long term) is
> to go to an AQM like Codel or Pi.

I actually wonder if there is any benefit in soft-queuing at all,
except for the multiple-writer concurrency situation.  The DMA rings
are deep enough already.  If they are full just drop the packet without
tacking another soft-queue at the back of it.
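
As a toy model of that idea (a sketch only; `TxRing` and its methods are
hypothetical names, not a driver interface): the ring has a fixed number of
slots, and a packet that arrives when all slots are taken is dropped on the
spot instead of being parked in a second queue.

```python
class TxRing:
    """Fixed-size transmit ring with tail drop and no soft-queue behind it."""

    def __init__(self, slots):
        self.slots = slots
        self.pending = []   # descriptors handed to the "hardware"
        self.drops = 0

    def transmit(self, pkt):
        """Place pkt on the ring, or drop it if the ring is full."""
        if len(self.pending) >= self.slots:
            self.drops += 1
            return False
        self.pending.append(pkt)
        return True

    def complete(self, n):
        """Simulate the hardware finishing the n oldest descriptors."""
        done = self.pending[:n]
        del self.pending[:n]
        return done


ring = TxRing(slots=4)
results = [ring.transmit(i) for i in range(6)]  # last two are dropped
```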

> Pi would give you coverage of both queues at ingress to the first one (thinking
> of a single queue model)
>
> Codel can only handle the soft-> hard queue transition.

Yup.

> But Pi has the standard Cisco patent so it will probably have to be
> a loadable module… sigh..

Haven't looked at Pi yet.  Do you have a pointer to a sufficiently detailed
paper on it?

>> I've started work on an FF contract to rethink the whole IFQ* model and
>
> What is an FF contract?

FreeBSD Foundation.

>> to propose and benchmark different approaches.  After that to convert all
>> drivers in the tree to the chosen model(s) and get rid of the legacy.  In
>> general the choice of model will be done in the driver and no longer by
>> the ifnet layer.  One or (most likely) more optimized models will be
>> provided by the kernel for drivers to choose from.  The idea is that most,
>> if not all drivers use these standard kernel provided models to avoid
>> code duplication.  However as the pace of new features is quite high
>> we provide the full discretion for the driver to choose and experiment
>> with their own ways of dealing with it.  This is under the assumption
>> that once a new model has been found it is later moved to the kernel
>> side and subsequently used by other drivers as well.
>>
>>> This work is donated by Adara Networks and has been discussed in several
>>> of the past vendor summits.
>>>
>>> I plan on committing this before the IETF unless I hear major objections.
>>
>> There seem to be a couple of whitespace issues where first there is a tab
>> and then actual whitespace for the second one and others all over the place.
>>
>> There seem to be a number of unrelated changes in sys/dev/cesa/cesa.c,
>> sys/dev/fdt/fdt_common.c, sys/dev/fdt/simplebus.c, sys/kern/subr_bus.c,
>> usr.sbin/ofwdump/ofwdump.c.
>>
>
> Yeah Fabien Thomas and I have already talked on that.
>
> I had some hold over cruft that I had thought I got out.
>
> The cesa.c changes I committed this AM and the debug stuff was
> all reverted out.
>
> Plus a couple of other little tweaks.
>
> I will resend an updated (cleaned up patch) once my build-universe completes :-)

OK.

>> It would be good to separate out the soft multi-queue changes from the ring
>> depth changes and do each in at least one commit.
>
> I am not sure what you are suggesting here.

The multi-queue and the ring-depth changes in igb(4) et al should be separate
commits because they are distinct features.  The driver maintainer should sign
off on them too before committing.

>> There are two separate changes to sys/dev/oce/, one is renaming of the lock
>> macros and the other the change to drbr.
> Yeah I hit that because the LOCK name unfortunately conflicted with another so
> on one of my build-universe runs LINT would blow up ;-(
>
> That could definitely be done separately..

Please do so.  All separate functional units should be done as individual commits
to better track it and also to be able to back them out if there's a problem
with one of them.

>> The changes to sys/kern/subr_bufring.c are not style compliant and we normally
>> don't use Linux "wb()" barriers in FreeBSD native code.  The atomic_* functions should
>> be used instead.
>>
>
> Those are taken *directly* from the original code put in by Kip.. I just moved
> them over when I was refactoring things.

Ugh...

>> Why would we need a multi-consumer dequeue?
>
> I can think of one reason.. it's called lagg

Lagg should be hash based, so it could pass packets straight down to the real
interface instead of doing such a dance, which only re-orders the packets of
the same stream.
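
A rough sketch of hash-based port selection, to show why it preserves
per-stream ordering. CRC32 over a 4-tuple is a stand-in chosen for
illustration; it is not the hash lagg(4) actually uses, and `select_port`
is a hypothetical name.

```python
import zlib


def select_port(flow, num_ports):
    """Map a flow tuple to one member port of the aggregate."""
    src, dst, sport, dport = flow
    key = f"{src}|{dst}|{sport}|{dport}".encode()
    return zlib.crc32(key) % num_ports


flow_a = ("10.0.0.1", "10.0.0.2", 12345, 80)
flow_b = ("10.0.0.3", "10.0.0.2", 23456, 443)

# Every packet of a flow hashes to the same port, so packets of one stream
# never overtake each other on different links.
ports_a = {select_port(flow_a, 4) for _ in range(10)}
ports_b = {select_port(flow_b, 4) for _ in range(10)}
```

Since each flow is pinned to a single port, there is only ever one consumer
per underlying queue for that flow.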

-- 
Andre

> R
>
>
>>
>> The new bufring functions on a first glance do seem to be safe on architectures
>> with a more relaxed memory ordering / cache coherency model than x86.
>>
>> The atomic dance in a number of drbr_* functions doesn't seem to make much sense
>> and a single spin-lock may result in atomic operations and bus lock cycles.
>>
>> There is a huge amount of includes pollution in sys/net/drbr.h which we are
>> currently trying to get rid of and to avoid for the future.
>>
>>
>> I like the general conceptual approach but the implementation feels bumpy and
>> not (yet) ready for prime time.  In any case I'd like to take forward conceptual
>> parts for the FF sponsored IFQ* rework.
>
>>
>> --
>> Andre
>>
>> _______________________________________________
>> freebsd-net at freebsd.org mailing list
>> http://lists.freebsd.org/mailman/listinfo/freebsd-net
>> To unsubscribe, send any mail to "freebsd-net-unsubscribe at freebsd.org"
>>
>
> ------------------------------
> Randall Stewart
> 803-317-4952 (cell)
>
>
>