Socket option to configure Ethernet PCP / CoS per-flow

Thank you for the quick feedback.

On a related note, it just occurred to me that the PCP functionality could be extended to make more effective use of PFC (priority flow control) without having to manage it explicitly at the application level.

Right now, PFC typically degenerates to good old flow control, as all traffic is handled in the default class (0, or whatever is set up using the per-interface ioctl API).

Typically, the different Ethernet classes come with a notion of prioritization between them: traffic in a "higher" class may be forwarded ahead of traffic in a lower class. But that is not a strong requirement. With WRR in the switch and 1/8th of the bandwidth "reserved" for each class, and with flows assigned to random PCP values, PFC could work in a more scalable fashion. It would block only the fraction of traffic that is actually building a queue (because it has to cross a lower-bandwidth link, or because a NIC is excessively pausing its ingress), thus reducing the chance of congestion trees forming.
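To make the "only a fraction is blocked" argument concrete, here is a rough sketch in C. The function name flows_blocked and the bitmask convention (bit n set means class n is paused, mirroring the per-class enable bits of a PFC frame) are my own illustration and not taken from any existing stack:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Count how many flows stop transmitting when the classes in
 * paused_mask are paused (bit n set = PCP class n paused).
 * flow_pcp[i] holds the PCP value (0..7) assigned to flow i.
 */
size_t flows_blocked(const uint8_t *flow_pcp, size_t nflows,
                     uint8_t paused_mask)
{
    size_t blocked = 0;

    for (size_t i = 0; i < nflows; i++) {
        /* Mask to 3 bits so an out-of-range value cannot overshift. */
        if (paused_mask & (1u << (flow_pcp[i] & 7)))
            blocked++;
    }
    return blocked;
}
```

With all flows in class 0, a pause on class 0 stops every flow, which is plain flow control. With 16 flows spread evenly over the 8 classes, pausing a single class stops only 2 of them.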

For reference, PCP values run from 0 (default) to 7.

Adding a socket option to explicitly assign traffic to one of these classes would allow testing and configuring applications to make use of the "real" prioritization capabilities of modern switches.
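From the application side, usage could look roughly like the sketch below. It assumes the option is exposed at SOL_SOCKET level under the name SO_VLAN_PCP (the name used in the quoted message further down) and takes a plain int; neither the level nor the argument type is settled, and the wrapper set_socket_pcp is purely illustrative. On systems whose headers do not define the option, the wrapper just reports ENOPROTOOPT.

```c
#include <errno.h>
#include <sys/socket.h>

/*
 * Set the PCP (0..7) for traffic sent on socket fd.
 * Returns 0 on success, -1 with errno set on failure.
 * ASSUMPTION: SO_VLAN_PCP at SOL_SOCKET level with an int argument.
 */
int set_socket_pcp(int fd, int pcp)
{
    if (pcp < 0 || pcp > 7) {
        errno = EINVAL;
        return -1;
    }
#ifdef SO_VLAN_PCP
    return setsockopt(fd, SOL_SOCKET, SO_VLAN_PCP, &pcp, sizeof(pcp));
#else
    /* Option not available in these headers. */
    (void)fd;
    errno = ENOPROTOOPT;
    return -1;
#endif
}
```

A tool like iperf3 could then call set_socket_pcp(fd, 5) right after socket() and before sending any data.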

What I was just pondering is a special interface-level setting (e.g. 8) that causes each socket to pick a "random" PCP value when it is created. That would distribute packets across all the queues available in hardware, so that PFC no longer collapses, in effect, to old FC-style "on"/"off" for all traffic.
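As a sketch of that selection logic only: the name PCP_RANDOM, the value 8 as the sentinel, and the function resolve_pcp are all hypothetical, and rand() merely stands in for whatever random source the kernel would really use.

```c
#include <stdlib.h>

#define PCP_RANDOM 8   /* hypothetical "pick one at random" setting */

/*
 * Map the interface-level setting to the PCP a newly created socket
 * would use. 0..7 are taken as-is; PCP_RANDOM yields a pseudo-random
 * class in 0..7; anything else is rejected with -1.
 */
int resolve_pcp(int setting)
{
    if (setting >= 0 && setting <= 7)
        return setting;
    if (setting == PCP_RANDOM)
        return rand() % 8;
    return -1;
}
```

The value would be resolved once at socket creation, so all packets of a flow stay in the same class and are not reordered across queues.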

Perhaps someone here has experience with congestion tree formation in multi-hop switching environments and can comment on whether the above approach would be a feasible way to address that FC issue?

Richard Scheffenegger

> However, while this allows all traffic sent via a specific interface to be marked with a PCP (priority code point), it defeats the purpose of PFC (priority flow control) which works by individually pausing different queues of an interface, provided there is an actual differentiation of traffic into those various classes.
> Internally, we have added a socket option (SO_VLAN_PCP) to change the PCP specifically for traffic associated with that socket, to be marked differently from whatever the interface default is (unmarked, or the default PCP).
> Does the community see value in having such a socket option widely available? (Linux currently doesn't seem to have a per-socket option either, only a per-interface IOCTL API).

I've been doing quite a bit of network testing using iperf3 and similar tools, and have wanted this type of functionality since the interface option became available. Having this on a socket level would make it possible to teach iperf3, ping and other tools to set PCP and facilitate/simplify testing of L2 networks.

So the answer is a definite yes! This would be valuable.

Steinar Haug, Nethelp consulting, sthaug at

