git: 5091ca26507b - main - pf: save on branching in the common case in pf_test

Gleb Smirnoff glebius at freebsd.org
Fri Sep 17 17:59:02 UTC 2021


  Mateusz,

On Tue, Sep 14, 2021 at 05:11:11PM +0200, Mateusz Guzik wrote:
M> > On Fri, Sep 10, 2021 at 09:43:06AM +0200, Mateusz Guzik wrote:
M> > M> > M> diff --git a/sys/netpfil/pf/pf.c b/sys/netpfil/pf/pf.c
M> > M> > M> index e2dd3eb7c0de..add76c7b98d4 100644
M> > M> > M> --- a/sys/netpfil/pf/pf.c
M> > M> > M> +++ b/sys/netpfil/pf/pf.c
M> > M> > M> @@ -6151,7 +6151,7 @@ pf_test(int dir, int pflags, struct ifnet *ifp, struct mbuf **m0, struct inpcb *
M> > M> > M>
M> > M> > M>  	PF_RULES_RLOCK();
M> > M> > M>
M> > M> > M> -	if (ip_divert_ptr != NULL &&
M> > M> > M> +	if (__predict_false(ip_divert_ptr != NULL) &&
M> > M> >
M> > M> > This is an optimization for a setup without divert(4) at the cost of
M> > M> > pessimization for a setup with divert(4). IMHO, the __predict_false()
M> > M> > predicate should guard against error paths and other unusual events,
M> > M> > not favor one setup over the other.
M> > M> >
M> > M>
M> > M> divert is non-standard and comes with tons of overhead, so I think
M> > M> this is justified.
M> >
M> > pf is also non-standard and comes with tons of overhead, so what
M> > about such a change to ip_input(), and a similar one to ip_output():
M> >
M> > diff --git a/sys/netinet/ip_input.c b/sys/netinet/ip_input.c
M> > index 465c00e4dac..e6feb837114 100644
M> > --- a/sys/netinet/ip_input.c
M> > +++ b/sys/netinet/ip_input.c
M> > @@ -605,7 +605,7 @@ ip_input(struct mbuf *m)
M> >          */
M> >
M> >         /* Jump over all PFIL processing if hooks are not active. */
M> > -       if (!PFIL_HOOKED_IN(V_inet_pfil_head))
M> > +       if (__predict_true(!PFIL_HOOKED_IN(V_inet_pfil_head)))
M> >                 goto passin;
M> >
M> >         odst = ip->ip_dst;
M> >
M> > I hope that brings my point across.
M> >
M> > M> The real fix would be to either implement hot patchable branches or
M> > M> otherwise generate pf_test always (or never) doing divert without
M> > M> having to check for it, but that's far in the future.
M> >
M> > That would be great, but before we have such tools, I believe optimization
M> > for one particular setup at the cost of pessimization of other setups is
M> > not the way to go.
M> >
M> 
M> As it happens, I would argue the pfil change should also be made, perhaps
M> conditional on whether a known consumer is compiled in.
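
For reference, one way to read "conditional on whether a known consumer is
compiled in" is a build-time switch on the prediction hint. A minimal sketch
against the ip_input() hunk quoted above, assuming opt_pf.h/DEV_PF is the
option spelling for a compiled-in pf (any compiled-in pfil consumer would do):

#include "opt_pf.h"	/* assumed to provide DEV_PF when pf is built in */

	/* Jump over all PFIL processing if hooks are not active. */
#ifdef DEV_PF
	/* A pfil consumer is compiled in, leave the branch unpredicted. */
	if (!PFIL_HOOKED_IN(V_inet_pfil_head))
#else
	if (__predict_true(!PFIL_HOOKED_IN(V_inet_pfil_head)))
#endif
		goto passin;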

Thanks for the reply, although it makes me think that you wouldn't have replied
at all unless you really wanted to proceed with my hyperbolic suggestion :(

How deep down this path are you going to go? Maybe

if (__predict_true(ip->ip_p == IPPROTO_TCP && th->th_dport == htons(443)))

A suggestion to get an optimization for "non-standard" stuff by compiling it
in statically sounds like a jump 20 years into the past.
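
For reference, the __predict_*() annotations being argued about are thin
wrappers around the compiler's __builtin_expect(); in sys/sys/cdefs.h they
are defined roughly as:

#define	__predict_true(exp)	__builtin_expect((exp), 1)
#define	__predict_false(exp)	__builtin_expect((exp), 0)

Their main effect is on which side of the branch the compiler lays out as the
fall-through path, so a setup that consistently mispredicts pays for a taken
branch and colder code placement on every packet.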

M> I guess I should elaborate on how I see things here.
M> 
M> The network stack comes with a rampant branch fest which can be
M> significantly reduced in easy ways. For example from ip_input:
M> 
M> #if defined(IPSEC) || defined(IPSEC_SUPPORT)
M>         /*
M>          * Bypass packet filtering for packets previously handled by IPsec.
M>          */
M>         if (IPSEC_ENABLED(ipv4) &&
M>             IPSEC_CAPS(ipv4, m, IPSEC_CAP_BYPASS_FILTER) != 0)
M>                         goto passin;
M> #endif
M> 
M> Let's say this has to be checked every time. Even then IPSEC_CAPS is a
M> func call which induces 2 more func calls, which also branch for no
M> reason. Instead the complete result could be computed so that there is
M> one bool to check (and even that could be hot patched later on).
M> Finally, it should probably be predicted one way or the other.
M> 
M> pf_test is an egregious example of this problem, and the commit at hand
M> was a step towards cleaning the state up -- it is not meant to be
M> permanent in the current form. The idea is to gradually split it into
M> parts which have to be there and parts which are optional; the
M> annotation will help going forward and should not measurably hurt
M> divert.
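
One way to picture the "one bool to check" idea from the quoted paragraph is
the hypothetical sketch below; M_IPSEC_BYPASS_PFIL is an invented mbuf flag
name used purely for illustration, not an existing kernel symbol:

	/*
	 * Hypothetical sketch: if IPsec marked the mbuf when it already
	 * handled the packet, the fast path becomes a single flag test
	 * instead of IPSEC_ENABLED() plus IPSEC_CAPS(), each of which is
	 * an indirect function call that branches again internally.
	 */
	if (__predict_false(m->m_flags & M_IPSEC_BYPASS_PFIL))
		goto passin;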

I understand your concerns, but this is the reality of networking. A machine
supports multiple protocols, encapsulations and whatnot at the same time.
I understand that some machines and workloads would work 99% of the time
on just one protocol. So you create a benchmark that exercises one particular
branch of the code for all of the packets and then optimize for that
particular branch.

To justify such changes you at least need a second benchmark that measures
how much pessimization you created for a setup with a 99% prediction
mismatch, and you need to document the results in the commit message, like:

- improves packet rate without pf hooked in by X%
- pessimizes packet rate with pf (empty ruleset) by Y%

-- 
Gleb Smirnoff

