cvs commit: src/sys/netinet in_var.h ip_fastfwd.c ip_flow.c ip_flow.h ip_input.c ip_output.c src/sys/sys mbuf.h src/sys/conf files src/sys/net if_arcsubr.c if_ef.c if_ethersubr.c if_fddisubr.c if_iso88025subr.c if_ppp.c

Fri Nov 14 18:48:26 PST 2003

Luigi Rizzo wrote:
> 
> On Fri, Nov 14, 2003 at 01:02:23PM -0800, Andre Oppermann wrote:
> ...
> >   Introduce ip_fastforward and remove ip_flow.
> >
> >   Short description of ip_fastforward:
> >
> >    o adds full direct process-to-completion IPv4 forwarding code
> >    o handles ip fragmentation incl. hw support (ip_flow did not)
> >    o sends icmp needfrag to source if DF is set (ip_flow did not)
> >    o supports ipfw and ipfilter (ip_flow did not)
> >    o supports divert, ipfw fwd and ipfilter nat (ip_flow did not)
> >    o returns anything it can't handle back to normal ip_input
> 
> Sorry if i missed part of the discussion leading to this commit...

Was discussed on -current and -net for the past week.

> Let me state first that I do not mean to criticise your work, but
> having spent quite a bit of time trying to optimize the forwarding
> path myself, and having found a number of surprises on what really
> consumes most of the time, i tend to be a bit skeptical when it
> comes to optimizations.
> 
> Obviously I see it as a very good thing the creation of a fully
> featured fast forwarding path. However I would like to understand
> if it is really worthwhile doing it, and whether this new code
> isn't a replica of already existing optimizations.

It isn't a replica of the existing code. Of course it replicates
some code and checks.

> In particular, did you measure any significant performance difference
> by replacing calls to ip_fastforward() with calls to ip_input() ?
> (say in a kernel without "options IPSEC" which used to be highly
> inefficient even for non-ipsec packets).

Yes, the forwarding delay (hop delay) is a little bit less than
the ip_input way but the jitter is much less. I don't have measurements
down to the cpu instruction cycle yet and it's not yet totally
optimized. The goal of this first version is to be fast and a
robust foundation for more optimizations. See more comments further
down.

> The reason I am asking is that I see some expensive ops (such as
> doing route lookups, scanning through lists of local addresses,
> input and output firewall processing, fragmentation) in both pieces
> of code. Fragmentation may not be the common case, but all the rest
> definitely is, and with large routing tables, multiple interfaces
> and even modestly complex firewall rulesets, most of the processing
> time goes there.

Doing route lookups is the core function of every router, so I don't
see any way how to get rid of that. ;-)  The fastforward code can
serve as basis to implement better routing tables like highly compact
and pre-computed LCtries. It is almost impossible to do anything like
that in the normal ip_input and so on forwarding path due to its
very loomy relationship with the BSD routing table.

The interface stuff is not as bad as it looks on the first glance.
Normally a interface has only one ip address on it and that makes
the foreach loop very small. I have one optimization idea for that
already. Since we require that check only to determine whether we
have to fallback to ip_input, the local ip address list can be
either locally cached or built into a perfect hash structure which
is being regenerated on every change in interface addresses (which
is once every n-million packets). We don't need any pointers so
even having a stale or not-yet entry for a second or so ain't a real
problem.

Against complex firewall rules I can't do anything. The user is free
to shoot himself into the foot. Anyway, with ipfw2 it's much less of
a problem than it used to be.

> Also, in the initial comments you claim:
> 
>  * ip_fastforward gets its speed from processing the forwarded packet to
>  * completion (if_output on the other side) without any queues or netisr's.
> 
> now, netisr_dispatch() (which is invoked if ip_fastforward() fails)
> does an inline call to ip_input() if netisr_enable=1 (ip_input
> is NETISR_MPSAFE). So this seems to fall into the same category
> of optimizations.

As it looks at the moment this can't be enabled because there are
many locking implications later on in the tcp and udp code. Also
the stack gets very large due to so many functions which get called
after each other.

> So to summarize, could you identify what are the differences
> that make ip_fastforward() so much faster than
> ip_input()/ip_forward()/ip_output() ?

At the moment it's two things. One is process to completion with
a small stack. Second is far less branches and checks (multicast,
etc) than in the ip_input(), ip_forward() and ip_output() chain.

In the future it can be optimized further as I've outlined before.
And these optimizations can be made for the forwarding path only,
without having to mangle it into the big giant mess of ip_input
and friends.

> Given that there are large segments of common code between
> ip_fastforward() and ip_input()/ip_output() (i am thinking of the
> entire ipfw handling code, for one, and also some basic integrity
> checks, the fragmentation code, etc.) I also wonder if it wouldn't
> be beneficial to put the optimizations into the standard path rather
> than create a new (partial) replica of the same code, with the
> potential of introducing bugs, and with some substantial I-cache
> pollution which might well destroy the benefits of minor optimizations.

I don't see much cache pollution here. Normally you use ip_fastforward
for fast packet forwarding where you want the cpu to touch as little
as possible to handle as many packets (different) as possible.

The cache pollution with ip_input and friends is far higher due to
its large stack usage.

If we actually have something like a packet flow the second and so
on packet would probably get all lookups out of the cpu cache.

The only case with some possible cache pollution is the dropback
case in situations where you are doing forwarding but most packets
are destined for an interface on the local machine.

For potential bugs. Sure all new code can have more bugs. I've chosen
to comment the code very well and write down the what is otherwise
happening in ip_input and friends in the dark. If there is actually
a bug it should be very easy to spot. On a side note; the ipfw api
is kinda difficult to understand and work with. It took me some time
to get it.

> -----------
> 
> Minor comments on the code:
> 
>   + one of the initial comments in the new code states
> 
>         ... The only part of the packet we touch with the CPU is the
>         IP header. ...
> 
>     this is not true if you use ipfw because that code touches many
>     places in the packet (and can also do some expensive computation
>     like trying to locate the uid/gid of a packet; the fact that we
>     only deal with packets not for us does not prevent the existence
>     of such firewall rules).

Well, as I said, everybody is free to shoot himself with such highly
complex firewall rules. I'd say the ipfw code could be optimized with
some of the ideas I've specified earlier. I don't think the ipfw code
would do a uid/gid lookup if neither the destination nor source is
local (there is simply no socket to check against). But that is pure
speculation as I haven't researched or looked at that code yet.

>   + The code at line 344:
> 
>         if (fw_enable && IPFW_LOADED) {
>                 bzero(&args, sizeof(args));
>                 args.m = m;
>                 ipfw = 0;
> 
>                 ipfw = ip_fw_chk_ptr(&args);
>                 m = args.m;
>    the bzero is probably more expensive than individually zeroing the
>     fields of args as done in ip_input() (I think once i measured the
>     two solutions). The first "ipfw = 0;" is redundant.

I chose a safe programming style with the bzero. So I am sure that
every bit in the structure is zero, also when it changes some day.
The ipfw = 0 is redundant indeed. Will remove it the next time I'm
touching that file.

>   + could you clarify the divert logic ? I am a bit rusty with that
>     part of the code, but i am under the impression that in
>     ip_fastforward() you are passing along args.divert_rule and
>     losing track of divert_info which is instead what you need too.

It's not you being rusty, the code is indeed hard to follow. :-/
divert_info is used for ip packet reassembly. ip_divert() is then
just using it to determine whether the packet was catched on the
way into the machine or out of it. It seems to have it's largest
significance for the ip reassembly. I've tested that too with an
earlier version of my code. However I will redo those tests to be
sure it is working as expected.

Diverting and natd'ing works very well with ip_fastforward, so it
can't be completely wrong what I did. ;-)

Thanks for your input.

-- 
Andre