ipfw, status report

marta carbone marta.carbone at gmail.com
Tue Jul 21 13:16:15 UTC 2009


Hello,

The following is a report on the work done in this first phase
of the GSoC project "ipfw and dummynet improvements".
I have mostly followed what was specified in the workplan, namely:

 - moving kernel ipfw+dummynet sources to a separate directory.
   This has been committed to trunk by my mentor around rev. 193532;

 - the split of headers has been postponed because I am still working
   on untangling the various components (pipes, queues, flowsets,
   schedulers) used in dummynet. As this will take more time than
   expected, and involves the kernel-userland API, I am still
   discussing the best approach with my mentor.

 - replacing the giant switch in ipfw_chk() with one or more dispatch
   tables has been done. The code, working against the luigi svn branch,
   is here:
        http://info.iet.unipi.it/~marta/ipfw/ip_fw2.c_r2998

 - I have run a set of performance measurements to profile the ipfw_chk()
   execution times, before and after the change.

The motivation for removing the giant switch was mostly readability,
even though eventually it might help to optimize some common checks
(e.g. by providing different dispatch tables for IPv4 vs IPv6 and so on).

The replacement was done as follows:
- in a first pass, I removed all the 'goto' statements (and corresponding
  labels) from the body of ipfw_chk(), adjusting the code and variables
  to support a clean exit from the nested loops;
- at this point, each 'case' of the switch was encapsulated in a function;
- the body of the main switch was replaced by a call of these functions
  through a dispatch table (a sketch of the idea follows this list).
  At the moment there is only one table, irrespective of the type of
  packet (IPv4, IPv6, layer 2);
- as further cleanup, the pullup macro was replaced by a function;
- all changes were extensively commented.
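
As an illustration, here is a minimal, self-contained sketch of the
dispatch-table idea. All names, opcode values and the argument structure
are invented for the example and do not match the actual ipfw code:

#include <stdio.h>

/* state shared between ipfw_chk() and the opcode handlers */
struct match_args {
    int proto;      /* protocol of the packet under test */
    int match;      /* set to 1 by a handler on a match */
};

typedef void (*opcode_fn)(struct match_args *);

/* each former 'case' of the giant switch becomes one function */
static void
op_nop(struct match_args *a)
{
    a->match = 1;
}

static void
op_proto_icmp(struct match_args *a)
{
    a->match = (a->proto == 1);     /* 1 == IPPROTO_ICMP */
}

/* the dispatch table, indexed by opcode */
enum { O_NOP = 0, O_PROTO_ICMP, O_LAST };

static opcode_fn dispatch[O_LAST] = {
    [O_NOP]        = op_nop,
    [O_PROTO_ICMP] = op_proto_icmp,
};

int
main(void)
{
    struct match_args a = { .proto = 1, .match = 0 };

    /* the body of the old switch is now a single indirect call */
    dispatch[O_PROTO_ICMP](&a);
    printf("match = %d\n", a.match);
    return (0);
}

With this structure, providing different tables for IPv4, IPv6 or
layer-2 traffic would only require selecting a different array before
entering the microinstruction loop.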

The resulting code is more readable: the huge switch is reduced to a
single line that calls a function through the dispatch table, and the
main loops over rule parsing and microinstruction evaluation are
clearly visible.

Before and after these changes I ran some performance measurements
to evaluate their impact. Since the main architecture of the code was
unmodified, I did not expect big differences; the only unknowns were
the cost of the indirect call compared to the direct jump of a
switch(), and possible optimizations that might have been lost because
the code is no longer inlined.

Instead of heavily instrumenting the code, I ran the tests by issuing
a set of pings with different kernel sources (before and after the
change) and with different ipfw configurations, and plotting the
distribution of the ping times. The resolution of the measurement is
1us, and the base level for the ping times is around 70us (at least
for the machine and the 100Mbit/s switched network I was working with).

To see the effect of the changes (which might be in the nanoseconds
range), in most of the tests I forced the loops to be run many times,
either with explicit ipfw configurations (e.g. 100 'count' instructions,
possibly with multiple microinstructions for each rule), or by wrapping
the call to ipfw_chk() in an explicit loop run 100 times (a sketch of
the wrapper follows below). This gives slightly better resolution
without requiring heavier modifications to the code to read the TSC
and report the values to userland. Experiments with the TSC will be
done later.
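
For reference, a minimal user-space sketch of the amplification trick;
ipfw_chk_real() is an invented name standing for the original function
body, and the actual kernel patch is not reproduced here:

#include <stdio.h>

#define CHK_ITERATIONS 100  /* amplification factor used in the tests */

/* stand-in for the original ipfw_chk() body */
static int
ipfw_chk_real(void *args)
{
    (void)args;
    return (0);             /* pretend the packet was accepted */
}

/* wrapper: repeating the check makes a cost of a few hundred ns
 * per call visible at the 1us resolution of the ping measurement */
static int
ipfw_chk(void *args)
{
    int i, ret = 0;

    for (i = 0; i < CHK_ITERATIONS; i++)
        ret = ipfw_chk_real(args);
    return (ret);
}

int
main(void)
{
    printf("verdict = %d\n", ipfw_chk(NULL));
    return (0);
}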

In detail, the following test cases were considered:

 A. 1, 10 or 100 simple rules with 1 microinstruction each
    (count proto icmp);
 B. 10 or 100 rules with 5 microinstructions each
    (count proto icmp not proto tcp not proto udp not proto tcp
    not proto udp);

These tests were repeated with three versions of the code:
 "switch"       the giant switch that is in HEAD;
 "dispatch"     the dispatch table;
 "dispatch100"  the dispatch table plus a wrapper around ipfw_chk()
                that calls the function 100 times on each packet.

The tests were run on RELENG_7, HEAD, and Linux 2.6.28, using 500
pings (ping -c 500 -i 0.05) from a computer connected through a
100Mbit/s full-duplex switch. The distribution of the response times
was then plotted, and I took as a reference the values at the 20th
percentile of the distribution.

Both the client and the machine under test were unloaded, so the
distribution curves were mostly flat up to 80-90% of the samples.
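
As an aside, extracting the 20th-percentile reference value is
straightforward. A possible sketch (assumed; not necessarily the script
actually used) that reads one round-trip time per line and prints the
sample at 20% of the sorted distribution:

#include <stdio.h>
#include <stdlib.h>

static int
cmp_double(const void *a, const void *b)
{
    double d = *(const double *)a - *(const double *)b;

    return ((d > 0) - (d < 0));
}

int
main(void)
{
    double t[500];
    size_t n = 0;

    /* one round-trip time per line on stdin, any consistent unit */
    while (n < 500 && scanf("%lf", &t[n]) == 1)
        n++;
    if (n == 0)
        return (1);
    qsort(t, n, sizeof(t[0]), cmp_double);
    printf("20%% value: %.3f\n", t[n / 5]);
    return (0);
}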

Results (ping times in us) are in the following table:

        test case               switch          dispatch        dispatch100
        HEAD-A-1                79 (79)         79 (79)         98 - 98
        HEAD-A-10               80 (80)         80 (80)         200 - 200
        HEAD-A-100              88 (86)         89 (87)         1040 - 1050
        HEAD-B-1
        HEAD-B-10               81 (80)         81 (81)         316 - 315
        HEAD-B-100              95 (94)         100 (97)        2203 - 2203

        RELENG_7-A-1            74
        RELENG_7-A-10           75
        RELENG_7-A-100          82
        RELENG_7-B-1
        RELENG_7-B-10           76
        RELENG_7-B-100          90

        linux(*)
        linux-A-1               75              75              111 - 107
        linux-A-10              76              76              205 - 207
        linux-A-100             85              85              1038 - 1041
        linux-B-1               75              75              113 - 106
        linux-B-10              78              77              323 - 315
        linux-B-100             99              96              2170 - 2170

Values reported in parentheses refer to an amd64 system.
(*) The Linux system under test runs a 2.6.28-11-generic (i686) kernel
on Ubuntu.
Values separated by "-" are from repeated tests run under the same
conditions.

Looking at the curves (http://info.iet.unipi.it/~marta/ipfw/report2_plot/)
one can see that the introduction of the dispatch table does not affect
the execution time of the code: test cases with few rules give almost
identical results, while slight differences arise when evaluating more
rules with complex microinstructions (see the HEAD-B-10 and HEAD-B-100
curves for each case).

Also, from the above and other measurements (done against a Linux
system on the same PC), we can derive the time spent in each phase of
ipfw_chk() processing, namely:

- entering ipfw_chk() and setting up variables for
  the processing of instructions (1 RWlock):            100ns
- processing the rule header and action:                 50ns
- processing a simple microinstruction:                  12ns
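
As a rough cross-check of these figures: for test case B with 100
rules, one ipfw_chk() call should cost about
100 + 100 x (50 + 5 x 12) = 11100ns, i.e. roughly 11us. With the
dispatch100 wrapper (100 calls per packet), and assuming each ping
crosses the firewall twice (incoming echo request plus outgoing
reply), this predicts about 2.2ms of extra latency, in line with the
2203us measured for HEAD-B-100.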

As expected, the results do not show significant performance
differences between the switch and the dispatch table versions, but
the resulting code is definitely more readable than the old one.

I'm planning to finish the code and data structure reorganization,
and to start the next task related to the kernel/userland interface
efficiency.

marta

