[HEADS UP!] IPFW Ideas: possible SoC 2008 candidate

Vadim Goncharov vadim_nuclight at mail.ru
Sun Mar 23 18:55:18 UTC 2008


[Sorry if it is too late for SoC, but I was unexpectedly busy for the
last 3 days and couldn't finish this text earlier.]

This is a proposal of ideas for improving ipfw, including architectural
changes. Some of them are independent of each other and could be
implemented in STABLE without breaking the ABI. Whether or not all of
them become a SoC 2008 candidate, they should eventually be implemented
in FreeBSD. The only question is what should be corrected, so please
discuss it :)

This text also includes slightly changed and/or generalized ideas from:

All syntax examples are only meant to give the idea; the actual syntax
should be discussed.

1. Major changes (ABI breakage is necessary).

1.1. Reorganizing dynamic rules.


ipfw's current dynamic rules are not suitable for several advanced
tricks. For example, it is not possible to use the saved information
about the current state of a connection elsewhere in the firewall rules,
nor is it possible to change that state from the firewall.

Wanted features:

 * Ability to create/delete a dynamic rule in any state via some API or
   ABI from all parts of the system: userland, ipfw rules, other kernel
   modules. This can be useful for:

   a) Creating a dynamic rule in the middle of a connection, not only at
      setup:

      ipfw add pipe 1 ip from any to any tagged 412 keep-state-middle

      This allows changing the handling of a connection after some event,
      e.g. L7 filtering by ng_bpf + ng_tag has discovered, by analyzing
      packet payload, that a connection belongs to some class, and from
      now on the connection should be handled directly by dynamic rules
      and never be sent to expensive L7 processing again.

      Currently you can use plain "keep-state" for this, but ipfw will
      not have seen the SYNs, so the rule will be subject to the sysctl
      net.inet.ip.fw.dyn_rst_lifetime and by default expires after
      1 second, which is undesirable in many cases.

   b) Ability to save/load dynamic rules in userland with files, e.g.,
      to continue after reboot.

   c) Ability to exchange rule state with another machine running ipfw,
      e.g., two firewalls in a CARP failover setup.

   d) Creation of a rule with a specified state and parameters before the
      actual connection is established. E.g. imagine a by-default-closed
      firewall with a netgraph(4) module analyzing the FTP control
      connection and giving commands to ipfw to open dynamic "holes" for
      data connections, thus eliminating the current practice of opening
      ports in the entire range 1024-65535 (insecure, yes).

   One can think about providing direct exchange between libalias(3)'s
   alias_link and ipfw dynamic rules, but that's a subject for further

 * Additional fields in dynamic rules to keep arbitrary info for a
   specific connection, and opcodes for loading and storing those values
   from other parts of the firewall or elsewhere. This would allow
   implementing pf(4)'s "scrub" maximum TTL enforcement on a connection,
   but not only that: generic data storage allows any future extensions.

 * Ability to change a dynamic rule's parent rule "on the fly" (just
   changing the pointer to the static rule whose ACTION_PTR to jump to,
   yes). This would allow the aforementioned distinguishing of
   connection packets before/after L7 processing in the case where
   packets are always classified into flows before any processing takes
   place. The example with "keep-state-middle" assumed that the main
   firewall is stateless and only L7-matched packets become subject to
   dynamic rules. It also allows reassigning the action of a dynamic
   rule:

   ipfw add 100 check-state
   ipfw add 200 skipto 500 ip from any to any keep-state
   ipfw add 500 netgraph 41 ip from any to any
   ipfw add 600 change-parent 800 ip from any to any tagged 412
   ipfw add 700 allow ip from any to any
   ipfw add 800 pipe 1 ip from any to any

 * More types in the dynamic rules system: not only "keep-state" and
   "limit", but something extensible. E.g., current "limit" rules just
   drop packets if the limit is reached, but the user may want an option
   to process them with another rule afterwards.

Possible implementation:

 * For arbitrary info: add a union of one uint32_t or two uint16_t's or
   four uint8_t's to each dynamic rule, and operations to load/store
   those values (or maybe a uint64_t and two uint32_t's and so on?..).
   Also add one void* to allow storing more data if needed.

 * Make a special netgraph node (or extend ng_ipfw) which will broadcast
   every change in dynamic rules to all its hooks (how many changes to
   bundle into one mbuf should be customizable). Every input with
   structs of the same format will result in the addition or deletion of
   dynamic rules in ipfw. A netgraph node provides a flexible and
   extensible way to manipulate dynamic rules: you can connect
   protocol trackers to it which will insert rules for secondary
   connections (e.g. FTP); you can connect a userland tool to it which
   will log all dynamic rule changes or load/save rules from/to a file;
   you can connect an ng_ksocket(4) node to it, with UDP to broadcast to
   someone or TCP to connect to another machine with the same setup, to
   provide CARP failover. Note that the node should not do
   delivery/retransmission checks as pfsync(4) does, because this is a
   task for something else (to keep modularity), but two such nodes on
   different machines connected to each other should provide automatic
   rule synchronization without additional actions after the initial
   setup.
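
The arbitrary-info storage from the first item could look roughly like
the following sketch. All names here (dyn_info, dyn_extra,
dyn_info_store16, dyn_info_load16) are hypothetical and do not exist in
ipfw; the layout is just one of the variants mentioned above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-dynamic-rule scratch area; not an existing ipfw type. */
union dyn_info {
	uint32_t u32;		/* one 32-bit value, or ... */
	uint16_t u16[2];	/* ... two 16-bit values, or ... */
	uint8_t  u8[4];		/* ... four 8-bit values (e.g. a TTL) */
};

struct dyn_extra {
	union dyn_info info;	/* small values stored inline */
	void *ext;		/* optional pointer to larger data */
};

/* Store a 16-bit value in slot 0 or 1; returns 0, or -1 on a bad slot. */
static int
dyn_info_store16(struct dyn_extra *d, unsigned slot, uint16_t val)
{
	if (slot >= 2)
		return (-1);
	d->info.u16[slot] = val;
	return (0);
}

/* Load a 16-bit value from slot 0 or 1; returns 0, or -1 on a bad slot. */
static int
dyn_info_load16(const struct dyn_extra *d, unsigned slot, uint16_t *val)
{
	if (slot >= 2)
		return (-1);
	*val = d->info.u16[slot];
	return (0);
}
```

The load/store pair stands in for the proposed opcodes; a real
implementation would also need to decide who owns and frees "ext".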

1.2. Userland (and other subsystems) interaction, modularity, rulesets.


Currently /sbin/ipfw2 is a custom-made parser which communicates with
the kernel via setsockopt() calls. It is sometimes hard to extend with
new features due to complex code. Using a socket instead of a /dev entry
means you always need to be root (uid 0) both to read the firewall
configuration and to change it. The in-kernel protocol is also sometimes
hard to extend, while some additional whole-ruleset features would be
useful.

Wanted features:

 * Parser's code (sbin/ipfw2.c) should be reviewed and possibly
   rewritten using lex(1)/yacc(1). The syntax is complicated, however,
   and it may not be possible to implement all of it exactly. This
   should be further investigated.

 * It may be desirable to give some other user the ability to at least
   read the config, and maybe to write it, as /dev/bpf* permissions allow
   it for

 * A device entry could also improve modularity: currently, to add a new
   IP_FW_* socket option, you have to modify netinet/raw_ip.c, which
   means you can't just recompile /sbin/ipfw and the ipfw kernel module.

 * The same applies to other ipfw-related facilities: dummynet, divert,
   NAT. It would be good to make them configurable by some means other
   than tweaking raw_ip.c. It could also be useful to separate dummynet
   and divert into their own facilities, to be able to use them without
   ipfw, e.g., from netgraph(4). Related to this is a problem with IPSEC
   interaction: if you use it with divert(4) on output, then on return
   from divert packets will be IPSEC'ed again, because in ip_output()
   IPSEC is called before pfil(9). It could be useful to add an option
   for the user (in addition to the existing behaviour, to not break
   POLA) to call IPSEC processing from a specified place in the ruleset
   just like all
   ipfw add ipsec ip from any to any out

 * As a patch about using rule counters is currently being discussed on
   ipfw@, it would be useful to add the ability to set rule counters to
   arbitrary values rather than providing only the "zero" action. This
   is closely related to the option of restoring ipfw's static ruleset
   without losing counter values. Currently you can save "ipfw list" to
   a file, run "awk '{print "add " $0}'" on it and then load it again
   (e.g. after reboot). It must be possible to do the same with
   "ipfw show". Syntax example for providing counters with "ipfw add";
   all cases are distinguishable (current syntax allows only the first
   two):

   ipfw add allow ip from any to any            # select next rule number
   ipfw add 100 allow ip from any to any        # exact rule number specified
   ipfw add 1234 76845 allow ip from any to any     # counters without rulenum
   ipfw add 100 1234 76845 allow ip from any to any # rulenum and counters

 * Static ruleset loading and saving is closely related to ruleset
   precompilation and atomic commits. Imagine a ruleset with thousands
   of rules: if a packet arrives in the middle of a ruleset update,
   strange effects can occur. Of course, you can achieve the same
   result with sets, by loading into a disabled new set and atomically
   swapping sets later, but that is not always convenient.
   Precompilation of the whole ruleset and then atomically installing
   it ("transaction commit") requires an implementation which would
   also allow saving and loading a precompiled ruleset in binary form,
   which is good for routers where a 20K-rule script can take several
   minutes to process.

 * Precompiled binary rules can also be used to set the same rules
   from other kernel subsystems and from other machines (CARP again).
   Thus, a generic binary rule format/protocol (not only for /dev) might
   be invented. Moreover, the compiled ruleset format may differ from
   the current linked list, which has two disadvantages: the initial
   "skipto" lookup (and the planned "call/return", see next section)
   must walk the list, and rules in disabled sets are still traversed.
   A precompiled opcodes-only form allows quick jumps, easy cross-rule
   optimizations (and even the possibility of compiling ipfw opcodes to
   machine code, like BPF_JITTER does for bpf(4), for more speed). This
   has the disadvantages of keeping rule counters separately and a
   not-so-transparent need for the user to recompile every time, so it
   should be further investigated.

 * About several rulesets for different interfaces (or hacks like a
   per-interface setting of the rule number to jump to): I think that
   this is unnecessary and unfriendly to the user. Having one ruleset
   is simpler, and you usually need common checks on packets. So
   "commit" of precompiled rules, "call/return" actions (see next
   section) and stack virtualization via "vimage" should serve all
   practical purposes.
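
To illustrate why the four "ipfw add" forms from the counters item above
are distinguishable, here is a minimal sketch that only counts the
leading all-digit arguments. The names (add_prefix, parse_add_prefix)
are hypothetical, and the order "packets, then bytes" is an assumption
taken from the column order of "ipfw show". Other existing prefixes such
as "set N" or "prob" are ignored here.

```c
#include <ctype.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical result of parsing the numeric prefix of "ipfw add". */
struct add_prefix {
	int has_rulenum;
	int has_counters;
	uint32_t rulenum;
	uint64_t pkts, bytes;	/* assumed order: packets, then bytes */
};

static int
is_number(const char *s)
{
	if (*s == '\0')
		return (0);
	for (; *s != '\0'; s++)
		if (!isdigit((unsigned char)*s))
			return (0);
	return (1);
}

/*
 * 0 leading numbers: automatic rule number, no counters
 * 1 leading number:  rule number
 * 2 leading numbers: counters only
 * 3 leading numbers: rule number and counters
 * Returns the number of argv entries consumed (0..3).
 */
static int
parse_add_prefix(int argc, char *const argv[], struct add_prefix *p)
{
	int i = 0, n = 0;

	while (n < argc && n < 3 && is_number(argv[n]))
		n++;
	p->has_rulenum = (n == 1 || n == 3);
	p->has_counters = (n >= 2);
	p->rulenum = 0;
	p->pkts = p->bytes = 0;
	if (p->has_rulenum)
		p->rulenum = (uint32_t)strtoul(argv[i++], NULL, 10);
	if (p->has_counters) {
		p->pkts = strtoull(argv[i++], NULL, 10);
		p->bytes = strtoull(argv[i++], NULL, 10);
	}
	return (n);
}
```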

Possible implementation:

The general view is clear from the feature descriptions. One can also
think about a netgraph(4) node for this (again) and/or something like
shared memory pages between kernel and userland, to avoid allocating
memory in the kernel twice for big rulesets.
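
The "transaction commit" idea can be sketched in userland C11 as a
single pointer swap: the new ruleset is built completely on the side and
then published atomically, so a packet sees either the old or the new
ruleset but never a mix. The names (ruleset, ruleset_commit,
ruleset_current) are hypothetical, and a kernel version would need its
own locking and a safe way to free the old ruleset once no reader still
uses it.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical precompiled ruleset; the opcode array is opaque here. */
struct ruleset {
	size_t nrules;
	const uint32_t *ops;
};

/* The ruleset currently used for packet processing. */
static _Atomic(struct ruleset *) active_ruleset;

/*
 * Publish a fully built ruleset in one step. Returns the previous
 * ruleset, which the caller may free only after all readers left it.
 */
static struct ruleset *
ruleset_commit(struct ruleset *next)
{
	return (atomic_exchange(&active_ruleset, next));
}

/* Reader side: take a snapshot of the active ruleset pointer. */
static struct ruleset *
ruleset_current(void)
{
	return (atomic_load(&active_ruleset));
}
```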

2. Independent (minor) changes, which are possible without ABI breakage.

2.1. call/return rule actions.

Description of feature:

A "skipto" rule is known as a useful tool to optimize packet flow
through the ruleset; it also makes it possible to assign several actions
to a dynamic rule (because a dynamic rule on match simply jumps to the
action part of its parent rule). But it can only jump forward, not
backwards, for the same reason as bpf(4) assembler instructions: to
prevent infinite loops in packet flow, which would hang the machine's
network operations. This can be addressed by introducing a pair of
instructions, call and return, which remember the position to return to
in a stack of some kind. Because return always goes to the next rule
after the calling one (by number, as with divert/skipto), it is
guaranteed that infinite loops can't occur, even if one rule is called
many times: after a stack overflow, processing simply proceeds to the
next rule.

Thus the call/return pair allows organizing a kind of subroutine, with
the trick that specifying an actual rule number lets you jump into the
middle of a subroutine, as in assembly language:

   ipfw add 100 call 600 ip from any to any in recv $internal
   ipfw add 100 call 700 ip from any to any in recv $external
   ipfw add 500 allow ip from any to any
   ipfw add 600 deny ip from any to any not antispoof
   ipfw add 700 deny tcp from any to any 135,445
   ipfw add 900 return // for both those calls

It should be noted again that calls are made by rule numbers, so in the
following example the return from the first "call 700" will pass control
to rule 301, not to the second rule 300.

   ipfw add 300 call 700 ...
   ipfw add 300 call 800 ...
   ipfw add 301 count ip ...

Allowing "tablearg" in "call" would be very useful. The parser should
allow both versions of "return": with conditions (the usual rule body)
and without them (like "check-state").

Possible implementation:

Relatively easy. On the first "call" for an mbuf, allocate an mbuf tag
holding a stack of uint16_t rule numbers and a stack top pointer. The
only things to take care of are divert etc. calls, and distinguishing
input and output passes (the firewall can be called several times for
each), thus stack underflow and overflow should be carefully analyzed.
Maybe use two tag types, one for input and one for output.
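
A minimal sketch of the stack itself, leaving out the mbuf tag
allocation. The names (call_stack, do_call, do_return) and the depth of
16 are assumptions for illustration only. Both functions return a rule
number N meaning "continue with the first rule numbered N or higher".
The sketch assumes the caller is never rule 65535, which is the default
rule.

```c
#include <stdint.h>

#define CALL_STACK_DEPTH 16	/* assumed limit, not from the proposal */

/* Hypothetical contents of the per-packet tag. */
struct call_stack {
	uint16_t rulenum[CALL_STACK_DEPTH];	/* numbers of calling rules */
	unsigned top;				/* number of saved entries */
};

/*
 * "call": remember the calling rule and continue at the target.
 * On overflow nothing is pushed and processing simply proceeds
 * with the next rule number after the caller.
 */
static uint16_t
do_call(struct call_stack *st, uint16_t caller, uint16_t target)
{
	if (st->top >= CALL_STACK_DEPTH)
		return ((uint16_t)(caller + 1));
	st->rulenum[st->top++] = caller;
	return (target);
}

/*
 * "return": continue after the most recent caller, by number.
 * On underflow, fall through to the rule after the "return" itself.
 */
static uint16_t
do_return(struct call_stack *st, uint16_t self)
{
	if (st->top == 0)
		return ((uint16_t)(self + 1));
	return ((uint16_t)(st->rulenum[--st->top] + 1));
}
```

Because every return resumes at a strictly higher rule number than its
caller and the stack is bounded, a packet cannot loop forever.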

It is difficult, however, to make this perform well, because of the
linked-list nature of the ruleset and the inability to cache a pointer
to the jump destination, as is currently done with "skipto", because
there can be several destinations (even tablearg). A possible solution
is to keep a cache of, say, 256 points in the list (rulenum / 256) to
reduce the search after this point (effectively equivalent to a hash on
rulenum). Another is to have compiled rulesets where the offset to jump
to is easily calculated (see previous section).
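
The 256-point cache could be sketched as follows, assuming a
simplified singly linked rule list sorted by rule number. The struct and
function names are hypothetical; the real ipfw rule structure is
different.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for a static rule; the list is sorted by rulenum. */
struct rule {
	uint16_t rulenum;
	struct rule *next;
};

#define NBUCKETS 256	/* one entry point per 256 rule numbers */

/* cache[b] = first rule with rulenum >= b * 256, or NULL if none. */
static void
cache_build(struct rule *head, struct rule *cache[NBUCKETS])
{
	struct rule *r = head;

	for (unsigned b = 0; b < NBUCKETS; b++) {
		while (r != NULL && r->rulenum < b * 256)
			r = r->next;
		cache[b] = r;
	}
}

/* First rule with rulenum >= target, walking from the bucket's entry. */
static struct rule *
cache_lookup(struct rule *const cache[NBUCKETS], uint16_t target)
{
	struct rule *r = cache[target / 256];

	while (r != NULL && r->rulenum < target)
		r = r->next;
	return (r);
}
```

The cache has to be rebuilt whenever rules are added or deleted, which
is the usual cost of such an index.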

2.2. Tables and tableargs.

Tables are a very powerful way to both increase processing speed and
conveniently reduce the rule maintenance cost for the user, especially
with tableargs. Tables, however, are currently limited to IPv4
addresses/masks as keys and uint32_t's as values. Table keys should be
extended to other data types: IPv6 addresses, interface name strings:

   ipfw add allow ip from any to table6(1) in recv stringtable(2)


   ipfw add call tablearg ip from any to any via stringtable(3)

The latter will be very handy for routers with e.g. 2000 VLAN or ng*
interfaces, with a separate client and rules for each.
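
A table keyed by interface name could be as simple as a sorted array
with binary search, as in this sketch. The names (strtable_entry,
strtable_lookup) are hypothetical; the key size of 16 matches IFNAMSIZ
on FreeBSD. The returned value plays the role of the tablearg, e.g. the
rule number for "call tablearg".

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define STRTABLE_KEYLEN 16	/* same as IFNAMSIZ on FreeBSD */

/* Hypothetical string-table entry: interface name -> tablearg. */
struct strtable_entry {
	char key[STRTABLE_KEYLEN];
	uint32_t value;
};

static int
entry_cmp(const void *a, const void *b)
{
	const struct strtable_entry *ea = a, *eb = b;

	return (strncmp(ea->key, eb->key, STRTABLE_KEYLEN));
}

/*
 * Look up an interface name in a table sorted by key.
 * Returns 1 and sets *value on a match, 0 otherwise.
 */
static int
strtable_lookup(const struct strtable_entry *tbl, size_t n,
    const char *ifname, uint32_t *value)
{
	struct strtable_entry probe = { .value = 0 };	/* key is zeroed */
	const struct strtable_entry *hit;

	strncpy(probe.key, ifname, STRTABLE_KEYLEN - 1);
	hit = bsearch(&probe, tbl, n, sizeof(*tbl), entry_cmp);
	if (hit == NULL)
		return (0);
	*value = hit->value;
	return (1);
}
```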

Tableargs should also be expanded to 16 bytes, to be able to store an
IPv6 address or a uint64_t for checking e.g. against rule counters. It is
questionable whether tableargs could also be short (< 16 bytes) strings
like interface names.

Due to the implementation difficulty of distinguishing whether an action
parameter is a valid value or a tablearg (you usually have only one
invalid value out of 65536, which gets assigned as the tablearg
indicator), I suggest adding operations like "settablearg" which will
set the tablearg without an actual table being used, e.g., from arbitrary
info saved in dynamic rules (see section 1.1) or even from the packet
header. So, values for a "computed goto" or something like registers
would still go through tablearg (just generalizing the definition of a
table), or at least this should be so at the opcode level. The user
could be presented with some other keyword, but I don't see any point in
hiding these details.

The number of tables of all types should be configurable via sysctl or
at least a loader tunable, rather than the current hardcoded number (128).

2.3. Time limit counter.

An opcode for a token bucket and/or leaky bucket should be introduced.
It would have one counter changed by a timer and another changed by
actual packets. We currently have the O_LOG opcode, which looks similar
to this, but O_LOG has nothing to do with a timer. The proposed opcode
would be useful at least for limiting the number of connections per
second, but any other possible use is welcome, from the simplest shaping
without dummynet to more exotic ones like counter "price" coefficients
allowing in-kernel billing to be built solely on ipfw counters.

It is questionable where the counter values should be stored, with
regard to locking optimization: inline, as with O_LOG, or in a
separately addressable space like tables.
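
A minimal token bucket sketch with the two sides described above: one
function driven by time and one driven by packets. The names
(token_bucket, tb_refill, tb_consume) and the tick-based clock are
assumptions; the "cost" argument is where a per-packet "price"
coefficient could go. Overflow of elapsed * rate and locking are not
handled.

```c
#include <stdint.h>

/* Hypothetical state of one token bucket opcode. */
struct token_bucket {
	uint64_t tokens;	/* currently available tokens */
	uint64_t burst;		/* bucket size, tokens never exceed it */
	uint64_t rate;		/* tokens added per tick */
	uint64_t last_tick;	/* time of the last refill */
};

/* Timer side: add tokens for the elapsed ticks, capped at burst. */
static void
tb_refill(struct token_bucket *tb, uint64_t now)
{
	uint64_t add = (now - tb->last_tick) * tb->rate;

	tb->last_tick = now;
	if (add > tb->burst - tb->tokens)
		tb->tokens = tb->burst;
	else
		tb->tokens += add;
}

/* Packet side: returns 1 if "cost" tokens were available, else 0. */
static int
tb_consume(struct token_bucket *tb, uint64_t cost)
{
	if (tb->tokens < cost)
		return (0);
	tb->tokens -= cost;
	return (1);
}
```

For a connections-per-second limit, a cost of 1 per matched setup packet
with rate and burst set to the limit would be the obvious configuration.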

2.4. Action rules and parameters.

Change ACTION_PTR handling in the kernel and its preparation in the
compiler to allow actions and their parameters to be placed in any order
(except for opcodes where order is required, e.g. prob). This would
easily allow placing several opcodes of the same type in the action
part, e.g.:

   ipfw add count tag 1 tag 2 tag 3 ip from any to any

and using actions and their parameters interchangeably, like having
a rule without an actual action opcode (only a parameter instead), e.g.
using "tag" or "altq" as an action too (equivalent to "count").

2.5. Just to mention: modip, counter limits, fragments.

These patches are already being discussed on ipfw@, but are included
here just so they are not forgotten. These are a "modip" action, allowing
modification of the IP header (DSCP, ToS, TTL), with corresponding match
rule options, and a rule option to match when rule counters are less
than a specified number of packets or bytes (possibly from a dynamic
rule's counters), which may be a tablearg. This is also related to the
ability to control rule counters mentioned in section 1.2.

Adding a few more keywords to O_FRAG for fragment matching (not only
non-first fragments), e.g. for sending to a specialized netgraph(4)
reassembly module, is also desirable.

That's all for today. Any comments, additions, corrections are welcome!

WBR, Vadim Goncharov. ICQ#166852181       mailto:vadim_nuclight at mail.ru
[Moderator of RU.ANTI-ECOLOGY][FreeBSD][http://antigreen.org][LJ:/nuclight]
