panic in rt_check_fib()

Sun Sep 14 12:56:53 UTC 2008

On Sat, 13 Sep 2008 23:28:51 -0700, Julian Elischer <julian at elischer.org> wrote:
> To recap on this, I rewrote this function a couple of weeks ago because I
> couldn't keep track of what was going on, and I thought it might
> have some bad edge cases.  A couple of days later Giorgos contacted me
> saying that he had a fairly reproducible situation
> where this was triggered and it appeared to be an edge case in
> this function that allowed it to try to lock the same lock twice.
>
> I immediately thought "ah-hah!" I may have a solution to this,
> and gave him a copy of my new function and indeed it DOES fix that
> panic.  However, after deleting and recreating interfaces a few hundred
> times without crashing in rt_check_fib() it then fails somewhere else
> (actually it leaks some resources and eventually networking stops).
>
> I'm not convinced that is a problem with the new or old rt_check() but
> it did stop me from just committing the new code.
>
> In rereading the way the function (did and still does) work, it
> occurred to me that there was a large flaw in the way it worked.
>
> It dropped the lock on one route while it went off and did something
> else that might block.  On returning it blindly re-grabbed that lock,
> completely ignoring the fact that the route might not even be valid any
> more (or any of several other things that may have changed while
> it was away, maybe sleeping).
>
> The code Giorgos is referring to is a patch I suggested to him to
> fix this oversight and not the one that I originally tested and
> had suggested to fix the edge case.
>
> I do however ask that some other people look at this patch!

Exactly.  Thanks for summarizing this so well :)

I have started a kernel with your latest patch (from the quoted message
above), and I can't panic my kernel with the script that did it in a
semi-reliable manner before:

% root@kobe:/root# while true ; do \
%         sh home.sh > /dev/null 2>&1 ; \
%         vmstat -z | sed -n -e 1p -e /rt/p ; \
%         sleep 1 ; \
%     done
% ITEM                     SIZE     LIMIT      USED      FREE  REQUESTS  FAILURES
% rtentry:                  120,        0,       19,       77,       43,        0
% ITEM                     SIZE     LIMIT      USED      FREE  REQUESTS  FAILURES
% rtentry:                  120,        0,       20,       76,       47,        0
% ITEM                     SIZE     LIMIT      USED      FREE  REQUESTS  FAILURES
% rtentry:                  120,        0,       21,       75,       51,        0
% ITEM                     SIZE     LIMIT      USED      FREE  REQUESTS  FAILURES
% rtentry:                  120,        0,       23,       73,       55,        0
% ITEM                     SIZE     LIMIT      USED      FREE  REQUESTS  FAILURES
% rtentry:                  120,        0,       24,       72,       59,        0
% ITEM                     SIZE     LIMIT      USED      FREE  REQUESTS  FAILURES
% rtentry:                  120,        0,       25,       71,       62,        0
% ITEM                     SIZE     LIMIT      USED      FREE  REQUESTS  FAILURES
% rtentry:                  120,        0,       26,       70,       65,        0
% ITEM                     SIZE     LIMIT      USED      FREE  REQUESTS  FAILURES
% rtentry:                  120,        0,       27,       69,       69,        0
% ITEM                     SIZE     LIMIT      USED      FREE  REQUESTS  FAILURES
% rtentry:                  120,        0,       29,       67,       73,        0
% ITEM                     SIZE     LIMIT      USED      FREE  REQUESTS  FAILURES
% rtentry:                  120,        0,       30,       66,       76,        0
% ^C
% root@kobe:/root# sh home.sh

rtentries seem to be going up every time I cycle through the script,
which essentially brings down both wireless and wired interfaces and
then brings up the wired interface of my laptop.  The core of the script
is currently:

  # network interface options
  export ifconfig_re0="inet 192.168.1.10/24"
  export defaultrouter='192.168.1.1'

  echo '## Stopping network interfaces.'
  /etc/rc.d/netif stop re0  && ifconfig re0  delete
  /etc/rc.d/netif stop iwn0 && ifconfig iwn0 delete

  echo '## Bringing up network interface.'
  /etc/rc.d/netif start re0

  echo "## Reloading firewall rules."
  /etc/rc.d/pf reload

  # The default route may be pointing to another interface.  Find out
  # the IP address of the default gateway, delete it and point to the
  # default gateway configured as ${defaultrouter}.
  if [ -n "${defaultrouter}" ]; then
          echo '## Setting default router.'
          _oldrouter=`netstat -rn | grep default | awk '{print $2}'`
          if [ -n "${_oldrouter}" ]; then
                  route delete default "${_oldrouter}"
                  unset _oldrouter
          fi
          route add default "$defaultrouter"
  fi
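
The default-route lookup above takes whatever 'netstat -rn | grep default'
matches, so on a box with both an IPv4 and an IPv6 default route
_oldrouter could end up holding more than one address.  A minimal sketch
of a stricter lookup follows.  The function name first_default_gw and the
sample routing table are made up for illustration and are not part of
home.sh:

```sh
# first_default_gw: read `netstat -rn` style output on stdin and print
# the gateway of the first line whose destination is exactly "default".
# Hypothetical helper, not part of the original script.
first_default_gw() {
    awk '$1 == "default" { print $2; exit }'
}

# Made-up sample table, only to show the behaviour: the first
# "default" line wins and later ones are ignored.
first_default_gw <<'EOF'
Destination        Gateway            Flags    Refs      Use  Netif Expire
default            192.168.1.1        UGS         0      123    re0
127.0.0.1          127.0.0.1          UH          0        0    lo0
default            fe80::1%re0        UGS                     re0
EOF
```

In the script this would presumably be fed from 'netstat -rn -f inet',
which limits the listing to IPv4 routes, instead of the here-document.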

With your version of rt_check_fib() I have no panics so far.  This
doesn't mean we don't have a bug elsewhere, or that it will not panic
tomorrow, but it's nice that things seem a bit more stable now.  The old
version of rt_check_fib() used to panic about one third of the time I
ran my 'home.sh' script...

Now an interesting question is: Is it `normal' that the USED rtentry
objects keep going up at every interface restart and are (at least at
first glance) not reclaimed as fast as they are acquired?
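
To make the growth easier to see than by eyeballing the USED column, the
vmstat output can be reduced to one line per sample with the change since
the previous sample.  This is only a rough sketch: the function name
rt_used_delta is made up, and it assumes the 'rtentry:' line keeps the
comma-separated layout shown in the transcript above (USED is the third
number after the colon).  The sample input is the first four readings
from that transcript:

```sh
# rt_used_delta: read `vmstat -z` output on stdin and, for every
# "rtentry:" line, print the USED count and its change since the
# previous "rtentry:" line.  Hypothetical helper name.
rt_used_delta() {
    awk -F'[:,]' '
        /^rtentry:/ {
            used = $4 + 0            # fields: name, SIZE, LIMIT, USED, ...
            if (seen)
                printf "used=%d delta=%d\n", used, used - prev
            else
                printf "used=%d delta=n/a\n", used
            prev = used
            seen = 1
        }'
}

# First four samples copied from the transcript above.
rt_used_delta <<'EOF'
ITEM                     SIZE     LIMIT      USED      FREE  REQUESTS  FAILURES
rtentry:                  120,        0,       19,       77,       43,        0
rtentry:                  120,        0,       20,       76,       47,        0
rtentry:                  120,        0,       21,       75,       51,        0
rtentry:                  120,        0,       23,       73,       55,        0
EOF
```

For a live run the whole loop has to be piped into a single invocation,
so that the previous value survives between samples, along the lines of
'while true ; do sh home.sh > /dev/null 2>&1 ; vmstat -z ; sleep 1 ; done | rt_used_delta'.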