HEADS UP: netipx mega-MFC (1/2)

Sat Feb 26 10:41:28 GMT 2005

On Sat, 26 Feb 2005, Bob Johnson wrote:

> I wasn't planning to be near that system again until Monday, although
> if you need details about my configuration I can dig them out remotely
> and send them to you.  I'm not brave enough to risk another panic when
> I'm not physically there to reset the system, but if you are willing to
> work on it this weekend I'm willing to drive over there and test it.  I
> would REALLY like to see the Netware stuff working in 5.4R. 
> 
> A few weeks ago I had the same panic when I tried to set up Netware
> support on 5.3-Release.  It went away when I updated to -stable then (I
> believe it was around Jan 24), but the packets going out on the wire
> were not quite right, so I couldn't actually use it for networking. 

So to confirm what I think you're saying:

- 5.3-R panicked configuring ipx/ncp/nwfs against a Netware server.
- 5.3-S up until my recent changes didn't panic, but there appeared to be
  on-the-wire corruption.
- 5.3-S (5.4-P) as of yesterday now panics again.

So it sounds like we're still dealing with at least two problems: some sort
of panic, and an on-the-wire problem.  I think the first order of
business is to get the panic fixed -- chances are, it's a pointer botch of
some sort, if you're seeing a fault.  Here are some things that you could
do to help me debug these problems: 

(1) Compile your kernel with DDB/KDB, and configure a dump partition.
    Make sure you have a kernel with debugging symbols on-hand.  If you've
    not done this before, instructions can be found in the handbook.  Many
    bugs can be debugged using just DDB/KDB, but a dump for post-mortem
    analysis can be quite helpful for more complex bugs.  Even if we don't
    manage to get kernel dumps, we'll need the kernel with debugging
    symbols to convert addresses into lines of source code.

(2) When reporting a panic, please report the exact steps it took to get
    to the panic.  I can imagine a number of bugs we might have that might
    trigger at different points, and I'm not currently clear on which it
    is.  For example, the IPX code might panic when ifconfig runs to
    configure an address, or the panic might happen at file system mount
    time when you call mount_nwfs.  Or does the panic happen later
    on first file access, or after some period of activity?  Knowing which
    of these it is would be very helpful in narrowing down the source of
    the problem.

(3) When reporting a panic, it's helpful to have as much of the trap or
    panic output as possible.  I don't know if you're currently using a
    serial console or not, but I find that a serial console is very
    helpful in gathering debugging information, as it makes it easy to
    copy and paste output.  If you get into DDB following the panic, the
    commands "show pcpu", "ps", and "trace" are almost always good
    starting points for debugging.  With a serial console, sending that
    output by e-mail will be dramatically easier :-).  When not running
    with a serial console, many people will use digital cameras to take
    pictures of debugger output, because that's still more convenient than
    trying to write it down or type it in (lots of hex digits :-). 
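
As a rough illustration of the setup described in (1), the fragment below
sketches the kernel configuration and rc.conf additions typically used on
5.x.  The dump device name (/dev/ad0s1b) is only an assumption for the sake
of the example; substitute your own swap partition, which normally needs to
be at least as large as physical memory for a full dump to fit.  The
handbook remains the authoritative reference.

```
# Additions to the kernel configuration file (e.g. a copy of GENERIC)
makeoptions	DEBUG=-g	# build the kernel with debugging symbols
options 	KDB		# kernel debugger framework
options 	DDB		# interactive in-kernel debugger backend

# Additions to /etc/rc.conf
dumpdev="/dev/ad0s1b"		# ASSUMED device name; use your swap partition
dumpdir="/var/crash"		# where savecore(8) stores the dump at next boot
```

With this in place, a panic should drop to the DDB prompt, where the "show
pcpu", "ps", and "trace" commands mentioned in (3) can be run before the
dump is taken.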

I'll be pretty available this weekend to help with debugging this.  Not
sure if you've done it yet, but it might be useful to boot the previous
kernel and just make sure that the panic only happens with the new kernel,
and that it wasn't triggered by some other change in your environment. 
That seems fairly unlikely, but it's good to check assumptions because it
can save a lot of time and confusion :-). 

Robert N M Watson