Panic in 6.2-PRERELEASE with bge on amd64

Tue Jan 9 14:30:56 UTC 2007

On Tue, 2007-01-09 at 12:50 +1100, Bruce Evans wrote:
> On Mon, 8 Jan 2007, Sven Willenberger wrote:
> 
> > On Mon, 2007-01-08 at 16:06 +1100, Bruce Evans wrote:
> >> On Sun, 7 Jan 2007, Sven Willenberger wrote:
> 
> >>> The short and dirty of the dump:
> >>> ...
> >>> --- trap 0xc, rip = 0xffffffff801d5f17, rsp = 0xffffffffb371ab50, rbp = 0xffffffffb371aba0 ---
> >>> bge_rxeof() at bge_rxeof+0x3b7
> >>
> >> What is the instruction here?
> >
> > I will do my best to ferret out the information you need. For the
> > bge_rxeof() at bge_rxeof+0x3b7 line, the instruction is:
> >
> > 0xffffffff801d5f17 <bge_rxeof+951>:     mov    %r15,0x28(%r14)
> > ...
> >> Looks like a null pointer panic anyway.  I guess the instruction is
> >> movl to/from 0x28(%reg) where %reg is a null pointer.
> >>
> >
> > from the above lines, apparently %r14 is null then.
> 
> Yes.  It's a bit surprising that the access is a write.
> 
> >>> ...
> >>> #8  0xffffffff801db818 in bge_intr (xsc=0x0) at /usr/src/sys/dev/bge/if_bge.c:2707
> >>
> >> What is the statement here?  It presumably follows a null pointer and only
> >> the expression for the pointer is interesting.  xsc is already null but
> >> that is probably a bug in gdb, or the result of excessive optimization.
> >> Compiling kernels with -O2 has little effect except to break debugging.
> >
> > the block of code from if_bge.c:
> >
> >   2705         if (ifp->if_drv_flags & IFF_DRV_RUNNING) {
> >   2706                 /* Check RX return ring producer/consumer. */
> >   2707                 bge_rxeof(sc);
> >   2708
> >   2709                 /* Check TX ring producer/consumer. */
> >   2710                 bge_txeof(sc);
> >   2711         }
> 
> Oops.  I should have asked for the statement in bge_rxeof().

#7  0xffffffff801d5f17 in bge_rxeof (sc=0xffffffff8836b000) at /usr/src/sys/dev/bge/if_bge.c:2528
2528                    m->m_pkthdr.len = m->m_len = cur_rx->bge_len - ETHER_CRC_LEN;

(where m is defined as:
2449                 struct mbuf             *m = NULL;
)
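
To make the "null pointer plus offset" reading concrete: a store such as
mov %r15,0x28(%r14) with %r14 == 0 touches address 0x28, which is what a
write to a struct member at offset 0x28 through a NULL pointer looks like.
The following is only a sketch. The struct names and layout are
hypothetical stand-ins chosen so that a member lands at 0x28 on an LP64
platform such as amd64; they are not copied from sys/mbuf.h, and the real
member at that offset should be checked against the kernel sources.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical header; NOT the real mbuf header from sys/mbuf.h. */
struct fake_hdr {
    void  *next;     /* offset 0x00 */
    void  *nextpkt;  /* offset 0x08 */
    char  *data;     /* offset 0x10 */
    int    len;      /* offset 0x18 */
    int    flags;    /* offset 0x1c */
    short  type;     /* offset 0x20, then padding up to 0x28 on LP64 */
};

/* Hypothetical packet header that follows the header above. */
struct fake_pkthdr {
    void  *rcvif;
    int    len;
};

struct fake_mbuf {
    struct fake_hdr    hdr;
    struct fake_pkthdr pkthdr;  /* starts at 0x28 on LP64 (assumed layout) */
};

/*
 * Address that a store to a member at `offset` touches when the struct
 * pointer has the value `base`.  With base == 0 (a NULL pointer) the
 * faulting address equals the member offset.
 */
static uintptr_t
fault_address(uintptr_t base, size_t offset)
{
    return base + offset;
}

/* Offset of the first packet-header member in the hypothetical layout. */
static size_t
fake_pkthdr_offset(void)
{
    return offsetof(struct fake_mbuf, pkthdr);
}
```

Calling fault_address(0, fake_pkthdr_offset()) gives the address such a
store would fault on. Comparing that kind of offset with the displacement
in the disassembly is one way to guess which member the faulting
statement was writing.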

> 
> > By default -O2 is passed to CC (I don't use any custom make flags and
> > only define CPUTYPE in my /etc/make.conf).
> 
> -O2 is unfortunately the default for COPTFLAGS for most arches in
> sys/conf/kern.pre.mk.  All of my machines and most FreeBSD cluster
> machines override this default in /etc/make.conf.
> 
> With the override overridden for RELENG_6 amd64, gcc inlines bge_rxeof(),
> so your environment must be a little different to get even the above
> info.  I think gdb can show the correct line numbers but not the call
> frames (since there is no call).  ddb and the kernel stack trace can
> only show the call frames for actual calls.
> 
> With -O1, I couldn't find any instruction similar to the mov to the
> null pointer + 0x28.  0x28 is a popular offset in mbufs.

If you have a suggestion for an /etc/make.conf line, I can recompile the
kernel accordingly assuming it still panics or locks up after the change
of interface noted below.

> 
> > The short of it is that this interface sees pretty much non-stop traffic
> > as this is a mailserver (final destination) and is constantly being
> > delivered to (direct disk access) and mail being retrieved (remote
> > machine(s) with nfs mounted mail spools). If a momentary down of the
> > interface is enough to completely panic the driver and then the kernel,
> > this hardly seems "robust" if, in fact, this is what is happening. So
> > the question arises as to what would be causing the down/up of the
> > interface; I could start looking at the cable, the switch it's connected
> > to and ... any other ideas? (I don't have watchdog enabled or anything
> > like that, for example).
> 
> I don't think down/up can occur in normal operation, since it takes ioctls
> or a watchdog timeout to do it.  Maybe some ioctls other than a full
> down/up can cause problems... bge_init() is called for the following
> ioctls:
> - mtu changes
> - some near down/up (possibly only these)
> Suspend/resume and of course detach/attach do much the same things as
> down/up.
> 
> BTW, I added some sysctls and found it annoying to have to do down/up
> to make the sysctls take effect.  Sysctls in several other NIC drivers
> require the same, since doing a full reinitialization is easiest.
> Since I am tuning using sysctls, I got used to doing down/up too much.
> 
> Similarly for the mtu ioctl.  I think a full reinitialization is used
> for mtu changes mainly in cases where the change switches on/off support
> for jumbo buffers.  Then there is a lot of buffer reallocation to be
> done, and interfaces have to be stopped to ensure that the buffers
> being deallocated are not in use, etc.
> 
> Bruce
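
The point about mtu changes can be sketched as a small predicate: a full
reinitialization is only strictly needed when the new mtu moves the
interface across the boundary between standard and jumbo buffers. This is
a rough illustration, not code from if_bge.c. The helper name and the
SKETCH_ETHERMTU constant are hypothetical, and 1500 is assumed here as
the standard Ethernet mtu.

```c
#include <stdbool.h>

/* Assumed standard Ethernet mtu; hypothetical name used for this sketch. */
#define SKETCH_ETHERMTU 1500

/*
 * Hypothetical helper: returns true when changing the mtu from old_mtu
 * to new_mtu switches jumbo-buffer support on or off, which is the case
 * where buffers must be reallocated and the interface stopped first.
 */
static bool
mtu_change_needs_reinit(int old_mtu, int new_mtu)
{
    bool was_jumbo = old_mtu > SKETCH_ETHERMTU;
    bool is_jumbo = new_mtu > SKETCH_ETHERMTU;

    return was_jumbo != is_jumbo;
}
```

For example, going from 1500 to 9000 crosses the boundary and would need
the stop and reallocate path, while going from 1500 to 1400 would not.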

As this was connected to a gigE switch with mtu left at 1500 I suppose
it is possible that perhaps some mtu discovery/change may have been
happening on the switch but that seems a bit out in left field. For now
I am using the fxp interface connected to the same switch to see if the
issue continues (the change of interface was driven by a hard lockup
yesterday where I could not even type anything on the term).

Sven