arcmsr crash
Matt Reimer
mattjreimer at gmail.com
Fri Jul 13 20:46:24 UTC 2007
On 7/13/07, Scott Long <scottl at samsco.org> wrote:
> John Baldwin wrote:
> > On Tuesday 05 June 2007 05:22:38 pm Matt Reimer wrote:
> >> Once a week or so we're seeing a panic with a -current kernel built
> >> just before the gcc 4.2 import (maybe three weeks ago). The box has a
> >> Supermicro X7DBE/X7DBE+ motherboard with two Xeon 5160s, 16G RAM, and
> >> an Areca 1220 controller with eight 500G disks connected.
> >>
> >> Does this indicate that the arcmsr driver is at fault?
> >>
> >> Tracing command irq16: arcmsr0 pid 26 tid 100018 td 0xffffff040fc5b000
> >> cpustop_handler() at cpustop_handler+0x35
> >> ipi_nmi_handler() at ipi_nmi_handler+0x2e
> >> trap() at trap+0x365
> >> nmi_calltrap() at nmi_calltrap+0x8
> >> --- trap 0x13, rip = 0xffffffff8041ab11, rsp = 0xffffffffab59eff0, rbp = 0xffffffffac0a37d0 ---
> >> siocnclose() at siocnclose+0x21
> >> sio_cnputc() at sio_cnputc+0x89
> >> cnputc() at cnputc+0x6a
> >> putchar() at putchar+0x5f
> >> kvprintf() at kvprintf+0xd45
> >> printf() at printf+0xe1
> >> panic() at panic+0x145
> >> xpt_done() at xpt_done+0x14a
> >> arcmsr_interrupt() at arcmsr_interrupt+0x2df
> >> ithread_loop() at ithread_loop+0x108
> >> fork_exit() at fork_exit+0xaa
> >> fork_trampoline() at fork_trampoline+0xe
> >> --- trap 0, rip = 0, rsp = 0xffffffffac0a3d30, rbp = 0 ---
> >
> > Looks like it has panic'd here:
> >
> >     switch (done_ccb->ccb_h.path->periph->type) {
> >     case CAM_PERIPH_BIO:
> >         mtx_lock(&cam_bioq_lock);
> >         TAILQ_INSERT_TAIL(&cam_bioq, &done_ccb->ccb_h,
> >             sim_links.tqe);
> >         done_ccb->ccb_h.pinfo.index = CAM_DONEQ_INDEX;
> >         mtx_unlock(&cam_bioq_lock);
> >         swi_sched(cambio_ih, 0);
> >         break;
> >     default:
> >         panic("unknown periph type %d",
> >             done_ccb->ccb_h.path->periph->type);
> >     }
> >
> > which would seem to indicate that, yes, it is a driver bug.
> >
>
> The doneq has gotten corrupted somehow. The only real way that this
> could happen is if xpt_done() was called twice on the same ccb. Whether
> this is a hardware bug (hardware completing the same command twice) or
> a driver bug is unknown. I'll try to add some seatbelts to CAM to
> detect this kind of condition. But yes, it's ultimately something in
> the arcmsr subsystem that is at fault.
Do you have any suggestions of instrumentation printfs I could add to
zero in on what part of the driver is at fault?
Matt
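
For illustration, the double-completion condition Scott describes (xpt_done() called twice on the same ccb) could be caught with a small per-command guard like the one below. This is only a userland sketch under stated assumptions: struct cmd_track, cmd_track_start() and cmd_track_complete() are hypothetical names, not part of arcmsr or CAM, and in the driver the same check would sit on whatever per-command structure the driver keeps, just before the call to xpt_done(), using the kernel printf instead of fprintf.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical per-command tracking record. It stands in for the
 * driver's own per-command structure and is NOT a real arcmsr or
 * CAM type.
 */
struct cmd_track {
	unsigned id;         /* illustrative command identifier */
	bool     in_flight;  /* true between submission and completion */
	unsigned done_count; /* completions seen since the last start */
};

/* Mark a command as submitted to the hardware. */
static void
cmd_track_start(struct cmd_track *c, unsigned id)
{
	assert(c != NULL);
	c->id = id;
	c->in_flight = true;
	c->done_count = 0;
}

/*
 * Record a completion. Returns true if this is the first completion
 * since the command was started, false if the command was already
 * completed (or never started). The caller would skip the real
 * completion path when this returns false.
 */
static bool
cmd_track_complete(struct cmd_track *c, const char *where)
{
	assert(c != NULL);
	c->done_count++;
	if (!c->in_flight) {
		fprintf(stderr,
		    "duplicate completion of cmd %u from %s (count %u)\n",
		    c->id, where, c->done_count);
		return false;
	}
	c->in_flight = false;
	return true;
}
```

Calling cmd_track_complete() twice for the same started command returns true the first time and false the second, logging the call site passed in `where`, which helps narrow down which completion path fires twice.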