regression: msk0 watchdog timeout and interrupt storm
Curtis Villamizar
curtis at ipv6.occnc.com
Wed Jan 1 01:53:26 UTC 2014
I'm getting an interrupt storm from mskc running with the latest
if_msk.c code. The OS is built from source (r259540):
FreeBSD 10.0-PRERELEASE (GENERIC) #0 r259540: Sat Dec 21 00:05:39 EST 2013
While not the latest, the point is that sys/dev/msk is up to date wrt
stable_9 and also wrt head.
The odd thing is that the machine seemed to run fine for a day or two
and then started exhibiting this behaviour and has become useless.
This is now highly reproducible (it happens within seconds when trying
to do a long file transfer between two machines with GbE) so if there
is anything I can do to instrument this, please make suggestions.
What I know so far is:
1. When the watchdog occurs, Y2_IS_STAT_BMU is set in the prior
interrupt mask.
2. This takes us from msk_intr into msk_handle_events, with
msk_handle_events returning 0.
3. msk_handle_events reads in sc->msk_stat_cons. The last recorded
value of sc->msk_stat_cons is always 1024.
4. The only way to exit msk_handle_events with sc->msk_stat_cons
greater than zero yet not do anything is to hit the conditional at
the top of the loop and fall out:
        sd = &sc->msk_stat_ring[cons];
        control = le32toh(sd->msk_control);
        if ((control & HW_OWNER) == 0)
                break;
5. The code after the loop can return zero if the ring buffer
pointer hasn't moved. That code is:
        sc->msk_stat_cons = cons;
        bus_dmamap_sync(sc->msk_stat_tag, sc->msk_stat_map,
            BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE);
        if (rxput[MSK_PORT_A] > 0)
                msk_rxput(sc->msk_if[MSK_PORT_A]);
        if (rxput[MSK_PORT_B] > 0)
                msk_rxput(sc->msk_if[MSK_PORT_B]);
        return (sc->msk_stat_cons != CSR_READ_2(sc, STAT_PUT_IDX));
6. If the return value is zero, the interrupt isn't cleared. That
was suspect. The code in msk_intr is:
        domore = msk_handle_events(sc);
        if ((status & Y2_IS_STAT_BMU) != 0 && domore == 0)
                CSR_WRITE_4(sc, STAT_CTRL, SC_STAT_CLR_IRQ);
7. This code before the return in msk_handle_events should force
the clear but doesn't fix anything.
        if ((control & HW_OWNER) == 0)
                return;
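To make the control flow in items 4 through 6 easier to poke at outside the kernel, here is a rough user-space model. It is only a sketch: the struct, the function names, the ring size, and the value used for the owner bit are all made up for illustration and are not taken from if_msk.c. It shows that when the put index has moved but the entry at the consumer index does not have the owner bit set, the handler falls out of the loop without advancing, returns nonzero, and the clear in the interrupt path is skipped.

```c
#include <assert.h>
#include <stdint.h>

#define RING_CNT  8            /* hypothetical ring size, not the driver's */
#define HW_OWNER  0x80000000u  /* hypothetical owner bit for this model only */

struct model_sc {
	uint32_t ring[RING_CNT];   /* stands in for msk_stat_ring[].msk_control */
	unsigned cons;             /* stands in for sc->msk_stat_cons */
	unsigned put_idx;          /* stands in for reading STAT_PUT_IDX */
	int      irq_cleared;      /* set where the driver would write SC_STAT_CLR_IRQ */
};

/* Returns nonzero when the consumer index differs from the put index. */
static int
model_handle_events(struct model_sc *sc)
{
	unsigned cons = sc->cons;

	assert(cons < RING_CNT);
	for (;;) {
		if ((sc->ring[cons] & HW_OWNER) == 0)
			break;                      /* fall out without doing anything */
		sc->ring[cons] &= ~HW_OWNER;    /* consume the entry */
		cons = (cons + 1) % RING_CNT;
	}
	sc->cons = cons;
	return (sc->cons != sc->put_idx);
}

/* Mirrors the quoted msk_intr fragment: clear only when domore == 0. */
static void
model_intr(struct model_sc *sc)
{
	int domore = model_handle_events(sc);

	if (domore == 0)
		sc->irq_cleared = 1;
}
```

With two owned entries and put_idx at 2, model_intr consumes both and sets irq_cleared. With put_idx at 3 and no owner bits set, cons stays where it was and irq_cleared stays 0, so every later call takes the same path.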
This looks like some sort of fall-off-the-end-of-a-ring-buffer
problem (since it always points to entry 0x400), but since I haven't
done driver work in ages, that is mostly just a wild guess and I
really have no idea yet as to what is going wrong.
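One way to test the ring buffer guess would be a range check on the consumer index wherever it gets updated. A minimal sketch follows; the helper names are hypothetical, and the ring size is passed in as a parameter because the real count has to come from the driver's headers and I am not assuming a value for it here.

```c
#include <assert.h>

/* Hypothetical helper: advance a ring index, wrapping at cnt. */
static unsigned
ring_next(unsigned idx, unsigned cnt)
{
	assert(cnt > 0);
	return ((idx + 1) % cnt);
}

/* Hypothetical helper: valid indices are 0 .. cnt - 1. */
static int
ring_idx_in_range(unsigned idx, unsigned cnt)
{
	return (idx < cnt);
}
```

If the status ring turned out to have exactly 1024 entries, an index of 1024 would fail this check and would mean a missed wrap. With a larger ring, 1024 is an ordinary index and the wrap theory would not hold.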
Also please keep me on the Cc since I'm not subscribed to the list,
though I will check the archives from time to time.
Thanks,
Curtis
reference:
http://lists.freebsd.org/pipermail/freebsd-stable/2013-November/075699.html