regression: msk0 watchdog timeout and interrupt storm

Curtis Villamizar curtis at ipv6.occnc.com
Wed Jan 1 01:53:26 UTC 2014


I'm getting an interrupt storm from mskc running with the latest
if_msk.c code.  The OS is built from source (259540):

FreeBSD 10.0-PRERELEASE (GENERIC) #0 r259540: Sat Dec 21 00:05:39 EST 2013

While this is not the latest revision, the point is that sys/dev/msk
is up to date wrt stable_9 and also wrt head.

The odd thing is that the machine seemed to run fine for a day or two
and then started exhibiting this behaviour and has become useless.

This is now highly reproducible (it happens within seconds when trying
to do a long file transfer between two machines with GbE) so if there
is anything I can do to instrument this, please make suggestions.

What I know so far is:

  1.  When the watchdog occurs, Y2_IS_STAT_BMU is set in the prior
      interrupt mask.

  2.  This would take us from msk_intr into msk_handle_events, with
      msk_handle_events returning 0.

  3.  msk_handle_events reads in sc->msk_stat_cons.  The last recorded
      value of sc->msk_stat_cons is always 1024.

  4.  The only way to exit msk_handle_events with sc->msk_stat_cons
      greater than zero without doing anything is to hit the
      conditional at the top of the loop and fall out:

      sd = &sc->msk_stat_ring[cons];
      control = le32toh(sd->msk_control);
      if ((control & HW_OWNER) == 0)
          break;

  5.  The code after the loop can return zero if the ring buffer
      pointer hasn't moved.  That code is:

      sc->msk_stat_cons = cons;
      bus_dmamap_sync(sc->msk_stat_tag, sc->msk_stat_map,
          BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE);

      if (rxput[MSK_PORT_A] > 0)
              msk_rxput(sc->msk_if[MSK_PORT_A]);
      if (rxput[MSK_PORT_B] > 0)
              msk_rxput(sc->msk_if[MSK_PORT_B]);

      return (sc->msk_stat_cons != CSR_READ_2(sc, STAT_PUT_IDX));

  6.  The interrupt is cleared only if the return value is zero; a
      nonzero return leaves it uncleared.  That was suspect.  The
      code in msk_intr is:

      domore = msk_handle_events(sc);
      if ((status & Y2_IS_STAT_BMU) != 0 && domore == 0)
              CSR_WRITE_4(sc, STAT_CTRL, SC_STAT_CLR_IRQ);

  7.  This code before the return in msk_handle_events should force
      the clear, but it doesn't fix anything:

      if ((control & HW_OWNER) == 0)
              return;
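
To make the control flow in the steps above easier to reason about,
here is a minimal userspace model of it.  This is only a sketch, not
driver code: the struct, the model_* function names, the ring size and
the HW_OWNER bit value are all made up for illustration, and put_idx
simply stands in for whatever CSR_READ_2(sc, STAT_PUT_IDX) returns.

```c
#include <assert.h>
#include <stdint.h>

#define RING_CNT   1024u        /* hypothetical ring size, not taken from the driver */
#define HW_OWNER   0x80000000u  /* hypothetical bit value, for the model only */

struct model_softc {
	uint32_t control[RING_CNT]; /* stands in for msk_stat_ring[].msk_control */
	unsigned cons;              /* stands in for sc->msk_stat_cons */
	unsigned put_idx;           /* stands in for CSR_READ_2(sc, STAT_PUT_IDX) */
	unsigned irq_clears;        /* counts modelled SC_STAT_CLR_IRQ writes */
};

/*
 * Model of msk_handle_events: consume entries while the hardware owns
 * them, then report whether the consumer index differs from put_idx.
 */
static int
model_handle_events(struct model_softc *sc)
{
	unsigned cons = sc->cons;

	assert(cons < RING_CNT);
	for (;;) {
		/* Top-of-loop conditional: nothing owned, fall out. */
		if ((sc->control[cons] & HW_OWNER) == 0)
			break;
		sc->control[cons] &= ~HW_OWNER;
		cons = (cons + 1) % RING_CNT;
	}
	sc->cons = cons;
	return (sc->cons != sc->put_idx);
}

/*
 * Model of the Y2_IS_STAT_BMU branch of msk_intr: the clear is only
 * issued when model_handle_events() returns zero.
 */
static int
model_intr(struct model_softc *sc)
{
	int domore = model_handle_events(sc);

	if (domore == 0)
		sc->irq_clears++;
	return (domore);
}
```

In this model, if the put index and the consumer index disagree while
the entry at the consumer index is not marked HW_OWNER, every call to
model_intr() falls out of the loop immediately, returns nonzero and
never counts a clear.  Whether the real hardware ever gets into that
state is exactly what I don't know.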

This looks like some sort of falling off the end of a ring buffer
problem (since it always points to entry 0x400), but since I haven't
done driver work in ages, that is mostly just a wild guess and I
really have no idea yet as to what is going wrong.
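
For what the "fall off the end" guess would mean in practice, here is a
small sketch of an index check that could be used as instrumentation.
The ring size below is an assumption made purely for illustration (the
real count would have to come from the driver), and the helper names
are hypothetical.  With a ring of 1024 entries the valid slots are 0
through 1023, so an index of 0x400 would be one past the end.

```c
#include <assert.h>

#define STAT_RING_CNT 1024u  /* hypothetical; the real count comes from the driver */

/* Returns 1 if idx is a valid slot in a ring of cnt entries, else 0. */
static int
ring_index_valid(unsigned idx, unsigned cnt)
{
	return (idx < cnt);
}

/* Advance an index by one slot, wrapping to 0 at the end of the ring. */
static unsigned
ring_next(unsigned idx, unsigned cnt)
{
	assert(cnt > 0);
	return ((idx + 1) % cnt);
}
```

If the real ring is larger than 1024 entries, index 0x400 is perfectly
valid and this guess is wrong, so the first thing to check is the
actual ring size.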

Also please keep me on the Cc since I'm not subscribed to the list,
though I will check the archives from time to time.

Thanks,

Curtis


reference:
http://lists.freebsd.org/pipermail/freebsd-stable/2013-November/075699.html

