8-STABLE freezes on UDP traffic (DNS), 7.x doesn't

Pyun YongHyeon pyunyh at gmail.com
Mon Mar 29 19:42:44 UTC 2010


On Mon, Mar 29, 2010 at 09:21:42PM +0200, Attila Nagy wrote:
> Pyun YongHyeon wrote:
> > On Mon, Mar 29, 2010 at 12:57:59PM +0200, Attila Nagy wrote:
> >   
> >> Hi,
> >>
> >> Michael Loftis wrote:
> >>     
> >>> --On Thursday, March 25, 2010 3:22 PM +0100 Attila Nagy <bra at fsn.hu>
> >>> wrote:
> >>>
> >>> <...>
> >>>       
> >>>> Both unbound and python accept DNS requests, and it seems that when
> >>>> interrupt load is 25%, only unbound is in the *udp state; when it is
> >>>> 50%, both programs are in that state.
> >>>>         
> >>> Try turning off hardware TSO/checksum offload if it's available on your
> >>> chipset?  ifconfig <interface> -rxcsum -txcsum -tso -- I'm only using
> >>> nfe chips right now, but w/ the TSO/CSUM on they lock up constantly
> >>> under high load.  We're pretty sure it's mostly the nfe driver, or the
> >>> chips themselves, but have never ruled out some generic 8.x hardware
> >>> offload issues.
> >>>       
> >> Bingo, this solved the problem. The current uptime nears four days.
> >> Previously I couldn't go further than a day.
> >>
> >> The machine gets very light TCP load (and other machines which get work
> >> well), so I guess it's UDP RX or TX checksum related.
> >>
> >>     
> >
> > Hmm, this is an unexpected result. Since you're using UDP, TSO is not
> > involved in this issue. Since you disabled RX/TX checksum
> > offloading, could you check how many 'bad checksum' and
> > 'no checksum' packets netstat(1) reports?
> > To narrow down which side of checksum offloading causes the issue,
> > would you disable just one side at a time? For instance, disable TX
> > checksum offloading with RX checksum offloading enabled and see how
> > bce(4) works.
> > #ifconfig bce0 -txcsum rxcsum
> > If that shows the same issue, try disabling RX checksum offloading
> > but enabling TX checksum offloading.
> > #ifconfig bce0 txcsum -rxcsum
> >   
> It's interesting. During the day, I've disabled only HW checksumming and
> left TSO enabled. It couldn't run more than a few hours.
> I have disabled tso again to see what happens.
> 
> BTW, of course there is TCP traffic on that interface (DNS is also
> available on TCP), maybe this causes the problem.

The only guess I can think of at the moment is incorrect use of
bus_dma(9) in the TX path, but I'm not sure this is related to the
issue you're seeing. Would you try the experimental patch at the
following URL?
http://people.freebsd.org/~yongari/bce/bce.20100305.diff
Please make sure to back up your old bce(4) driver before applying
the patch. I didn't see anything abnormal in testing, but it
wasn't stressed much.
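
For the checksum counters asked about in the quoted text above, a minimal
sketch of how the 'bad checksum' and 'no checksum' lines could be pulled
out of netstat(1) output. The helper names csum_lines and show_csum are
made up for illustration, and bce0 is simply the interface named in this
thread; adjust both as needed.

```shell
#!/bin/sh
# Sketch only: csum_lines and show_csum are hypothetical helper names.

# Keep only the checksum-related counter lines from `netstat -s` output
# supplied on stdin.
csum_lines() {
    grep -i 'checksum'
}

# Show the checksum counters for one protocol (udp, tcp, ip, ...).
show_csum() {
    netstat -s -p "$1" | csum_lines
}

# Example session (run as root, one offload direction at a time):
#   show_csum udp                  # note the counters before the change
#   ifconfig bce0 -txcsum rxcsum   # TX offload off, RX offload on
#   show_csum udp                  # compare after some traffic
#   ifconfig bce0 txcsum -rxcsum   # TX offload on, RX offload off
```

Comparing the counters before and after each change should show whether
the software checksum path is seeing bad or missing checksums once the
corresponding offload is disabled.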

