zonelimit issues...

Mon Apr 21 07:44:24 UTC 2008

At Sun, 20 Apr 2008 09:53:49 -0700,
Chris Pratt wrote:
> 
> 
> On Apr 20, 2008, at 2:43 AM, Robert Watson wrote:
> 
> >
> > On Fri, 18 Apr 2008, Chris Pratt wrote:
> >
> >> Doesn't 7.0 fix this? I'd like to see an official definitive  
> >> answer and all I've been going on is that the problem description  
> >> is no longer in the errata.
> >
> > Unfortunately, bugs of this sort don't really "work" that way --  
> > specific bugs are a property of a problem in code (or a problem in  
> > design), but what we have right now is a report of a symptom that  
> > might reflect zero or more specific bugs.  It's unclear that the  
> > problem described in errata is the problem you've been  
> > experiencing, or that the (at least one) fixed bug with the same  
> > symptoms is the one you've been experiencing.  For better or
> > worse, the only way to really tell if a generic class of hang or
> > wedging is fixed is to try out the new version and see.  In most
> > cases, "zonelimit" wedging reflects one of two things:
> >
> > (1) Inadequate resource allocation to the network stack or some other
> >     component; try tuning up the memory tunable for clusters (for
> >     example).
> >
> For several months I did quite a bit of tuning. I never increased
> nmbclusters beyond the 32768 shown in the docs because man
> tuning doesn't define its use of "arbitrarily high". Inability to boot
> could mean travel. Kris Kennaway had provided instructions to
> get a dump. I set up for that but have never had a dump. The
> only respite came from adding another circuit, another NIC and
> spreading traffic. That stretched the interval between lockups
> from every couple of days during the heavy bot period of late
> 2006 to about a month now, or even two months during
> traditionally slow months. For example, we ran a record 72 days
> last summer. It was a very dead summer traffic-wise.
> 
> I will try to increase the nmbclusters dramatically if I can figure
> out what a safe top limit is but it sounds like the jump to
> 7.0 RELEASE may be worth the effort. I would want to wait
> until this issue with TCP, Windows and certain routers is well
> past. I had not seen that applied to 7_0_0 yet and that would be
> a show stopper. Is there a way to know what is safe for
> nmbclusters given an 8GB RAM system?

On "big" systems I am currently using 65000, and that seems safe so
far.  This is on an 8 core (2P) Xeon box with 8G of RAM.
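
To put those numbers in perspective, here is a rough back-of-the-envelope
sketch of the worst-case memory behind a given nmbclusters value. It
assumes the standard 2048-byte cluster size (MCLBYTES) and that every
cluster is allocated at once; it ignores the mbuf headers themselves and
any kernel address space limits, so treat it as an estimate rather than a
safe upper bound. The function names are mine, for illustration only.

```python
MCLBYTES = 2048  # assumed size in bytes of one standard mbuf cluster


def cluster_memory_bytes(nmbclusters, cluster_size=MCLBYTES):
    """Worst-case bytes consumed if every cluster is allocated."""
    return nmbclusters * cluster_size


def fraction_of_ram(nmbclusters, ram_bytes, cluster_size=MCLBYTES):
    """Share of physical RAM that a fully used cluster pool would take."""
    return cluster_memory_bytes(nmbclusters, cluster_size) / ram_bytes


if __name__ == "__main__":
    ram = 8 * 1024**3  # the 8G box discussed above
    for n in (32768, 65000):
        mib = cluster_memory_bytes(n) / 1024**2
        pct = 100 * fraction_of_ram(n, ram)
        print(f"{n} clusters: {mib:.1f} MiB worst case, {pct:.2f}% of RAM")
```

By this estimate, 65000 clusters is on the order of 127 MiB, a small
fraction of 8G of RAM.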

> I did vmstat data collection for a couple of months when things
> were at their worst. The results were nebulous to me based
> on lack of code knowledge. All I actually found was that a
> certain counter would drop to 0 and never recover. I didn't
> know if it was meaningful and received no replies when I
> asked FreeBSD-Questions. It was 128-Bucket or something
> like that.
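
If you still have that data, or collect it again, something like the
sketch below might help pick out zones that sit at zero across several
samples. It assumes the samples are saved `vmstat -z` output with a
header line followed by "name: v1, v2, ..." rows. The SAMPLE text is made
up for illustration, and which column actually dropped to zero is not
stated above, so checking FREE is only a guess. A single zero reading is
normal for many zones; it is the zero that never recovers that is
interesting.

```python
def parse_vmstat_z(text):
    """Parse one saved `vmstat -z` sample into {zone: {column: value}}."""
    zones = {}
    columns = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if columns is None:
            # First non-blank line is assumed to be the header; drop ITEM.
            columns = line.split()[1:]
            continue
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a zone row
        values = [int(v) for v in rest.split(",") if v.strip()]
        zones[name.strip()] = dict(zip(columns, values))
    return zones


def stuck_at_zero(snapshots, column="FREE", last=3):
    """Zones whose `column` is 0 in each of the `last` most recent samples."""
    recent = snapshots[-last:]
    if not recent:
        return []
    return sorted(
        name for name in recent[0]
        if all(s.get(name, {}).get(column) == 0 for s in recent)
    )


# Hypothetical sample; the numbers are invented, not taken from a real box.
SAMPLE = """\
ITEM            SIZE     LIMIT      USED      FREE  REQUESTS

128 Bucket:      1048,        0,      900,        0,    52000
mbuf_cluster:    2048,    32768,    30000,      120,  9000000
"""

if __name__ == "__main__":
    snap = parse_vmstat_z(SAMPLE)
    print(stuck_at_zero([snap, snap, snap]))
```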
> 
> > (2) A memory leak in a network device driver or other network part,
> >     which needs to be debugged and fixed.
> >
> 
> Initially I thought there may be something related to the bge
> driver and moved the high traffic apps onto an em. This didn't
> seem to help much, nor did polling.
> 
> I am most willing to collect data if I could figure out how to
> collect something meaningful. I gather from what you say,
> that 7.0 would provide this.
> 
> I really appreciate both of your responses. Just based on
> this one problem, 6.x has been a bad experience after
> years of seemingly impossible uptime on 4 and 5.x
> FreeBSD.

Well, there are plenty of us motivated to get at these issues.  Can you
do me a favor and characterize your traffic a bit?  Is it mostly TCP,
or heavily UDP or some sort of mix?  The issues I see are UDP based,
which is less surprising as UDP has no backpressure and it is easy to
overcommit the system by upping the socket buffer space allocated
without upping the number of clusters to compensate.
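
As a rough illustration of that overcommit, here is a small sketch that
compares the clusters a set of full UDP receive buffers could demand with
the configured nmbclusters. It assumes 2048-byte clusters that are filled
completely and ignores per-mbuf overhead, so real demand with small
datagrams would be higher. The socket count and buffer size in the
example are hypothetical, not measurements from any system.

```python
import math

MCLBYTES = 2048  # assumed size in bytes of one standard mbuf cluster


def clusters_needed(num_sockets, rcvbuf_bytes, cluster_size=MCLBYTES):
    """Clusters required if every socket's receive buffer fills up."""
    per_socket = math.ceil(rcvbuf_bytes / cluster_size)
    return num_sockets * per_socket


def is_overcommitted(num_sockets, rcvbuf_bytes, nmbclusters):
    """True when full socket buffers could exceed the cluster pool."""
    return clusters_needed(num_sockets, rcvbuf_bytes) > nmbclusters


if __name__ == "__main__":
    # Hypothetical load: 500 UDP sockets with 256 KiB receive buffers.
    sockets, rcvbuf = 500, 256 * 1024
    for limit in (32768, 65000):
        need = clusters_needed(sockets, rcvbuf)
        print(f"nmbclusters={limit}: need {need}, "
              f"overcommitted={is_overcommitted(sockets, rcvbuf, limit)}")
```

The point is only that socket buffer space and cluster count have to be
raised together; the exact accounting in the kernel is more involved.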

Best,
George