resource leak

Thu Oct 2 07:40:55 UTC 2008

On Wed, 2008-10-01 at 16:36 -0400, Stephen Clark wrote: 
> Robert Watson wrote:
> > On Wed, 1 Oct 2008, Gary Palmer wrote:
> > 
> >> "ps alxw" may be of interest in addition to "ps auxw" as it displays 
> >> what the processes are waiting on.  It could conceivably be a problem 
> >> of some kind at the filesystem level.  I've seen situations before 
> >> where a problem escalates to the point where "ls /" hangs, and at that 
> >> point you're stuck with an unresponsive box.
> > 
> > If you want an even greater level of detail than ps -l, you can use 
> > procstat -k to generate kernel stack traces for all user/kernel 
> > threads.  Wait channels are very useful, but they only tell you what the 
> > code that invoked the wait thinks it is for, not how that code was 
> > reached.  A classic example is waiting on an exhausted UMA zone -- you 
> > get a uma wait channel, but no indication of what subsystem performed 
> > the memory allocation...  This requires FreeBSD 7.1 or higher, 
> > however.  (Obviously, the same can be done easily using DDB, but that's 
> > hard on a box without a serial console, and requires interrupting the 
> > flow of the operating system, compiling with DDB, etc).
> > 
> > Robert N M Watson
> > Computer Laboratory
> > University of Cambridge
> > 
> A big part of the problem is that this seems to take about 100 days of uptime to occur. 
> We have some in-house test boxes but have never seen the problem, probably 
> because none of them have been up more than about 45 days. The units in the 
> field, of which there are about 300, are headless and none are physically close.
> 
> When the boxes are rebooted there are no error messages in any of the log files,
> only the absence of information that would normally be logged by new processes 
> that would be spawned. We are getting ready to install a patch that will try to
> gather more information.
> 
> I thought about writing an app that would try to fork a child periodically and 
> record in a log file if there was an error. But EAGAIN is nonspecific as to the 
> real reason the fork failed. I was looking for some way to periodically log the
> resources that would cause the fork failure.
> 
> procstat -k looks like it would have been a good candidate but unfortunately we
> are running 6.1.
> 
> Thanks for the response.
> Steve
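
For what it's worth, the periodic fork probe Steve describes could be
sketched roughly as below. This is only an illustration, not something
that has been run on the affected boxes: it assumes a Python 3
interpreter is available on the unit, and the function names, log path
and interval are all made up for the example. It records the errno from
a failed fork alongside a few values the standard library can read
without spawning anything (RLIMIT_NPROC, load average, child memory
usage), so the log can at least be lined up against the failure.

```python
import errno
import os
import resource
import time


def snapshot():
    """Collect a few values that can be read without forking a helper."""
    soft_nproc, hard_nproc = resource.getrlimit(resource.RLIMIT_NPROC)
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    return {
        "rlimit_nproc_soft": soft_nproc,
        "rlimit_nproc_hard": hard_nproc,
        "loadavg_1min": os.getloadavg()[0],
        "children_maxrss": usage.ru_maxrss,
    }


def probe_fork():
    """Try one fork; return (True, None) or (False, errno name)."""
    try:
        pid = os.fork()
    except OSError as exc:
        return False, errno.errorcode.get(exc.errno, str(exc.errno))
    if pid == 0:
        os._exit(0)  # child: exit at once, skipping cleanup handlers
    os.waitpid(pid, 0)  # parent: reap the child so no zombie is left
    return True, None


def run(log_path="/var/log/forkprobe.log", interval=300, count=None):
    """Append one line per probe; log_path and interval are placeholders.

    count=None loops forever; a number stops after that many probes.
    """
    done = 0
    while count is None or done < count:
        ok, err = probe_fork()
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        line = "%s fork_ok=%s errno=%s %r\n" % (stamp, ok, err, snapshot())
        with open(log_path, "a") as log:
            log.write(line)
        done += 1
        if count is None or done < count:
            time.sleep(interval)
```

The probe deliberately avoids shelling out to ps or similar, since that
would itself need a working fork. It will not say which kernel resource
ran out, but a timestamped run of failures next to the snapshot values
may narrow things down.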

I have a VIA EPIA-based system that was rebooting and not leaving behind
any diagnosable evidence that I could find. Attaching a serial console
revealed a kernel trap that was double-faulting as it went to write the
kernel dump. I've not yet had the opportunity to investigate further
except that out of desperation I threw in an additional 64M of RAM - all
I had to hand - adding to its 256M and I haven't seen it fault again in
the 37 days since (it would often stay up for less than a day before
that).

I wonder whether it would be worth your while running a bench unit with
limited RAM, either physically or via the hw.physmem tunable. I would
probably try to identify the amount of RAM that just allows it to run
"normally", ideally subjecting it to a typical workload if possible. If
it bombs after running for a reasonable length of time, add back a
fraction of the unused memory and see if it then stays up proportionally
longer, which could be indicative of a memory starvation issue.

If you can get it to bomb in the above scenario then you can probably
get some insight into where it's failing. Having said that, I should
point out that I've not actually used the above technique so I may well
be overlooking something which might prevent it from being useful or
indeed from "working" at all.

Wayne