resource leak

Wed Oct 1 20:36:51 UTC 2008

Robert Watson wrote:
> On Wed, 1 Oct 2008, Gary Palmer wrote:
> 
>>> Periodically logging "ps -auxw" output to a file would be useful, as 
>>> ideally you'd gradually see the list get longer and longer over time; 
>>> it's possible you have many zombie processes as a result of a parent 
>>> which is not reaping its children (calling waitpid(2) or its friends).
>>
>> "ps alxw" may be of interest in addition to "ps auxw" as it displays 
>> what the processes are waiting on.  It could conceivably be a problem 
>> of some kind at the filesystem level.  I've seen situations before 
>> where a problem escalates to the point where "ls /" hangs, and at that 
>> point you're stuck with an unresponsive box.
> 
> If you want an even greater level of detail than ps -l, you can use 
> procstat -k to generate kernel stack traces for all user/kernel 
> threads.  Wait channels are very useful, but they only tell you what the 
> code that invoked the wait thinks it is for, not how that code was 
> reached.  A classic example is waiting on an exhausted UMA zone -- you 
> get a uma wait channel, but no indication of what subsystem performed 
> the memory allocation...  This requires FreeBSD 7.1 or higher, 
> however.  (Obviously, the same can be done easily using DDB, but that's 
> hard on a box without a serial console, and requires interrupting the 
> flow of the operating system, compiling with DDB, etc).
> 
> Robert N M Watson
> Computer Laboratory
> University of Cambridge
> 
A big part of the problem is that this seems to take about 100 days of uptime
to occur. We have some in-house test boxes but have never seen the problem,
probably because none of them have been up more than about 45 days. The units
in the field, of which there are about 300, are headless and none are
physically close.

When the boxes are rebooted there are no error messages in any of the log files,
only the absence of the information that newly spawned processes would normally
log. We are getting ready to install a patch that will try to gather more
information.

I thought about writing an app that would try to fork a child periodically and 
record in a log file if there was an error. But EAGAIN is nonspecific as to the 
real reason the fork failed. I was looking for some way to periodically log the
resources whose exhaustion would cause the fork failure.
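
A minimal sketch of such a probe follows, assuming a Python 3 interpreter is
available on the boxes (nothing above says it is). The names probe_fork,
snapshot and log_once and the log path are hypothetical. The sysctl names are
the FreeBSD process and file limits. Once fork is actually failing, ps and
sysctl cannot be spawned either, so those fields just record the error; the
samples taken in the weeks before the failure are what would show a trend.

```python
import errno
import os
import resource
import subprocess
import time

# FreeBSD-specific sysctl names; on other systems they are reported as unavailable.
SYSCTLS = ("kern.maxproc", "kern.maxprocperuid", "kern.openfiles", "kern.maxfiles")


def probe_fork():
    """Fork one short-lived child; return (ok, errno_name)."""
    try:
        pid = os.fork()
    except OSError as exc:
        return False, errno.errorcode.get(exc.errno, str(exc.errno))
    if pid == 0:
        os._exit(0)  # child: exit at once, skipping cleanup handlers
    os.waitpid(pid, 0)  # parent: reap the child so the probe leaves no zombie
    return True, None


def snapshot():
    """Collect limits and counts that commonly explain a fork() failure."""
    info = {"rlimit_nproc": resource.getrlimit(resource.RLIMIT_NPROC)}
    try:
        out = subprocess.check_output(
            ["ps", "-axo", "stat"], stderr=subprocess.DEVNULL
        ).decode()
        stats = out.split()[1:]  # drop the STAT header
        info["procs"] = len(stats)
        info["zombies"] = sum(1 for s in stats if s.startswith("Z"))
    except (OSError, subprocess.CalledProcessError) as exc:
        info["ps_error"] = str(exc)
    for name in SYSCTLS:
        try:
            value = subprocess.check_output(
                ["sysctl", "-n", name], stderr=subprocess.DEVNULL
            )
            info[name] = value.decode().strip()
        except (OSError, subprocess.CalledProcessError):
            info[name] = "unavailable"
    return info


def log_once(path):
    """Append one timestamped probe result and snapshot to the log file."""
    ok, err = probe_fork()
    stamp = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
    with open(path, "a") as fh:
        fh.write("%s fork_ok=%s errno=%s %r\n" % (stamp, ok, err, snapshot()))
    return ok


if __name__ == "__main__":
    log_once("/var/log/forkprobe.log")  # hypothetical path
```

Run from cron every few minutes, this would show whether the process count or
the zombie count creeps toward kern.maxprocperuid or kern.maxproc over the
100 days.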

procstat -k looks like it would have been a good candidate, but unfortunately
we are running 6.1.

Thanks for the response.
Steve

-- 

"They that give up essential liberty to obtain temporary safety,
deserve neither liberty nor safety."  (Ben Franklin)

"The course of history shows that as a government grows, liberty
decreases."  (Thomas Jefferson)