resource leak

Wed Oct 1 13:35:06 UTC 2008

Jeremy Chadwick wrote:
> On Wed, Oct 01, 2008 at 08:30:26AM -0400, Stephen Clark wrote:
>> Jeremy Chadwick wrote:
>>> On Wed, Oct 01, 2008 at 07:41:56AM -0400, Stephen Clark wrote:
>>>> Hello List,
>>>>
>>>> I am running into a strange problem that points to a resource leak. 
>>>> The problem manifests itself after one of our remote systems has been 
>>>> up around 100 days.
>>>> The symptom is that it appears no new processes can be spawned. If I try to
>>>> ssh to the unit, I can see the 3-way tcp handshake and then no more traffic.
>>>> Examining log files (cron, etc.) shows that when this happens no more entries
>>>> are written into the cron log. The unit is acting as a firewall,
>>>> router and VPN appliance, and these functions continue to work. We have a
>>>> C application, periodically started from a shell script, that
>>>> reports various information about the system. It stops reporting,
>>>> while VPNs, OSPF routing, and ipfilter firewalling continue to work
>>>> and write into their logfiles.
>>>>
>>>> My question is how do I monitor the various resources in the system that could
>>>> prevent the spawning of a new process?
>>> Periodically logging "ps -auxw" output to a file would be useful, as
>>> ideally you'd gradually see the list get longer and longer over time;
>>> it's possible you have many zombie processes as a result of a parent
>>> which is not reaping its children (calling waitpid(2) or its friends).
>>>
>>> Other things that might come in useful are "fstat" and "vmstat -s".
>>>
>>> It sounds like your C program relies heavily on system() or execl() and
>>> fork(), which is why it's affected -- while the other programs are
>>> likely kernel-level.
>>>
>> Thanks Jeremy,
>>
>> I have added those commands to a periodic daily script.
>>
>> Another thing I have noticed is that quite often the problem seems to
>> start at 2am, right when the periodic daily script runs.
>>
>> But I think it is a coincidence: we have reached the edge of the
>> resource limit, and all the jobs that get spawned by the periodic daily
>> scripts push us over the limit.
>>
>> The other thing is that, having logged into some of the systems that have
>> been up in the 80 day range, I don't see many (if any) zombies. I just wonder
>> if it is an fd leak; the fstat output should point that out.
> 
> You might find the below thread beneficial -- an individual came to the
> lists stating that they were running out of fds as a result of some
> Java software running amok on their systems.
> 
> http://lists.freebsd.org/pipermail/freebsd-stable/2008-September/thread.html#45383
> http://lists.freebsd.org/pipermail/freebsd-stable/2008-September/045383.html
> 
Thanks. After reading the thread, though: is there a single place in the kernel
that reports how many fds are currently in use? Does the "no more fds" message
get logged in /var/log/messages or only in the kernel log buffer? I
haven't seen that message in the messages file, and since we are forced to have a
remote user reboot the box, the kernel buffer is gone.
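
For the first question, one candidate is the pair of FreeBSD sysctls
kern.openfiles (system-wide open files) and kern.maxfiles (the system-wide
limit). Below is a minimal sketch, not a tested monitoring tool, that could be
run from cron so the numbers land in a file that survives a reboot. It assumes
"sysctl -n kern.openfiles kern.maxfiles" prints one integer per line; the
function names (parse_fd_usage, read_fd_usage, format_report) are made up for
illustration.

```python
import subprocess


def parse_fd_usage(sysctl_output):
    """Parse the output of `sysctl -n kern.openfiles kern.maxfiles`.

    Assumes two whitespace-separated integers: open files, then the limit.
    Returns (open_files, max_files, fraction_in_use).
    """
    values = [int(token) for token in sysctl_output.split()]
    if len(values) != 2:
        raise ValueError("expected two integers, got: %r" % sysctl_output)
    open_files, max_files = values
    return open_files, max_files, open_files / max_files


def read_fd_usage():
    """Run sysctl (FreeBSD sysctl names assumed) and parse its output."""
    result = subprocess.run(
        ["sysctl", "-n", "kern.openfiles", "kern.maxfiles"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_fd_usage(result.stdout)


def format_report(open_files, max_files, ratio):
    """Build a one-line summary suitable for appending to a log file."""
    return f"open fds: {open_files}/{max_files} ({ratio:.1%})"


if __name__ == "__main__":
    print(format_report(*read_fd_usage()))
```

Logging this line every few minutes, next to the "fstat" and "ps" output
already being collected, should show whether the open-file count creeps toward
the limit over the weeks before a box wedges.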

Steve

-- 

"They that give up essential liberty to obtain temporary safety,
deserve neither liberty nor safety."  (Ben Franklin)

"The course of history shows that as a government grows, liberty
decreases."  (Thomas Jefferson)