ws at au.dyndns.ws
Thu Oct 2 07:40:55 UTC 2008
On Wed, 2008-10-01 at 16:36 -0400, Stephen Clark wrote:
> Robert Watson wrote:
> > On Wed, 1 Oct 2008, Gary Palmer wrote:
> >> "ps alxw" may be of interest in addition to "ps auxw" as it displays
> >> what the processes are waiting on. It could conceivably be a problem
> >> of some kind at the filesystem level. I've seen situations before
> >> where a problem escalates to the point where "ls /" hangs, and at that
> >> point you're stuck with an unresponsive box.
> > If you want an even greater level of detail than ps -l, you can use
> > procstat -k to generate kernel stack traces for all user/kernel
> > threads. Wait channels are very useful, but they only tell you what the
> > code that invoked the wait thinks it is for, not how that code was
> > reached. A classic example is waiting on an exhausted UMA zone -- you
> > get a uma wait channel, but no indication of what subsystem performed
> > the memory allocation... This requires FreeBSD 7.1 or higher,
> > however. (Obviously, the same can be done easily using DDB, but that's
> > hard on a box without a serial console, and requires interrupting the
> > flow of the operating system, compiling with DDB, etc).
> > Robert N M Watson
> > Computer Laboratory
> > University of Cambridge
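For anyone wanting to watch wait channels over time rather than eyeball them, here is a rough sketch that tallies how many processes sit on each wait channel. It assumes the text comes from something like "ps -axo pid,wchan,command" (header line first, "-" meaning not waiting); the function name is made up for illustration and nothing here is specific to any FreeBSD release.

```python
from collections import Counter


def tally_wait_channels(ps_output):
    """Count processes per wait channel.

    ps_output is assumed to be the text of `ps -axo pid,wchan,command`:
    one header line, then PID, wait channel, and command on each line.
    """
    counts = Counter()
    lines = ps_output.strip().splitlines()
    for line in lines[1:]:               # skip the header line
        fields = line.split(None, 2)     # command may contain spaces
        if len(fields) < 3:
            continue                     # malformed or truncated line
        _pid, wchan, _command = fields
        if wchan == "-":
            continue                     # process is not waiting on anything
        counts[wchan] += 1
    return counts
```

Logging the result every few minutes would show whether one channel keeps accumulating sleepers as the box approaches the failure, though as noted above it still would not say how the code got there.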
> A big part of the problem is that this seems to take about 100 days of uptime to occur.
> We have some inhouse test boxes but have never seen the problem, probably
> because none of them have been up more than about 45 days. The units in the
> field, of which there are about 300, are headless and none are physically close.
> When the boxes are rebooted there are no error messages in any of the log files,
> only the absence of information that would normally be logged by new processes
> that would be spawned. We are getting ready to install a patch that will try to
> gather more information.
> I thought about writing an app that would try to fork a child periodically and
> record in a log file if there was an error. But EAGAIN is nonspecific as to the
> real reason the fork failed. I was looking for some way to periodically log the
> resources that would cause the fork failure.
> procstat -k looks like it would have been a good candidate but unfortunately we
> are running 6.1.
> Thanks for the response.
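The fork probe described above could look roughly like the sketch below. It is only an illustration under assumptions: Python is available on the unit, and the function names and log format are invented here. Each run tries one fork, records the errno name on failure, and logs the per-user process limit alongside it. The snapshot deliberately avoids spawning helpers, since those would fail for the same reason the probe does.

```python
import errno
import os
import resource
import time


def probe_fork():
    """Try a single fork. Returns (ok, errno_name_or_None)."""
    try:
        pid = os.fork()
    except OSError as exc:
        return False, errno.errorcode.get(exc.errno, str(exc.errno))
    if pid == 0:
        os._exit(0)            # child: exit at once, it only proves fork works
    os.waitpid(pid, 0)         # parent: reap the child so no zombie is left
    return True, None


def snapshot():
    """Read limits relevant to fork without starting any new process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return {"nproc_soft": soft, "nproc_hard": hard, "maxrss": usage.ru_maxrss}


def log_once(path):
    """Append one probe result to the log file at path; return True if fork worked."""
    ok, err = probe_fork()
    stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
    line = "%s fork=%s errno=%s %r\n" % (stamp, "ok" if ok else "FAILED", err, snapshot())
    with open(path, "a") as fh:
        fh.write(line)
    return ok
```

Calling log_once() from a long-running loop (rather than from cron, which itself has to fork) keeps the probe alive when new processes can no longer be created. System-wide values such as the kern.maxproc sysctl are not covered and would need a separate reading.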
I have a VIA EPIA-based system that was rebooting and not leaving behind
any diagnosable evidence that I could find. Attaching a serial console
revealed a kernel trap that was double-faulting as it went to write the
kernel dump. I've not yet had the opportunity to investigate further
except that out of desperation I threw in an additional 64M of RAM - all
I had to hand - adding to its 256M and I haven't seen it fault again in
the 37 days since (it would often stay up for less than a day before
rebooting).
I wonder whether it would be worth your while running a bench unit with
limited RAM, either physically or via the hw.physmem tunable. I would
probably try to identify the amount of RAM that just allows it to run
"normally", ideally subjecting it to a typical workload if possible. If
it bombs after running for a reasonable length of time, add back a
fraction of the unused memory and see if it then stays up proportionally
longer, which could indicate a memory starvation issue.
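To make the "proportionally longer" check concrete, here is a small sketch. The function names and the tolerance are my own invention, and it assumes you have recorded (RAM in MB, hours until failure) for each bench run, with RAM capped via hw.physmem in /boot/loader.conf or by pulling DIMMs. It computes the memory consumption rate implied by each pair of consecutive runs. If those rates roughly agree, a steady leak or starvation is at least plausible.

```python
def implied_leak_rates(trials):
    """trials: list of (ram_mb, uptime_hours) tuples, one per bench run.

    Returns the MB-per-hour consumption implied by each consecutive pair
    of runs, ordered by RAM size. Pairs where extra RAM bought no extra
    uptime are skipped, since they do not fit the starvation picture.
    """
    ordered = sorted(trials)
    rates = []
    for (ram_a, up_a), (ram_b, up_b) in zip(ordered, ordered[1:]):
        extra_ram = ram_b - ram_a
        extra_uptime = up_b - up_a
        if extra_ram <= 0 or extra_uptime <= 0:
            continue
        rates.append(extra_ram / float(extra_uptime))
    return rates


def looks_steady(rates, tolerance=0.25):
    """True if at least two rates exist and all lie within tolerance of their mean."""
    if len(rates) < 2:
        return False
    mean = sum(rates) / len(rates)
    return all(abs(r - mean) <= tolerance * mean for r in rates)
```

Three or more runs are needed before looks_steady() can say anything, and a False result does not rule out a memory problem; it only means uptime did not scale with the RAM added.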
If you can get it to bomb in the above scenario then you can probably
get some insight into where it's failing. Having said that, I should
point out that I've not actually used the above technique so I may well
be overlooking something which might prevent it from being useful or
indeed from "working" at all.
More information about the freebsd-stable mailing list