Strange behaviour

Wed Jan 21 10:40:17 PST 2004

On Wed, Jan 21, 2004 at 10:27:30AM -0800, Kris Kennaway wrote:
> On Wed, Jan 21, 2004 at 12:28:27PM -0500, Robin P. Blanchard wrote:
> > I have one -CURRENT client:
> > CPU: Intel(R) Xeon(TM) CPU 2.40GHz (2392.25-MHz 686-class CPU)
> >   Origin = "GenuineIntel"  Id = 0xf27  Stepping = 7
> >   Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
> >   Hyperthreading: 2 logical CPUs
> > real memory  = 1073610752 (1023 MB)
> > avail memory = 1045266432 (996 MB)
> > ACPI APIC Table: <DELL   PE2650  >
> > FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
> > 
> > which, when installing a new world (via nfs), consistently hangs at the end
> > with:
> > 
> > --------------------------------------------------------------
> > >>> Rebuilding man page indices
> > --------------------------------------------------------------
> > cd /usr/src/share/man; make makedb
> > makewhatis /usr/share/man
> > 
> > 
> > The box is usable at this point, however. I have been simply rebooting the
> > machine, and then running the above commands by hand after the reboot. While
> > 'installworld' is hung (at the end, as above), 'top' shows this:
> > 
> > 19107 root      -4    0   992K   896K getblk 1   0:01  0.00%  0.00% makewhatis
> 
> How long has it been "hung" for?  If you have a slow network you might
> be killing it while it is doing work.
> 
> Do you have rpc.lockd and statd running on both client and server?

I have the same machine (Dell 2650) and it's getting locked up in
a very similar way.  You don't need to get NFS involved to have
processes get locked up in getblk.  I'm slowly trying to remove
variables, but so far it seems like network activity of some sort
helps cause the lockup.  The easiest way to make it lock up was
doing backups over the network.  But find processes started by the
nightly cron jobs can get locked in getblk as well (while no NFS
partitions are mounted, but things like cvsup updates of a local
repo are happening).  Once things start to get locked up like this
the system slowly degrades.  I can usually ssh in and reboot it if
I catch it soon enough; if I leave it for a couple of days it will
seem like it's up (rwhod is running) but ssh-ing in won't work.

sledge (amd64 machine in the cluster) was showing similar symptoms
this morning.  It had failed its nightly rebuild/reboot, and
things like mtree commands had been wedged for a day or two.

The Dell I have here is not really in production at all, so if my
doing anything here will help, I'm game...

-- 
						Ken Smith
- From there to here, from here to      |       kensmith at cse.buffalo.edu
  there, funny things are everywhere.   |
                      - Theodore Geisel |