Server overloaded? Or is it a bug?

Tue Jun 10 06:25:29 PDT 2003

On Thursday 05 June 2003 20:19, Robert Watson wrote:
> So this tells us that interrupt delivery appears to be working fine for
> your NIC, that the network stack isn't completely hosed, and can allocate
> packet buffers (mbufs), so isn't memory-starved at that level of the
> system.

> Sockets are used only for locally terminated connections, and come out of
> a separate memory pool from packet buffers (well, it's a little more
> complicated than that, but that's enough to get the picture).  The reason
> I wondered about this was that one of the classes of possible memory
> starvation is to reach the allocation limit on sockets.  We allocate the
> socket (and TCP state) a couple of packets into the TCP setup, so if the
> TCP setup got partway completed and then there was no further response,
> we'd have a possible explanation.
>
> Since the connection completes, it's probably safe to assume the TCP state
> and socket were fully allocated, and the socket was returned by the kernel
> to the application, or at least, the kernel got pretty much to the point
> of returning it to the application.

> Try using "slogin -v" or "ssh -v" on the client, and paste the results
> into an e-mail in response to this one.  The SSH daemon does a lot of work
> to set up a new connection -- it forks a process or two, does name
> lookups, allocates pseudo-terminals, invokes PAM, and all kinds of other
> things.  There are failure modes for each of these, and a bit more detail
> might let us track it down.  Particularly useful might be the results of
> "slogin -v" both when the machine is operating normally, and when it's
> hosed.  This will let us figure out about when during the process
> something failed, and what it might have been doing.
>
> > >     If you can get partway through the banner but hang later, that
> > > might be the result of a file system deadlock of some sort.
> >
> > This is also possible, but what could have caused it? My file I/O is not
> > really heavy.
>
> Deadlock is a bit of a misnomer for what I have in mind.  There are two
> classes of things that look like deadlocks: lock order problems, and lock
> leaks.
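
To make the second class concrete, here is a minimal user-space sketch in Python. It assumes that "lock leak" means a lock that is acquired and never released on some code path; the function names are made up for illustration and have nothing to do with the kernel's VFS code. Once the lock has leaked, every later caller blocks, which looks just like a deadlock from the outside.

```python
import threading

def leaky_update(lock, fail):
    """Hypothetical routine that leaks the lock on its error path."""
    lock.acquire()
    if fail:
        return False  # BUG: returns while still holding the lock
    lock.release()
    return True

def try_enter(lock, timeout=0.1):
    """Return True if the lock can be taken within the timeout.

    The lock is released again right away, so this only probes its state.
    """
    if lock.acquire(timeout=timeout):
        lock.release()
        return True
    return False
```

After one call to leaky_update(lock, True), try_enter(lock) keeps timing out, although no second thread is involved and no lock order was violated.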

...

> So the VFS deadlock is somewhat of a shot in the dark, but it has pretty
> easy to identify symptoms, especially if you can get to a debugger.
> They're also fairly easy to analyze.

...

> I think we'll find that it's either a kernel problem, or an X problem
> triggering a kernel problem, so we're unlikely to find useful core dumps
> from applications.  A system core might be useful, but hard to get without
> a serial console.
>
> Ok, so at the end of this all, here were my pieces of advice on debugging
> it, if you can reproduce it:
>
> (1) Compare "slogin -v" to the system in the before and after scenarios,
>     that may tell us a lot about what's broken.
>
> (2) Despite the fact that you can't set up a serial console, set up a
>     serial console.
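
For point (1), the comparison can be done by eye, but a small helper may save time. The following is only a sketch: it assumes both verbose logs were saved as lists of lines, and the function name is made up. Real logs may also differ in harmless details (ports, process IDs), so the first differing line is a starting point, not a diagnosis.

```python
def first_divergence(good_lines, bad_lines):
    """Return (index, good_line, bad_line) where the two logs first differ.

    If one log is a prefix of the other (for example because the session
    hung), the missing side is reported as None. Returns None if the logs
    are identical.
    """
    for i, (good, bad) in enumerate(zip(good_lines, bad_lines)):
        if good != bad:
            return i, good, bad
    if len(good_lines) != len(bad_lines):
        i = min(len(good_lines), len(bad_lines))
        good = good_lines[i] if i < len(good_lines) else None
        bad = bad_lines[i] if i < len(bad_lines) else None
        return i, good, bad
    return None
```

If the broken session simply stops, the result has None as its third element, and the second element is the first step that the working session reached and the hung one did not.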

...

Some strange things have happened over the last few days, all of them related to processes:

(1) I have some zombies I cannot kill:

# ps ax
...
53410  pn  Z      0:00.00  (kate)
...
# kill -9 53410
53410: No such process

The same thing happens with make.
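
For background: a zombie is a process that has already exited, so a signal cannot do anything to it. The entry disappears only when the parent collects the exit status. A minimal Python sketch of that life cycle follows (the function name is made up). It does not explain why the zombies above are never reaped; that depends on what their parent is doing.

```python
import os

def spawn_and_reap(exit_code=7):
    """Fork a child that exits at once, then reap it from the parent."""
    pid = os.fork()
    if pid == 0:
        # Child: terminate immediately without running cleanup handlers.
        os._exit(exit_code)
    # Parent: from the child's exit until the waitpid() call below, the
    # child is a zombie and would show up with state "Z" in ps.
    reaped_pid, status = os.waitpid(pid, 0)
    return reaped_pid == pid, os.WEXITSTATUS(status)
```

If the parent never calls waitpid() (or an equivalent), the zombie stays in the process table until the parent itself exits.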

(2) When I invoke the KDE System Guard, the process list won't show up.

(3) My processes receive a lot of signals (10 and 11), about 30 times a day.
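
On FreeBSD, signal 10 is SIGBUS and signal 11 is SIGSEGV, so both point at invalid memory accesses. Signal numbers differ between systems, so it is safer to look them up on the affected machine than to rely on memory. A small sketch in Python (the helper name is made up):

```python
import signal

def signal_name(number):
    """Translate a signal number into its name on the current platform."""
    try:
        return signal.Signals(number).name
    except ValueError:
        # The number is not a known signal on this system.
        return "unknown"

# The numbering is platform-specific: run this on the machine in question.
for number in (10, 11):
    print(number, signal_name(number))
```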

(4) Kate crashed when I tried to save a document, and since then it has 
crashed every time I open it. So I ran it under gdb:

(gdb) run
Starting program: /usr/local/bin/kate
Deprecated bfd_read called at 
/usr/src/gnu/usr.bin/binutils/gdb/../../../../contrib/gdb/gdb/dbxread.c line 
2627 in elfstab_build_psymtabs
Deprecated bfd_read called at 
/usr/src/gnu/usr.bin/binutils/gdb/../../../../contrib/gdb/gdb/dbxread.c line 
933 in fill_symbuf
ERROR: Communication problem with kate, it probably crashed.

Program exited with code 0377.
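
A note on the exit code: gdb prints it in octal, and 0377 is 255 in decimal. Only the low 8 bits of an exit status are kept, so 255 is also what a program reports after calling exit(-1). Whether kate did that here is an assumption; the arithmetic itself can be checked with a short Python sketch (the helper name is made up):

```python
def exit_status_byte(value):
    """Return the low 8 bits of an exit value, as the parent would see it."""
    return value & 0xFF

# gdb's "0377" is octal notation.
code = int("0377", 8)
print(code)                  # decimal value of octal 0377
print(exit_status_byte(-1))  # what exit(-1) looks like to the parent
```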

I never had problems like these before, so I guess they are a side effect of 
the crash.

Do we have a chance to debug this or should I rebuild my system?
And, most important, could this be a new kernel bug? If yes, I would really 
like to debug it.

Daniela