Bash lockups

Fri May 28 18:28:30 UTC 2010

Giorgos Keramidas <keramida at ceid.upatras.gr> writes:

> On Fri, 21 May 2010 09:30:05 -0700, Carl Johnson <carlj at peak.org> wrote:
>> Giorgos Keramidas <keramida at ceid.upatras.gr> writes:
>>> Does this lock-up happen if you leave the shell 'idle' for too long
>>> over an ssh session?  There may be problems with stateful connection
>>> tracking between your terminal and the remote shell :-/
>>
>> No, I don't think that could be the problem.  I am just using ssh
>> between local machines and there is no firewall between them.  It also
>> often seems to happen to a shell as I switch away from it to another
>> one.  One suspicion is that something is sending a signal to the shell
>> as it switches, and bash sometimes doesn't handle that signal
>> properly.
>>
>> I also should have mentioned that I have been running bash as my
>> default shell for years under Linux and have never seen this problem
>> there.
>>
>> Thanks for the suggestion.
>
> That's ok.  If you can attach to the bash process with ktrace please try
> to grab a ktrace file from a deadlocked shell.  We may be able to see
> why it gets deadlocked by running kdump(8) on the shell trace file.
>
> You can run a second shell under ktrace (and hope that the parent
> doesn't deadlock before the traced child shell), by running:
>
>     bash$ ktrace -f bash.trace bash --login
>
> When you exit from the child shell you can dump ktrace(8) events from
> the bash.trace file with:
>
>     bash$ kdump -f bash.trace > logfile 2>&1
>
> Looking near the last records dumped in 'logfile' should be quite
> informative if the process is dead-locked or spinning around the same
> code over and over again.

I finally got one after starting ktrace a few days ago.  It is
informative, but it raises as many questions as it answers.  It
basically just wrote out the prompt, *started* to setup for reading
the input and just stopped.  I ran gdb on it and it is stuck looping
somewhere in getenv.  I don't have the system compiled with debugging,
so I have limited information on what it is doing there.  I checked
multiple times, and I also saw getenv running routines such as memset,
strlen, mbrtowc, and wcsnrtombs.

The following is the tail end of the 'kdump -Ef' output:
  67263 bash     61412.013860 GIO   fd 2 wrote 28 bytes
       0x0000 0d0f 1b5b 316d 5b63 6172 6c6a 4063 6a62 7364 3874 207e 5d24 1b5b  |...[1m[carlj at cjbsd8t ~]$.[|
       0x001a 6d20                                                              |m |

  67263 bash     61412.013867 RET   write 28/0x1c
  67263 bash     61412.013874 CALL  sigprocmask(SIG_SETMASK,0x80e133c,0)
  67263 bash     61412.013880 RET   sigprocmask 0

and the following is the similar section of a normal prompt:
  67263 bash     61403.461469 GIO   fd 2 wrote 27 bytes
       0x0000 0f1b 5b31 6d5b 6361 726c 6a40 636a 6273 6438 7420 7e5d 241b 5b6d  |..[1m[carlj at cjbsd8t ~]$.[m|
       0x001a 20                                                                | |
  67263 bash     61403.461476 RET   write 27/0x1b
  67263 bash     61403.461483 CALL  sigprocmask(SIG_SETMASK,0x80e133c,0)
  67263 bash     61403.461489 RET   sigprocmask 0
  67263 bash     61403.461497 CALL  sigprocmask(SIG_BLOCK,0,0x80e1e3c)
  67263 bash     61403.461504 RET   sigprocmask 0
  67263 bash     61403.461513 CALL  read(0,0xbfbfd95f,0x1)

I just realized there is an extra CR at the beginning of that prompt
(28 bytes instead of 27) that I don't see elsewhere, but nothing else
before that looks different.  This one is an i368 8.0 release, but I
also have another hung shell in a amd64 7.3 release system in
VirtualBox.  I just checked my other ktrace logs and I found one
other place where that extra CR occurs, but there is no lockup there
and that was my other system.

The following is a section of a backtrace from gdb: 
  #0  0x28308540 in mbrtowc () from /lib/libc.so.7
  #1  0x080c7ce6 in getenv ()
  #2  0x080c1335 in getenv ()
  #3  0x080ae1d4 in getenv ()
  #4  0x080ac4b0 in getenv ()
  #5  0x080ac815 in getenv ()
  #6  0x080c3955 in getenv ()
  #7  0x080c3ac9 in getenv ()
  #8  0x080ac4b0 in getenv ()
  #9  0x080ac815 in getenv ()
  #10 0x080acb6c in getenv ()
  #11 0x080acf55 in getenv ()
  #12 0x08054611 in ?? ()
  #13 0x284a9a80 in ?? ()
  ...
  #67 0x2832cbfd in time () from /lib/libc.so.7

The first few entries change when I let it run for a while, but the
last 8-9 getenv addresses and everything before them remain the same.
There are a total of about 65 backtrace entries this time, some of
which are 0x00000000 addresses which seem suspicious.  The backtrace
from the other hung shell is also in getenv, but I didn't have ktrace
running on that one.

I am at the limit of my experience, so does anybody else have any
ideas about what could cause this, or how I could trace it further?  I
am keeping the processes attached to gdb, so I can do further checking
on them if anyone has any other ideas.  Thanks in advance for any
help, and thanks for the help that allowed me to get this far.

-- 
Carl Johnson		carlj at peak.org