7.1 hangs, shutdown terminated

Fri Oct 10 08:50:46 UTC 2008

On Fri, Oct 10, 2008 at 10:40:01AM +0200, Laszlo Nagy wrote:
>  Hi,
>
> A computer hangs every day in the morning at a specific time, between 8  
> AM and 9 AM. We can ping it. Apparently the console works, also gdm  
> works on it, but we are not able to login at all. ssh accepts  
> connections, but the authentication does not continue (e.g. ssh client  
> waits for the server forever...)
>
> I even cannot login on the console as "root" because it accepts the user  
> name, but does not ask for the password!
>
> Pressing Ctrl+Alt+Del on the console waits for about one or two minutes,  
> then I see this on the screen:
>
> http://www.imghype.com/viewer.php?imgdata=9d95ee9d1fstrange_shutdown.jpg
>
> Here is /var/log/messages just before the crash:
>
> Oct 10 01:52:47 shopzeus postgres[81114]: [5-1] WARNING:  nonstandard  
> use of escape in a string literal at character 193
> Oct 10 01:52:47 shopzeus postgres[81114]: [5-2] HINT:  Use the escape  
> string syntax for escapes, e.g., E'\r\n'.
> Oct 10 01:57:11 shopzeus postgres[84132]: [5-1] WARNING:  nonstandard  
> use of escape in a string literal at character 188
> Oct 10 01:57:11 shopzeus postgres[84132]: [5-2] HINT:  Use the escape  
> string syntax for escapes, e.g., E'\r\n'.
> Oct 10 02:00:01 shopzeus postfix/postfix-script[86167]: fatal: the  
> Postfix mail system is already running
> Oct 10 02:30:00 shopzeus postfix/postfix-script[7240]: fatal: the  
> Postfix mail system is already running
> Oct 10 03:00:00 shopzeus postfix/postfix-script[27437]: fatal: the  
> Postfix mail system is already running
> Oct 10 04:07:54 shopzeus rc.shutdown: 30 second watchdog timeout  
> expired. Shutdown terminated.
> Oct 10 04:09:16 shopzeus postgres[30455]: [5-1] FATAL:  terminating  
> connection due to administrator command
> Oct 10 04:09:17 shopzeus syslogd: exiting on signal 15
> Oct 10 04:11:31 shopzeus syslogd: kernel boot file is /boot/kernel/kernel
> Oct 10 04:11:31 shopzeus kernel: Copyright (c) 1992-2008 The FreeBSD  
> Project.
> Oct 10 04:11:31 shopzeus kernel: Copyright (c) 1979, 1980, 1983, 1986,  
> 1988, 1989, 1991, 1992, 1993, 1994
>
> After rebooting the machine, nothing happens until the next day. Here  
> are some possible problems I can think of:
>
> #1. We are using gjournal. It might be that the journal size is too  
> small. Although I do not think this is the case, because we have 40GB  
> journal space for each journaled partition below (except for /home, it  
> has 10GB only, but /home is rarely used)
>
> Filesystem          1G-blocks Used Avail Capacity  Mounted on
> /dev/da0s1a                 9    1     7    14%    /
> devfs                       0    0     0   100%    /dev
> /dev/da0s1f.journal       140   12   117     9%    /home
> /dev/da0s2d.journal       106    8    89     8%    /pgdata0
> /dev/da0s1d                29    0    26     0%    /tmp
> /dev/da0s2e.journal       585   74   464    14%    /usr
> /dev/da0s1e.journal       145   17   116    13%    /var
> /dev/da1s1d.journal       416    0   383     0%    /data
>
> Is it possible that gjournal is hanging up the machine?
>
> #2. Yesterday when I logged in in the morning, I saw a process running  
> under root, it was something like " find / -sx ..." and then something.  
> I don't remember but it was scanning the whole filesystem. It was using  
> 100% cpu and 100% disk I/O. I wonder if that might be freezing the  
> computer. I do not know how to disable this maintenance process but I  
> should. After killing this process, the system worked fine. (We have  
> zillions of files on the disks, running "find / ..." is a bad idea.)

This could be a periodic(8) job (since you said this happens daily) that
runs early in the morning (2-3am?) and for some reason isn't finishing
in a timely manner.  You haven't provided any actual ps -auxww output,
so we can't easily tell whether it's a periodic job or something amiss
on your system (for all we know the system could be compromised).
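
To make the ps request concrete, here is a minimal Python sketch of how
one might pick out periodic/find processes from ps output.  It assumes
the usual 11-column BSD "ps -aux" layout; the keyword list and the
SAMPLE text are made up for illustration and are not taken from the
affected machine.

```python
# Sketch only: filter BSD-style "ps -auxww" output for rows whose command
# looks like a periodic(8) run or a filesystem-wide find.
# KEYWORDS and SAMPLE are illustrative assumptions, not real data.

KEYWORDS = ("periodic", "find ")

# Hypothetical ps output, invented for this example.
SAMPLE = """\
USER   PID %CPU %MEM   VSZ   RSS  TT  STAT STARTED      TIME COMMAND
root  1234 99.0  0.1  3456  1234  ??  R     3:01AM  312:45.10 find -sx / /dev/null -type f
root   987  0.0  0.1  3200  1100  ??  Is    3:01AM    0:00.01 /bin/sh - /usr/sbin/periodic daily
pgsql  555  0.0  1.2 90000 40000  ??  Ss   Thu09AM    1:02.03 postgres: writer process
"""


def suspicious_rows(ps_text, keywords=KEYWORDS):
    """Return (user, pid, cpu, command) for rows whose command has a keyword."""
    rows = []
    for line in ps_text.splitlines()[1:]:  # skip the header line
        # Split into at most 11 fields so the command keeps its spaces.
        fields = line.split(None, 10)
        if len(fields) < 11:
            continue  # malformed or truncated row
        command = fields[10]
        if any(k in command for k in keywords):
            rows.append((fields[0], int(fields[1]), float(fields[2]), command))
    return rows


for row in suspicious_rows(SAMPLE):
    print(row)
```

On the real system the text would come from "ps -auxww" captured while
the machine is still responsive, instead of from SAMPLE.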

I'm also curious what controller your SCSI disks are attached to.  Can
you provide that information?  dmesg output would be useful.  I remember
hearing some reports about 3ware controllers locking up due to firmware
problems which were later fixed via a firmware upgrade.
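
As a rough illustration of what to look for in dmesg, the following
Python sketch pulls out the attach lines of a few common FreeBSD storage
controller drivers.  The driver list is only a guess at what might be in
the machine, and the SAMPLE_DMESG text is invented, not real output.

```python
import re

# Sketch only: find storage-controller attach lines in dmesg output.
# DRIVERS is an assumed, incomplete list of FreeBSD driver names
# (twe/twa are the 3ware drivers); extend it as needed.
DRIVERS = ("twe", "twa", "mpt", "ahc", "ahd", "aac", "mfi", "ciss", "amr")

# Matches lines such as "twa0: <...>" where the line begins with a
# driver name followed by a unit number and a colon.
ATTACH = re.compile(r"^(%s)\d+: " % "|".join(DRIVERS))

# Hypothetical dmesg excerpt, invented for this example.
SAMPLE_DMESG = """\
em0: <Intel(R) PRO/1000 Network Connection> on pci1
twa0: <3ware 9000 series Storage Controller> on pci2
da0 at twa0 bus 0 target 0 lun 0
"""


def controller_lines(dmesg_text):
    """Return the dmesg lines that start with a known controller driver."""
    return [line for line in dmesg_text.splitlines() if ATTACH.match(line)]


for line in controller_lines(SAMPLE_DMESG):
    print(line)
```

Posting the complete dmesg is still preferable, since a filter like
this will miss any controller whose driver is not in the list.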

> #3. In the screenshot above, you can see that the IMAP server "dovecot"  
> was terminated on signal 11. Can it be the problem? I can't believe that  
> dovecot could freeze the whole system.
>
> #4. Hardware error. I don't think this is the case since the computer  
> freezes at the same time, every day, so it is more likely a software  
> problem.

My vote is on a hardware problem.  The watchdog timeout you see
indicates that some part of the system is locking up hard.  The sig 11
is a segfault, which, if unexpected, often points to bad memory or a
bad motherboard.

I would recommend you start down the hardware path.  Replace the RAM and
the mainboard, and see what happens.

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |