4.11-RC3: SCSI+UFS+softupdates corruption (write cache DISABLED!)

Wed Jan 19 05:49:44 PST 2005

Hi,

A FreeBSD 4.11-RC3 machine of mine rebooted without warning. The last
thing the network syslogd captured was an attempted ahc0 (Adaptec 2940 UW
Pro) recovery.

Below is the syslog excerpt as captured by the remote machine, with the
date, the "hostname /kernel:" prefix and the card state dumps removed
(they can be provided if necessary). I wonder whether the SCSI error
recovery attempts caused the reboot. I have no hints either way, but
this machine is otherwise stable.

13:28:35 ahc0: Recovery Initiated
13:28:53 (da0:ahc0:0:0:0): SCB 0x16 - timed out
13:28:53 sg[0] - Addr 0x6da3800 : Length 2048
13:28:53 (da0:ahc0:0:0:0): Other SCB Timeout
13:28:53 ahc0: Timedout SCBs already complete. Interrupts may not be functioning.
13:28:53 ahc0: Recovery Initiated
13:29:02 (da0:ahc0:0:0:0): SCB 0x1b - timed out

13:29:04 (da0:ahc0:0:0:0): BDR message in message buffer
13:29:04 ahc0: Timedout SCBs already complete. Interrupts may not be functioning.
13:29:04 ahc0: Recovery Initiated

13:29:16 Kernel Free SCB list: 9 4 15 20 
13:29:17 sg[7] - Addr 0x3bea000 : Length 4096
13:29:18 ahc0: Issued Channel A Bus Reset. 25 SCBs aborted

When the machine came back up, it stayed in single-user mode because
fsck reported a softupdates inconsistency:

| # fsck -p /usr
| /dev/da0s1g: DIRECTORY CORRUPTED  I=175105  OWNER=root MODE=40755
| /dev/da0s1g: SIZE=512 MTIME=Jan 18 15:14 2005 
| /dev/da0s1g: DIR=?
| 
| /dev/da0s1g: UNEXPECTED SOFT UPDATE INCONSISTENCY; RUN fsck MANUALLY.

I have not yet run fsck interactively to repair this, because I want to
understand what is going on here and keep the file system available for
debugging.

At the time of the crash, these tasks were running:

1. amanda was running a dump(8)

2. I was installing manpages from /usr/src/share/man/man4

3. a cvsup for the ports tree was running (this is likely related to the
   problem)

| # fsdb -r /dev/da0s1g
| fsdb (inum: 2)> inode 175105
| current inode: directory
| I=175105 MODE=40755 SIZE=512
|         MTIME=Jan 18 15:14:48 2005 [0 nsec]
|         CTIME=Jan 18 15:14:48 2005 [0 nsec]
|         ATIME=Jun 19 03:05:43 2003 [0 nsec]
| OWNER=root GRP=wheel LINKCNT=2 FLAGS=0 BLKCNT=4 GEN=4e5151f9
| fsdb (inum: 175105)> cd ..
| component `..': fsdb: name `..' not found in current inode directory

I checked with camcontrol: the write cache is off (see below), but the
queue algorithm modifier is on and cannot be switched off.
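
In case anyone wants to check the same two fields on their own drives,
here is a rough sketch of how they can be filtered out of the modepage
output. It assumes the "NAME:  value" layout shown at the end of this
mail; modepage_field is a made-up helper name, not part of camcontrol.

```shell
# modepage_field NAME
# Print the value of field NAME from "camcontrol modepage" output read
# on stdin. Hypothetical helper; assumes one "NAME:  value" per line.
modepage_field() {
    awk -F': *' -v name="$1" '$1 == name { print $2; exit }'
}

# Sample input, copied from the modepage output at the end of this mail.
sample='WCE:  0
RCD:  0
Queue Algorithm Modifier:  1
QErr:  0'

printf '%s\n' "$sample" | modepage_field WCE                         # prints 0
printf '%s\n' "$sample" | modepage_field "Queue Algorithm Modifier"  # prints 1
```

On a live system the input would come from the command itself, e.g.
"camcontrol modepage da0 -m8 | modepage_field WCE".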

Digging through the old structures with find reveals:

| 175101    4 drwxr-xr-x    3 root             wheel                 512 Sep  1  2002 /usr/X11R6/lib/perl5/site_perl/5.005/i386-freebsd
| 175102    4 drwxr-xr-x    2 root             wheel                 512 Sep  1  2002 /usr/X11R6/lib/perl5/site_perl/5.005/i386-freebsd/auto
| 175103    4 drwxr-xr-x    5 root             wheel                 512 Aug 23  2002 /usr/sup
| 175104    4 drwxr-xr-x    2 root             wheel                 512 Jan 19 13:29 /usr/sup/src-all
> 175105    4 drwxr-xr-x    2 root             wheel                 512 Jan 18 15:14 /usr/sup/ports-all
| 175106    4 drwxr-xr-x    2 root             wheel                 512 Jan 18 15:14 /usr/sup/doc-all
| 175107    4 drwxr-xr-x   22 root             wheel                1024 Sep 28 19:47 /usr/doc
| 175108    4 drwxr-xr-x    6 root             wheel                 512 Dec 19 13:26 /usr/doc/de_DE.ISO8859-1
| 175109    4 drwxr-xr-x    5 root             wheel                 512 Dec 27  2003 /usr/doc/de_DE.ISO8859-1/books

And, as expected:

| # ls -la /usr/sup/ports-all/
| #

Why can a softupdates file system, under such circumstances, become so
corrupt that fsck -p cannot fix it, and end up with a directory lacking
. and ..? Is this a kernel/softupdates bug? How can this directory
become empty?

locate has this information recorded:
/usr/sup/ports-all
/usr/sup/ports-all/#cvs.cvsup-2279.0
/usr/sup/ports-all/checkouts.cvs:.

So apparently three files (checkouts.cvs:., . and ..) or four (perhaps
also the # file) have disappeared. I'm not sure whether fsck will revive
them, and I want to avoid destroying data useful for debugging.

Is the Queue Algorithm Modifier a problem (see below)? I cannot set it
to 0 on this drive (Micropolis 4345WS): with both -P0 and -P3, I get
"camcontrol: error sending mode select command".

How do I go about providing the file system metadata so that someone can
take a look at it? The file system is 3.5 GB in size, so anything beyond
metadata is not feasible. Providing SSH access to the failed machine may
work, though, if you send me your OpenSSH v2-format key.

# camcontrol inquiry da0
pass0: <MICROP 4345WS x43h> Fixed Direct Access SCSI-2 device 
pass0: Serial Number 77HT45XXXX
pass0: 40.000MB/s transfers (20.000MHz, offset 8, 16bit), Tagged Queueing Enabled
# camcontrol modepage da0 -m8
IC:  0
ABPF:  0
CAP:  0
DISC:  0
SIZE:  0
WCE:  0
MF:  0
RCD:  0
...
# camcontrol modepage da0 -m10
RLEC:  0
Queue Algorithm Modifier:  1
QErr:  0
DQue:  0
...

-- 
Matthias Andree