ahc and massive ffs+softupdates corruption

Matthias Andree ma at dt.e-technik.uni-dortmund.de
Tue Feb 17 15:20:20 PST 2004


Hi,

I have seen a SCSI timeout that preceded massive file system corruption
some days ago. As this happened with the drive's write cache turned off
and with ffs softupdates, I think it's worth reporting.

Hardware: Micropolis 4345WS (UWSCSI disk drive, 7200/min, device da0)
attached to an Adaptec 2940UW Pro (ahc0, aic7880 chip); on the same bus
there's a Yamaha CRW4416S CD writer (current firmware) and a Plextor
PX-20TS CD-ROM (firmware 1.00).

This machine had a SCSI timeout problem on Friday Feb 6th and went down
hard, suffering massive file system corruption on /var. At that time,
the machine was running portupgrade -a. /var is using softupdates and
uses default mount options. As said before, the drive's FWC enable was
set to 0 in both the current and saved editions of mode page 8, and I
wonder how such massive corruption can happen. I was under the
impression that softupdates prevented any on-disk corruptions that
require user intervention at fsck time. Given that the write cache was
off, I am wondering if there are any ffs+softupdates or tagged command
queueing bugs left (that might reorder writes - ordered tag forgotten or
something).

Here's the log, first the SCSI crash, then, on Feb 7, then excerpts from
the reboot including fsck. / and /usr also suffered minor file system
inconsistencies that "fsck -p" was able to fix automatically.

Feb  6 20:10:20 libertas /kernel: swap_pager: indefinite wait buffer: device: #da/0x20001, blkno: 4296, size: 24576
Feb  6 20:10:41 libertas /kernel: (da0:ahc0:0:0:0): SCB 0x0 - timed out
Feb  6 20:13:12 libertas /kernel: (da0:ahc0:0:0:0): Other SCB Timeout
Feb  6 20:13:12 libertas /kernel: (da0:ahc0:0:0:0): SCB 0x21 - timed out
Feb  6 20:13:21 libertas /kernel: (da0:ahc0:0:0:0): BDR message in message buffer
Feb  6 20:13:21 libertas /kernel: (da0:ahc0:0:0:0): SCB 0x21 - timed out
Feb  6 20:13:30 libertas /kernel: (da0:ahc0:0:0:0): no longer in timeout, status = 34b
Feb  6 20:13:30 libertas /kernel: ahc0: Issued Channel A Bus Reset. 12 SCBs aborted
Feb  6 20:13:30 libertas /kernel: swap_pager: indefinite wait buffer: device: #da/0x20001, blkno: 4296, size: 24576
Feb  6 20:13:30 libertas /kernel: swap_pager: indefinite wait buffer: device: #da/0x20001, blkno: 2056, size: 8192
Feb  6 20:13:30 libertas /kernel: (da0:ahc0:0:0:0): SCB 0x20 - timed out
Feb  6 20:13:37 libertas /kernel: (da0:ahc0:0:0:0): Other SCB Timeout
Feb  6 20:13:37 libertas /kernel: (da0:ahc0:0:0:0): SCB 0xa - timed out
Feb  6 20:13:44 libertas /kernel: (da0:ahc0:0:0:0): BDR message in message buffer
Feb  6 20:13:44 libertas /kernel: (da0:ahc0:0:0:0): SCB 0xa - timed out
Feb  6 20:13:51 libertas /kernel: (da0:ahc0:0:0:0): no longer in timeout, status = 34b
Feb  6 20:13:51 libertas /kernel: ahc0: Issued Channel A Bus Reset. 3 SCBs aborted
Feb  6 20:13:51 libertas /kernel: (da0:ahc0:0:0:0): SCB 0x22 - timed out
Feb  6 20:13:58 libertas /kernel: (da0:ahc0:0:0:0): Other SCB Timeout
Feb  6 20:13:58 libertas /kernel: (da0:ahc0:0:0:0): SCB 0xa - timed out
Feb  6 20:14:05 libertas /kernel: (da0:ahc0:0:0:0): BDR message in message buffer
Feb  6 20:14:05 libertas /kernel: (da0:ahc0:0:0:0): SCB 0xa - timed out
Feb  6 20:14:14 libertas /kernel: (da0:ahc0:0:0:0): no longer in timeout, status = 34b
Feb  6 20:14:14 libertas /kernel: ahc0: Issued Channel A Bus Reset. 4 SCBs aborted
Feb  6 20:15:01 libertas /usr/sbin/cron[58193]: (root) CMD (/usr/libexec/atrun) 
Feb  6 20:20:00 libertas /usr/sbin/cron[64038]: (root) CMD (/usr/libexec/atrun) 

Feb  7 12:37:03 libertas /kernel: ahc0: <Adaptec 2940 Pro Ultra SCSI adapter> port 0xe800-0xe8ff mem 0xe3020000-0xe3020fff irq 15 at devi
ce 8.0 on pci0
Feb  7 12:37:03 libertas /kernel: aic7880: Ultra Wide Channel A, SCSI Id=7, 16/253 SCBs

Feb  7 12:37:04 libertas /kernel: da0 at ahc0 bus 0 target 0 lun 0
Feb  7 12:37:04 libertas /kernel: da0: <MICROP 4345WS x43h> Fixed Direct Access SCSI-2 device 
Feb  7 12:37:04 libertas /kernel: da0: 40.000MB/s transfers (20.000MHz, offset 8, 16bit), Tagged Queueing Enabled
Feb  7 12:37:04 libertas /kernel: da0: 4340MB (8890029 512 byte sectors: 255H 63S/T 553C)

Feb  7 12:37:04 libertas /kernel: /dev/da0s1e: 
Feb  7 12:37:04 libertas /kernel: UNKNOWN FILE TYPE I=256
Feb  7 12:37:04 libertas /kernel: 
Feb  7 12:37:04 libertas /kernel: /dev/da0s1e: UNEXPECTED SOFT UPDATE INCONSISTENCY; RUN fsck MANUALLY.
Feb  7 12:37:04 libertas /kernel: THE FOLLOWING FILE SYSTEM HAD AN UNEXPECTED INCONSISTENCY:
Feb  7 12:37:04 libertas /kernel: 
Feb  7 12:37:04 libertas /kernel: /dev/da0s1e (/var)
Feb  7 12:37:04 libertas /kernel: Automatic file system check failed . . . help!
Feb  7 12:37:04 libertas /kernel: Enter full pathname of shell or RETURN for /bin/sh: 
Feb  7 12:37:04 libertas /kernel: # 

Entered shell and typed "fsck":

Feb  7 12:37:04 libertas /kernel: ** /dev/da0s1e
Feb  7 12:37:04 libertas /kernel: ** Last Mounted on /var
Feb  7 12:37:04 libertas /kernel: ** Phase 1 - Check Blocks and Sizes

Feb  7 12:37:04 libertas /kernel: UNKNOWN FILE TYPE I=256
Feb  7 12:37:04 libertas /kernel: 
Feb  7 12:37:04 libertas /kernel: UNEXPECTED SOFT UPDATE INCONSISTENCY
Feb  7 12:37:04 libertas /kernel: 
Feb  7 12:37:04 libertas /kernel: CLEAR? [yn] 
Feb  7 12:37:04 libertas /kernel: 

Last 6 lines repeat with "I=" figures monotonically increasing, finally:

Feb  7 12:37:04 libertas /kernel: UNKNOWN FILE TYPE I=264
Feb  7 12:37:04 libertas /kernel: 
Feb  7 12:37:04 libertas /kernel: UNEXPECTED SOFT UPDATE INCONSISTENCY
Feb  7 12:37:04 libertas /kernel: 
Feb  7 12:37:04 libertas /kernel: CLEAR? [yn] 
Feb  7 12:37:04 libertas /kernel: 
Feb  7 12:37:04 libertas /kernel: 
Feb  7 12:37:04 libertas /kernel: UNREF FILE 

The log ends here, but isn't complete, I have agreed to reconnect dozens
of files to lost+found after fsck asked, many but not all of them from
/var/db/pkg.

-- 
Matthias Andree

Encrypt your mail: my GnuPG key ID is 0x052E7D95


More information about the freebsd-stable mailing list