ahc + vinum raid5 deadlocks?

Fri Feb 20 23:58:05 PST 2004

Hi,
Cross posting to -scsi as it seems maybe related to my scsi
setup.
At least, the problems currently only appear when the scsi parts
are in use.

On Sat, Feb 21, 2004 at 02:47:49PM +1100, Tony Frank wrote:
> Refer earlier thread on -stable for more background.
> 
> 4.9-STABLE (cvsup 20th Feb)
> Kernel was compiled with DDB, INVARIANTS, DIAGNOSTICS.
> All options removed from /etc/make.conf except 'NOPROFILE=TRUE'
> 
> I get the following on the console before system freezes:
> ahc0: WARNING no command for scb 14 (cmdcmplt)
> QOUTPOS = 44

This occured while doing majority of i/o to vinum raid5 volume.

Since that time I have removed the vinum raid5 configuration
and am using the disks directly - ufs mounted on /dev/da0s1h.

When performing the same load benchmark I have received two
separate panic's:

First panic occurs due to KASSERT in ffs_read, refer:
http://fxr.watson.org/fxr/source//ufs/ufs/ufs_readwrite.c?v=RELENG4#L316

%%%%
(da0:ahc0:0:0:0): tagged openings now 40
(da1:ahc0:0:1:0): tagged openings now 40
(da2:ahc0:0:2:0): tagged openings now 40
(da3:ahc0:0:3:0): tagged openings now 40
Feb 21 15:48:16 raider su: tony to root on /dev/ttypc
panic: bp->b_resid != 0

syncing disks... Stopped at      siointr1+0x102: movl    $0,brk_state2.757
db> trace
siointr1(c0f98000,ccdc6c64,c027d546,c0f98000,10) at siointr1+0x102
siointr(c0f98000,10,0,0,0) at siointr+0xb
Xfastintr4(c1090800,1000040,600,20002,ccd54300) at Xfastintr4+0x16
lockmgr(c1090800,1030002,ccd5436c,c032c7a0,ccdc6cac) at lockmgr+0x1fc
vop_stdlock(ccdc6cc4,ccdc6cd4,c01bb269,ccdc6cc4,0) at vop_stdlock+0x20
ufs_vnoperatespec(ccdc6cc4) at ufs_vnoperatespec+0x15
vn_lock(ccd54300,20002,c032c7a0,c1129000,0) at vn_lock+0x71
ffs_sync(c1129000,2,c0b2d600,c032c7a0,c1129000) at ffs_sync+0x17f
sync(c032c7a0,0,c02b9660,c02d6048,100) at sync+0x63
boot(100,0,0,ccdc6de0,c02482a2) at boot+0x8a
panic(c02d6048,ccf5cbc0,5b,ccd8f000,400) at panic+0x79
ffs_read(ccdc6df4,0,ccdc6ea8,c02ec1e0,ccf5cbc0) at ffs_read+0x37a
ufs_readlink(ccdc6e38,ccdc6e68,c01b1c8a,ccdc6e38,c7f7bee0) at ufs_readlink+0x6b
ufs_vnoperate(ccdc6e38,c7f7bee0,c7f7bee0,ccdc6f80,c7f7bee0) at ufs_vnoperate+0x15
namei(ccdc6e80,c7f7bee0,2,ccdc6f80,8137400) at namei+0x302
stat(c7f7bee0,ccdc6f80,bfbe7810,bfbe8298,bfbea49c) at stat+0x41
syscall2(c027002f,2f,2f,bfbea49c,bfbe8298) at syscall2+0x209
Xint0x80_syscall() at Xint0x80_syscall+0x25
db>
%%%%

I have a core from this panic saved.  'trace' is about the extent of
my ddb skills at the moment though.

Checking the archives a similar problem was seen ~6months ago with
particular SCSI disk and having too high tags value.
Refer:
 http://docs.freebsd.org/cgi/mid.cgi?FE045D4D9F7AED4CBFF1B3B813C8533702741FEA

>From what I can see, the ahc driver is forcing a max of 40 tags.
The SCSI hardware is fairly old but was working without problems in
the old system (Win based)

Second panic occurs in ffs_softdep, refer:
http://fxr.watson.org/fxr/source//ufs/ffs/ffs_softdep.c?v=RELENG4#L3590

%%%%
Feb 21 17:16:30 raider su: tony to root on /dev/ttyp8
(da1:ahc0:0:1:0): tagged openings now 40
Feb 21 17:21:29 raider su: tony to root on /dev/ttyp3
(da3:ahc0:0:3:0): tagged openings now 40
(da2:ahc0:0:2:0): tagged openings now 40
panic: handle_written_inodeblock: live inodedep

syncing disks... Stopped at      siointr1+0x102: movl    $0,brk_state2.757
db> trace
siointr1(c0f98000,ccf6bba0,c027d546,c0f98000,10) at siointr1+0x102
siointr(c0f98000,10,0,8,68c040) at siointr+0xb
Xfastintr4(c389c20c,10,c02be0a4,0) at Xfastintr4+0x16
biowait(c389c20c,c106b800,c11d1900,2,c02ec760) at biowait+0x37
bread(ccaaba00,18c040,2000,0,ccf6bc14) at bread+0xb2
ffs_update(ccd51c00,0,68c040,ccd51c00,ccf33580) at ffs_update+0xba
ffs_fsync(ccf6bc78) at ffs_fsync+0x358
ffs_sync(c1069c00,2,c0b2d600,c032c7a0,c1069c00) at ffs_sync+0xdb
sync(c032c7a0,0,c02b9660,c02d56e0,100) at sync+0x63
boot(100,0,c127bb00,ccf6bd18,c024416a) at boot+0x8a
panic(c02d56e0,1,c11a5c80,c38d6178,0) at panic+0x79
handle_written_inodeblock(c11a5c80,c38d6150) at handle_written_inodeblock+0x30e
softdep_disk_write_complete(c38d6150) at softdep_disk_write_complete+0x6a
biodone(c38d6150,1,68c040,c10efa48,c38d6150) at biodone+0x121
complete_rqe(c10efa20,0,c1028c00,f76,c1028d3c) at complete_rqe+0x651
biodone(c10efa20,c1028cb8,c10efa20,c0146d58,c16caac0) at biodone+0xf5
ad_interrupt(c16caac0,c032b7d4,ccf6be38,c0182482,c0fa7900) at ad_interrupt+0x3e7
ata_intr(c0fa7900,c16ccdc0,ccf6be8c,c027e4c2,c032b7d4) at ata_intr+0xd8
add_interrupt_randomness(c032b7d4,0,10,c3870010,c01a0010) at add_interrupt_rando
mness+0xe
Xresume15() at Xresume15+0x2b
--- interrupt, eip = 0xc01ab6e5, esp = 0xccf6be80, ebp = 0xccf6be8c ---
bwillwrite(c16ccdc0,ccf6bf80,cce7b2a0,0,0) at bwillwrite+0x75
dofilewrite(cce7b2a0,c16ccdc0,a,824a00c,f76) at dofilewrite+0xa2
write(cce7b2a0,ccf6bf80,824a00c,bfbff7e0,bfbff800) at write+0x36
syscall2(bfbf002f,bfbf002f,822002f,bfbff800,bfbff7e0) at syscall2+0x209
Xint0x80_syscall() at Xint0x80_syscall+0x25
db>
%%%%

My searches show a few hits for -current and some old items from -stable.
Nothing jumps out at me as a possible solution?
This trace seems to suggest the problem may have been on the IDE disk?

Again the core has been saved so given suitable directions I can do something
with it.

> Hardware is currently Adaptec AHA-2940W S71 (F/w 1.19S8)
> Same issues occured with Adaptec AHA-2940UW/B (F/w 1.32S8)
> I have 4 x 4G IBM SCSI disks on the single internal 68-pin connector.
> Connections are good & termination is correct
> The 4 disks are combined into a single vinum raid5 plex.
> System is well ventilated and cool to the touch.
> Pentium 2 200Mhz, 128mb SDRAM, Asus P2V m/b.
> 
> Currently takes ~2 hours of solid activity to trigger the issue.

These problems are occuring with load after about 30-50 mins of solid
activity.
a 'plain' "make -j4 buildworld buildkernel" completes ok if it runs
by itself without the extra load of my test bench.

> My 'test bench':
> Copying files over NFSv3/udp to vinum raid5 volume via cp & tar
> + make -j4 buildworld
> + Copying large dir trees (/usr/ports, /usr/obj, /usr/src)
> + cvsuping second copy of /usr/src
> + extracting tar archive of /usr/obj

Ideas are welcome,

Tony