svn commit: r203889 - in stable/8/sys: cam cam/ata cam/scsi dev/ahci dev/asr dev/ata dev/ciss dev/hptiop dev/hptrr dev/mly dev/mpt dev/ppbus dev/siis dev/trm dev/twa dev/usb/storage

Thu Feb 18 14:27:41 UTC 2010

Hi Alexander and all,

On 02/15/10 06:38, Alexander Motin wrote:
> Author: mav
> Date: Sun Feb 14 19:38:27 2010
> New Revision: 203889
> URL: http://svn.freebsd.org/changeset/base/203889
>
> Log:
>    MFC r203108:
>    Large set of CAM improvements:

[snip]

I've been having issues with the mpt-driven LSI SAS adapter in my
SunFire X4100 server running FreeBSD 8-STABLE r202132. Under certain
disk workloads, such as an svn update of the src tree or a kernel
compile, the disk subsystem becomes extremely unresponsive, as if
stalled, and /var/log/messages reports a number of these:

mpt0: mpt_cam_event: 0x16

It eventually recovers after a minute or two, even though the svn op
or build is still running. The stall/recover cycle may then repeat a
few times, sometimes with minutes between events.

A couple of times it has gotten even more upset reporting things like this:

mpt0: mpt_cam_event: 0x16
mpt0: mpt_cam_event: 0x16
mpt0: request 0xffffff80002f1400:54058 timed out for ccb 0xffffff0001c65000 (req->ccb 0xffffff0001c65000)
mpt0: attempting to abort req 0xffffff80002f1400:54058 function 0
mpt0: request 0xffffff80002fd100:54059 timed out for ccb 0xffffff009f3ec800 (req->ccb 0xffffff009f3ec800)
mpt0: request 0xffffff80002efcf0:54060 timed out for ccb 0xffffff0001bd2000 (req->ccb 0xffffff0001bd2000)
mpt0: mpt_recover_commands: IOC Status 0x4a. Resetting controller.
mpt0: mpt_cam_event: 0x0
mpt0: mpt_cam_event: 0x0
mpt0: completing timedout/aborted req 0xffffff80002f1400:54058
mpt0: completing timedout/aborted req 0xffffff80002fd100:54059
mpt0: completing timedout/aborted req 0xffffff80002efcf0:54060
mpt0: mpt_cam_event: 0x16
mpt0: mpt_cam_event: 0x12
mpt0: mpt_cam_event: 0x12
mpt0: mpt_cam_event: 0x16
mpt0: Volume(0:2): Volume Status Changed
mpt0: request 0xffffff80002f8990:0 timed out for ccb 0xffffff009f3cb800 (req->ccb 0)

No ill effects are observed after such an episode, and the array
remains in a healthy, normal state. The only observable problem is the
stall of all disk IO while these events occur.
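
In case it helps anyone correlate the stalls with the event codes, here
is a minimal sketch of how the mpt_cam_event lines could be tallied from
a log. The regular expression is inferred purely from the excerpts
above, not from the mpt driver source, and the function name
tally_events is just an illustrative choice.

```python
import re
from collections import Counter

# Matches lines such as "mpt0: mpt_cam_event: 0x16". The pattern is an
# assumption based on the log excerpts in this mail.
EVENT_RE = re.compile(r"(mpt\d+): mpt_cam_event: (0x[0-9a-fA-F]+)")


def tally_events(lines):
    """Count occurrences of each (controller, event code) pair."""
    counts = Counter()
    for line in lines:
        match = EVENT_RE.search(line)
        if match:
            counts[(match.group(1), match.group(2).lower())] += 1
    return counts


if __name__ == "__main__":
    # A few lines copied from the excerpt above, used as sample input.
    sample = [
        "mpt0: mpt_cam_event: 0x16",
        "mpt0: mpt_recover_commands: IOC Status 0x4a. Resetting controller.",
        "mpt0: mpt_cam_event: 0x0",
        "mpt0: mpt_cam_event: 0x12",
        "mpt0: mpt_cam_event: 0x16",
    ]
    for (unit, code), n in sorted(tally_events(sample).items()):
        print(f"{unit} event {code}: {n}")
```

To run it against the real log, pass an open file instead of the sample
list, e.g. tally_events(open("/var/log/messages")).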

The disk configuration is 2 x 320GB WD3200BEKT 7200RPM SATA HDDs in 
RAID1. The hardware reports itself as:

mpt0: <LSILogic SAS/SATA Adapter> port 0xa800-0xa8ff mem 0xfc4fc000-0xfc4fffff,0xfc4e0000-0xfc4effff irq 28 at device 3.0 on pci2
mpt0: [ITHREAD]
mpt0: MPI Version=1.5.13.0
mpt0: Capabilities: ( RAID-0 RAID-1E RAID-1 )
mpt0: 1 Active Volume (2 Max)
mpt0: 2 Hidden Drive Members (10 Max)

mpt0@pci0:2:3:0:        class=0x010000 card=0x30601000 chip=0x00501000 rev=0x02 hdr=0x00
     vendor     = 'LSI Logic (Was: Symbios Logic, NCR)'
     device     = 'SAS 3000 series, 4-port with 1064 -StorPort'
     class      = mass storage
     subclass   = SCSI

As best I can tell, the hardware is OK: both disks report as fine
without SMART errors and are only 2 months old, so I wanted to rule out
software issues. On upgrading to a recent 8-STABLE, I got a page fault
kernel panic on boot in the mpt driver's mpt_raid0 kproc. After some
trial and error, I found that r203888 is the most recent revision that
boots fine, whilst r203889 exhibits the page fault. I should also note
that r203888 still sees the "mpt0: mpt_cam_event: 0x16" messages and the
associated disk IO stalls.

I compiled DDB into my r203889 kernel. Unfortunately, my ILO emulates a
USB keyboard, so I can't do anything in DDB, which is a huge pain, but
here's the info I did get (hand transcribed):

Fatal trap 12: page fault while in kernel mode
current process: mpt_raid0
Stopped at xpt_rescan+0x1d:     movq   0x10(%rsi),%rdx

So there are two separate issues here:

1. Any thoughts on how to resolve the regression in the mpt driver with 
the r203889 commit?

2. Any thoughts on the behaviour I'm seeing with the mpt_cam_event 
messages? Is it possible it's just a driver issue? Is the hardware 
likely bad? I'm really hoping they'll go away once the driver issue is 
resolved as the freezes are fairly unacceptable on a production machine 
and the hardware appears to pass all checks I've done so far.

Cheers,
Lawrence