PERC5 (LSI MegaSAS) Patrol Read crashes

Fri Sep 28 14:23:24 PDT 2007

We first became aware of this problem about a month ago.  A database 
server was up but was completely unresponsive to anything other than 
pings.  I power cycled it via the DRAC and after we couldn't find 
anything suspicious in the logs, we figured it was a fluke. 

Until the next day, when its twin did exactly the same thing.  This time, 
I was able to get a screenshot through the DRAC console.  Using old 
daily outputs and that screenshot, we correlated the crashes to patrol 
reads.  Since then, we've only seen it "in the wild" on one other 
machine, a 1950, but I've been trying to chase the problem down without 
much luck. 

I'm fortunate to have three machines at my disposal for this testing, so 
I was able to try a variety of combinations:

Server 1:
Chassis:          2950 v1
System BIOS:      1.1.0 
PERC firmware:    1.00.01-0088 PERC F/W (from the 5.0.1-0030 A00 package)
OS:               6.2-R_p7, 6-STABLE

Server 2:
Chassis:          2950 v1
System BIOS:      1.1.0
PERC firmware:    1.03.10-0216 PERC F/W (from the 5.1.1-0040 package)
OS:               6.2-R_p7, 6-STABLE

Server 3:
Chassis:          2950 v2
System BIOS:      1.5.1
PERC firmware:    1.03.10-0216 PERC F/W (from the 5.1.1-0040 package)
OS:               6.2-R_p7

They're all running amd64, and each combination was tried with and 
without the linux_mfi.ko patches found in PR-113232.  For disks, they all 
have 2x36GB RAID1 and 4x73GB RAID10 (all SAS).  We use linux_mfi.ko plus 
linux-megacli for management.

The original problem occurred during automatic patrol reads coupled with 
heavy disk load.  I've changed the delay interval for the automatic 
patrol reads and tried to reproduce it but haven't had enough success to 
make it useful for troubleshooting.  Since the automatic reads are meant 
to be as unobtrusive as possible, I've been running a manual patrol 
read (megacli -AdpPR -Start -a0), which triggers a crash regardless 
of the I/O load. 

The behavior hardly varies: shortly after the read is started, disk 
writes cease (observed via an scp from another machine).  After about a 
minute, the console begins to fill up with lines such as:

mfi0: COMMAND 0xffffffff892bc998 TIMEOUT AFTER 45 SECONDS

The first 8 hex digits of that value never change.  I bring that up because I 
suspect the problem has something to do with the enclosure, which is 
attached at 8, 255, or fffffff, depending on where you're looking. 
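
To check whether that leading portion really is constant across a whole 
console capture, a rough Python sketch like the one below could tally the 
timeout lines.  The function names are mine, and the pattern is based only 
on the single example line above, so treat the exact message format as an 
assumption.

```python
import re
from collections import defaultdict

# Matches the "COMMAND <hex> TIMEOUT AFTER <n> SECONDS" part of a console
# line.  The device prefix is deliberately ignored.  The format is assumed
# from the one example line in this post.
TIMEOUT_RE = re.compile(
    r"COMMAND (0x[0-9a-fA-F]+) TIMEOUT AFTER (\d+) SECONDS"
)


def summarize_timeouts(lines):
    """Return {command_value: longest_timeout_seconds} for matching lines."""
    worst = defaultdict(int)
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            value = int(match.group(1), 16)
            worst[value] = max(worst[value], int(match.group(2)))
    return dict(worst)


def split_value(value):
    """Split a 64-bit value into its upper and lower 32-bit halves."""
    return value >> 32, value & 0xFFFFFFFF


def report(lines):
    """Print each distinct command value with its halves and worst timeout."""
    for value, seconds in sorted(summarize_timeouts(lines).items()):
        upper, lower = split_value(value)
        print(f"upper={upper:#010x} lower={lower:#010x} worst={seconds}s")
```

Feeding it the saved console output (for example, report(open(path)) with 
whatever capture file you have) lists every distinct value and the longest 
timeout seen for it.  If the value is a kernel pointer (an assumption I 
haven't verified against the driver source), a constant upper half would 
be expected on amd64 regardless of the enclosure.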

I've let the timeout count climb to 6000 seconds, but it eventually ends 
in a kernel panic.  The panic just seems to be a side effect of the 
original problem (processes with nowhere to write data), so I'm not too 
hung up on that.  
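
For anyone who wants to timestamp the moment writes stop without relying 
on an scp from another machine, here is a minimal local probe in Python.  
It is only a sketch: the function names, the 4 KB payload, and the 
one-second interval are arbitrary choices of mine, not something we ran 
during the testing described here.

```python
import os
import time


def timed_write(path, payload=b"x" * 4096):
    """Append payload to path, fsync it, and return the elapsed seconds."""
    start = time.monotonic()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        os.write(fd, payload)
        os.fsync(fd)  # force the data to the device, not just the cache
    finally:
        os.close(fd)
    return time.monotonic() - start


def probe(path, interval=1.0, rounds=60):
    """Time one synced write per interval.

    The timestamp is printed to the terminal *before* each write, so if
    the write never returns, the last line shown marks when I/O stalled.
    """
    for _ in range(rounds):
        print(time.strftime("%H:%M:%S"), "writing...", end=" ", flush=True)
        elapsed = timed_write(path)
        print(f"done in {elapsed:.3f}s", flush=True)
        time.sleep(interval)
```

Run probe() against a file on the affected array from an ssh or DRAC 
session, then start the patrol read from a second session.  A line that 
says "writing..." with no "done" after it is the stall.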

There's never anything pertaining to it in the controller's event log.

Besides the platform version differences I mentioned above, I've tried:
- Reducing the patrol read rate
- Pulling down and modifying the patches from PR-115133 (which seems to 
  set an upper boundary at 0xffffffff)
- Invoking a0/aALL interchangeably
- Changing the cache flush interval
- Disabling disk coercion
- A bunch of other long-shot settings from megacli that aren't worth 
  listing

Nothing has shown any appreciable difference in the behavior.

Does anyone have an idea about what could be going on, or anything else 
we can try?  For now, I'll probably just disable patrol reads and set them 
to auto with a 1 hour delay during outage windows only, but I'm hoping 
that someone is able to help with this.  At the very least, maybe I can 
save someone a whole bunch of time.  
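
As a rough illustration of that stopgap, the Python sketch below builds 
the megacli command lines for opening and closing an outage window.  The 
-Dsbl, -EnblAuto and -SetDelay options are how I recall megacli's -AdpPR 
help output, so please verify them against your version before running 
anything.  The helper names are hypothetical, and it defaults to a dry 
run that only prints the commands.

```python
import subprocess

MEGACLI = "megacli"  # binary name as used in this post; adjust as needed


def patrol_read_args(action, adapter="ALL", delay_hours=None):
    """Build the argument list for one patrol-read setting.

    action is one of "disable", "auto", or "delay".  The option spellings
    are assumptions taken from megacli's help text.
    """
    target = f"-a{adapter}"
    if action == "disable":
        return [MEGACLI, "-AdpPR", "-Dsbl", target]
    if action == "auto":
        return [MEGACLI, "-AdpPR", "-EnblAuto", target]
    if action == "delay":
        if delay_hours is None:
            raise ValueError("delay_hours is required for 'delay'")
        return [MEGACLI, "-AdpPR", "-SetDelay", str(delay_hours), target]
    raise ValueError(f"unknown action: {action}")


def outage_window_plan(delay_hours=1):
    """Return (commands at window open, commands at window close)."""
    opening = [
        patrol_read_args("auto"),
        patrol_read_args("delay", delay_hours=delay_hours),
    ]
    closing = [patrol_read_args("disable")]
    return opening, closing


def run(commands, dry_run=True):
    """Print each command, and execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
```

Calling run(outage_window_plan()[0]) at the start of a window and 
run(outage_window_plan()[1]) at the end prints what would be executed.  
Passing dry_run=False actually invokes megacli.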

Thanks in advance for any help.

-- 
Sean McAfee
Collaborative Fusion, Inc.
  smcafee at collaborativefusion.com
  412-422-3463 x 4025

1710 Murray Avenue, Suite 320
Pittsburgh, PA 15217
