PERC5 (LSI MegaSAS) Patrol Read crashes
Sean McAfee
smcafee at collaborativefusion.com
Fri Sep 28 14:23:24 PDT 2007
We first became aware of this problem about a month ago. A database
server was up but was completely unresponsive to anything other than
pings. I power cycled it via the DRAC and after we couldn't find
anything suspicious in the logs, we figured it was a fluke.
Until the next day, when its twin did exactly the same thing. This time,
I was able to get a screen shot through the DRAC console. Using old
daily outputs and that screenshot, we correlated the crashes to patrol
reads. Since then, we've only seen it "in the wild" on one other
machine, a 1950, but I've been trying to chase the problem down without
much luck.
I'm fortunate to have three machines at my disposal for this testing, so
I was able to try a variety of combinations:
Server 1:
  Chassis: 2950 v1
  System BIOS: 1.1.0
  PERC firmware: 1.00.01-0088 PERC F/W (from the 5.0.1-0030 A00 package)
  OS: 6.2-R_p7, 6-STABLE

Server 2:
  Chassis: 2950 v1
  System BIOS: 1.1.0
  PERC firmware: 1.03.10-0216 PERC F/W (from the 5.1.1-0040 package)
  OS: 6.2-R_p7, 6-STABLE

Server 3:
  Chassis: 2950 v2
  System BIOS: 1.5.1
  PERC firmware: 1.03.10-0216 PERC F/W (from the 5.1.1-0040 package)
  OS: 6.2-R_p7
They're all running amd64, and each combination was tried with and
without the linux_mfi.ko patches found in PR-113232. For disks, they all
have 2x36GB RAID1 and 4x73GB RAID10 (all SAS). We use
linux_mfi.ko+linux-megacli for management.
The original problem occurred during automatic patrol reads coupled with
heavy disk load. I've changed the delay interval for the automatic
patrol reads and tried to reproduce it but haven't had enough success to
make it useful for troubleshooting. Since the automatic reads are meant
to be as unobtrusive as possible, I've been running a manual patrol
read (megacli -AdpPR -Start -a0), which triggers a crash regardless
of the I/O load.
The behavior has little to no variation: shortly after the read is
started, disk writes cease (observed via an scp from another
machine). After a minute, the console begins to fill up with lines
such as:
mfi0: COMMAND 0xffffffff892bc998 TIMEOUT AFTER 45 SECONDS
The first 8 hex digits of the address never change. I bring that up
because I suspect the problem has something to do with the enclosure,
which is attached at 8, 255, or fffffff, depending on where you're looking.
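
For anyone comparing notes, here is a minimal sketch of how the timeout lines quoted above could be tallied from a saved console log, to see whether one command address keeps recurring and how long it has been stuck. The function name and the loose device-name pattern are my own assumptions, not anything the driver defines.

```python
import re
from collections import defaultdict

# Matches lines of the form "<dev>: COMMAND 0x... TIMEOUT AFTER N SECONDS".
# The device prefix is matched loosely (letters followed by a unit number)
# so the exact driver name does not matter.
TIMEOUT_RE = re.compile(
    r"^(?P<dev>[a-z]+\d+): COMMAND (?P<addr>0x[0-9a-fA-F]+) "
    r"TIMEOUT AFTER (?P<secs>\d+) SECONDS"
)


def summarize_timeouts(lines):
    """Return {command address: longest timeout seen, in seconds}."""
    longest = defaultdict(int)
    for line in lines:
        m = TIMEOUT_RE.match(line.strip())
        if m:
            addr = m.group("addr").lower()
            longest[addr] = max(longest[addr], int(m.group("secs")))
    return dict(longest)
```

Calling summarize_timeouts(open("console.log")) on a captured log (the file name is hypothetical) returns one entry per distinct command address, which makes it easy to tell a single wedged command from many.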
I've let it go up to 6000 seconds, but it eventually ends in a kernel panic.
That just seems to be a side effect of the original problem (processes with
nowhere to write data), so I'm not too hung up on that.
There's never anything pertaining to it in the controller's event log.
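
I watched the write stall with an scp, but a local probe would pin down the timing more precisely. The following is only a sketch under my own assumptions (the function names, the 4 KB payload, and the one-second interval are all arbitrary): it prints a timestamp before each synced write, so if the array wedges, the last line on the screen shows when.

```python
import os
import time


def timed_write(path, payload=b"x" * 4096):
    """Append payload to path, fsync it, and return the elapsed seconds."""
    start = time.monotonic()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        os.write(fd, payload)
        os.fsync(fd)  # ask the OS to push the data to the device
    finally:
        os.close(fd)
    return time.monotonic() - start


def probe(path, interval=1.0, count=None):
    """Print a timestamp before each write and the latency after it.

    With count=None this loops until interrupted; if a write never
    returns, the unfinished line marks the time of the stall.
    """
    n = 0
    while count is None or n < count:
        print(time.strftime("%H:%M:%S"), "writing...", end=" ", flush=True)
        print(f"{timed_write(path):.3f}s", flush=True)
        time.sleep(interval)
        n += 1
```

Running probe() against a file on the affected volume just before starting the manual patrol read should show latencies climbing or a write that never completes.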
Besides the platform version differences I mentioned above, I've tried:
- Reducing the patrol read rate
- Pulling down and modifying the patches from PR-115133 (which seems to
  set an upper boundary at 0xffffffff)
- Invoking a0/aALL interchangeably
- Changing the cache flush interval
- Disabling disk coercion
- A bunch of other long-shot settings from megacli that aren't worth
  listing
Nothing has shown any appreciable difference in the behavior.
Does anyone have an idea about what could be going on or anything else
we can try? For now, I'll probably just disable patrol reads and set them
to auto with a 1 hour delay during outage windows only, but I'm hoping that
someone is able to help with this. At the very least, maybe I can save
someone a whole bunch of time.
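
The interim workaround could be scripted roughly as below. This is a sketch, not something I have in production: only -Start appears earlier in this message, and the -Dsbl, -EnblAuto and -SetDelay actions (with the delay taken to be in hours) are from my recollection of the -AdpPR options, so check them against the help output of your megacli version. The helper names are hypothetical, and by default nothing is executed.

```python
import subprocess

MEGACLI = "megacli"  # assumption: binary name as invoked in this message


def patrol_read_cmd(action, adapter="0", delay_hours=None):
    """Build the argv for one -AdpPR action without running it."""
    argv = [MEGACLI, "-AdpPR", f"-{action}"]
    if action == "SetDelay":
        if delay_hours is None:
            raise ValueError("SetDelay needs delay_hours")
        argv.append(str(delay_hours))
    argv.append(f"-a{adapter}")
    return argv


def outage_window_open(adapter="0", delay_hours=1):
    """Commands to re-enable automatic patrol reads for an outage window."""
    return [
        patrol_read_cmd("SetDelay", adapter, delay_hours),
        patrol_read_cmd("EnblAuto", adapter),
    ]


def outage_window_close(adapter="0"):
    """Commands to disable patrol reads again after the window."""
    return [patrol_read_cmd("Dsbl", adapter)]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(" ".join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)
```

Calling run_all(outage_window_open()) prints the commands for review; passing dry_run=False would actually invoke megacli.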
Thanks in advance for any help.
--
Sean McAfee
Collaborative Fusion, Inc.
smcafee at collaborativefusion.com
412-422-3463 x 4025
1710 Murray Avenue, Suite 320
Pittsburgh, PA 15217
More information about the freebsd-hardware
mailing list