kern/187903: Only 0.44 (always) days of uptime with ciss (w/HP SA P812)

Mon Mar 24 16:50:02 UTC 2014

>Number:         187903
>Category:       kern
>Synopsis:       Only 0.44 (always) days of uptime with ciss (w/HP SA P812)
>Confidential:   no
>Severity:       non-critical
>Priority:       low
>Responsible:    freebsd-bugs
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Mon Mar 24 16:50:00 UTC 2014
>Closed-Date:
>Last-Modified:
>Originator:     Nagy, Attila
>Release:        stable/9 at r260621 and stable/10 at r262152
>Organization:
>Environment:
>Description:
I have an HP DL360G7 with an HP SmartArray P812 in it. The machine
crashes at almost exactly 0.44 days of uptime (give or take a few
minutes, but on the graph it's nearly the same every time) no matter
what I do: whether I load the machine until it's too hot to touch or
just leave it idle.
The P812 has an HP MDS600 connected to it with 70 1TB disks, configured
as 6-disk RAID6 (ADG) volumes. The volumes have a 128k stripe size,
because I use ZFS on top of them.
The zpool is simply a stripe of the RAID6 volumes.
What may be important: the controller's RAID6 initialization is still
ongoing.
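
As a quick sanity check of the layout described above, the following
sketch works out how many 6-disk RAID6 volumes fit into 70 disks. The
variable and function names are illustrative only, and the report does
not say what the leftover disks are used for, so the sketch just counts
them.

```python
# Figures taken from the report; all names here are illustrative.
TOTAL_DISKS = 70        # 1 TB disks in the MDS600
DISKS_PER_VOLUME = 6    # each logical drive is a 6-disk RAID6 (ADG)
RAID6_PARITY_DISKS = 2  # RAID6 spends two disks' worth of space on parity
DISK_SIZE_TB = 1


def layout(total_disks, disks_per_volume):
    """Return (number of full volumes, disks left over)."""
    return divmod(total_disks, disks_per_volume)


volumes, leftover = layout(TOTAL_DISKS, DISKS_PER_VOLUME)
data_disks = volumes * (DISKS_PER_VOLUME - RAID6_PARITY_DISKS)

print(f"{volumes} RAID6 volumes, {leftover} disks left over")
print(f"about {data_disks * DISK_SIZE_TB} TB of data capacity in the stripe")
```

This gives 11 volumes with 4 disks left over, which agrees with the 11
logical drives the controller reports at POST.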

Above, idle means the zpool/zfs is just mounted with only some stat()s
happening on it (crashes after 0.44 days), and fully loaded means gstat
shows around 100% utilization on the disks nearly all the time (also
crashes after 0.44 days).

I've already tried stable/9 at r260621 and stable/10 at r262152; the
result is the same.
I've also tried Linux (Ubuntu 13.10, hpsa driver, zfs on linux 0.6.2),
and it doesn't crash, either idle or loaded.
I've already swapped both the machine and the P812 for different ones,
with no effect. Everything (DL360, P812, MDS600, disks) has the latest
firmware.

The ZFS pool currently in use was created under Linux to see whether
this causes the problems, but of course many things differ between the
two OSes (kernel, HP SA driver, block/SCSI layer, and even ZFS is
somewhat different).
Linux works; FreeBSD crashes no matter what I do.

The exact message I can see is (ciss0 is the built-in P411):
ciss1: ADAPTER HEARTBEAT FAILED

Fatal trap 1: privileged instruction fault while in kernel mode
cpuid = 0; apic id = 00
instruction pointer     = 0x20:0xfffffe0c59ff795d
stack pointer           = 0x28:0xfffffe0baf1ab9d0
frame pointer           = 0x28:0xfffffe0baf1aba20
code segment            = base 0x0, limit 0xfffff, type 0x1b
                         = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 12 (swi4: clock)
trap number             = 1
panic: privileged instruction fault
cpuid = 0
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0baf1ab560
kdb_backtrace() at kdb_backtrace+0x39/frame 0xfffffe0baf1ab610
panic() at panic+0x155/frame 0xfffffe0baf1ab690
trap_fatal() at trap_fatal+0x3a2/frame 0xfffffe0baf1ab6f0
trap() at trap+0x794/frame 0xfffffe0baf1ab910
calltrap() at calltrap+0x8/frame 0xfffffe0baf1ab910
--- trap 0x1, rip = 0xfffffe0c59ff795d, rsp = 0xfffffe0baf1ab9d0, rbp = 0xfffffe0baf1aba20 ---
(null)() at 0xfffffe0c59ff795d/frame 0xfffffe0baf1aba20
softclock_call_cc() at softclock_call_cc+0x16c/frame 0xfffffe0000e77120
kernphys() at 0xffffffff/frame 0xfffffe0000e778a0
kernphys() at 0xffffffff/frame 0xfffffe0000e78aa0
kernphys() at 0xffffffff/frame 0xfffffe0000e78c20
Uptime: 10h18m12s
(da4:ciss1:0:0:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da4:ciss1:0:0:0): CAM status: Command timeout
(da4:ciss1:0:0:0): Error 5, Retries exhausted
(da4:ciss1:0:0:0): Synchronize cache failed
(da5:ciss1:0:1:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da5:ciss1:0:1:0): CAM status: Command timeout
(da5:ciss1:0:1:0): Error 5, Retries exhausted
(da5:ciss1:0:1:0): Synchronize cache failed
(da6:ciss1:0:2:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da6:ciss1:0:2:0): CAM status: Command timeout
(da6:ciss1:0:2:0): Error 5, Retries exhausted
(da6:ciss1:0:2:0): Synchronize cache failed
(da7:ciss1:0:3:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da7:ciss1:0:3:0): CAM status: Command timeout
(da7:ciss1:0:3:0): Error 5, Retries exhausted
(da7:ciss1:0:3:0): Synchronize cache failed
(da8:ciss1:0:4:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da8:ciss1:0:4:0): CAM status: Command timeout
(da8:ciss1:0:4:0): Error 5, Retries exhausted
(da8:ciss1:0:4:0): Synchronize cache failed
(da9:ciss1:0:5:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da9:ciss1:0:5:0): CAM status: Command timeout
(da9:ciss1:0:5:0): Error 5, Retries exhausted
(da9:ciss1:0:5:0): Synchronize cache failed
(da10:ciss1:0:6:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da10:ciss1:0:6:0): CAM status: Command timeout
(da10:ciss1:0:6:0): Error 5, Retries exhausted
(da10:ciss1:0:6:0): Synchronize cache failed
(da11:ciss1:0:7:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da11:ciss1:0:7:0): CAM status: Command timeout
(da11:ciss1:0:7:0): Error 5, Retries exhausted
(da11:ciss1:0:7:0): Synchronize cache failed
(da12:ciss1:0:8:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da12:ciss1:0:8:0): CAM status: Command timeout
(da12:ciss1:0:8:0): Error 5, Retries exhausted
(da12:ciss1:0:8:0): Synchronize cache failed
(da13:ciss1:0:9:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da13:ciss1:0:9:0): CAM status: Command timeout
(da13:ciss1:0:9:0): Error 5, Retries exhausted
(da13:ciss1:0:9:0): Synchronize cache failed
(da14:ciss1:0:10:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da14:ciss1:0:10:0): CAM status: Command timeout
(da14:ciss1:0:10:0): Error 5, Retries exhausted
(da14:ciss1:0:10:0): Synchronize cache failed
Automatic reboot in 15 seconds - press a key on the console to abort
Rebooting...
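
To relate the "Uptime:" line in the console output above to the 0.44
days figure, here is a minimal sketch that converts such a string to
days. The helper name uptime_to_days is made up for this example, and
the parser only assumes the NdNhNmNs form seen in the log.

```python
import re


def uptime_to_days(text):
    """Convert a kernel 'Uptime:' value such as '10h18m12s' to days."""
    match = re.fullmatch(r"(?:(\d+)d)?(?:(\d+)h)?(?:(\d+)m)?(\d+)s", text)
    if match is None:
        raise ValueError(f"unrecognized uptime string: {text!r}")
    # Fields that are absent from the string count as zero.
    days, hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    total_seconds = ((days * 24 + hours) * 60 + minutes) * 60 + seconds
    return total_seconds / 86400  # 86400 seconds per day


print(round(uptime_to_days("10h18m12s"), 2))  # prints 0.43
```

10h18m12s is 37092 seconds, or about 0.43 days, roughly a quarter of an
hour short of 0.44 days (38016 seconds).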

Dmesg says:
ciss1: <HP Smart Array P812> port 0x5000-0x50ff mem 0xfbe00000-0xfbffffff,0xfbdf0000-0xfbdf0fff irq 24 at device 0.0 on pci9
ciss1: PERFORMANT Transport
da5 at ciss1 bus 0 scbus2 target 1 lun 0
da4 at ciss1 bus 0 scbus2 target 0 lun 0
da6 at ciss1 bus 0 scbus2 target 2 lun 0
da7 at ciss1 bus 0 scbus2 target 3 lun 0
da8 at ciss1 bus 0 scbus2 target 4 lun 0
da9 at ciss1 bus 0 scbus2 target 5 lun 0
da10 at ciss1 bus 0 scbus2 target 6 lun 0
da11 at ciss1 bus 0 scbus2 target 7 lun 0
da12 at ciss1 bus 0 scbus2 target 8 lun 0
da13 at ciss1 bus 0 scbus2 target 9 lun 0
da14 at ciss1 bus 0 scbus2 target 10 lun 0
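
The attach lines above can be cross-checked against the number of
logical drives the controller reports. The sketch below parses them and
counts the da devices behind ciss1; the regular expression and the
function name are assumptions made for this example, not anything taken
from the driver.

```python
import re

# Attach lines copied from the dmesg output in this report.
DMESG = """\
da5 at ciss1 bus 0 scbus2 target 1 lun 0
da4 at ciss1 bus 0 scbus2 target 0 lun 0
da6 at ciss1 bus 0 scbus2 target 2 lun 0
da7 at ciss1 bus 0 scbus2 target 3 lun 0
da8 at ciss1 bus 0 scbus2 target 4 lun 0
da9 at ciss1 bus 0 scbus2 target 5 lun 0
da10 at ciss1 bus 0 scbus2 target 6 lun 0
da11 at ciss1 bus 0 scbus2 target 7 lun 0
da12 at ciss1 bus 0 scbus2 target 8 lun 0
da13 at ciss1 bus 0 scbus2 target 9 lun 0
da14 at ciss1 bus 0 scbus2 target 10 lun 0
"""

ATTACH = re.compile(r"^(da\d+) at (ciss\d+) bus \d+ scbus\d+ target (\d+) lun \d+$")


def targets_by_device(dmesg, controller):
    """Map each da device attached to `controller` to its target number."""
    mapping = {}
    for line in dmesg.splitlines():
        match = ATTACH.match(line.strip())
        if match and match.group(2) == controller:
            mapping[match.group(1)] = int(match.group(3))
    return mapping


devices = targets_by_device(DMESG, "ciss1")
print(len(devices), "devices on ciss1")
```

The count comes out at 11 (da4 through da14 on targets 0 through 10),
matching the 11 logical drives shown at POST.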

I also find it interesting that the machine's IML (Integrated Management 
Log) contains this message after every crash:
POST Error: 1719 - A controller failure event occurred prior to this power-up

This suggests that the controller does indeed lock up, but why does it
do so under FreeBSD and not under Linux?
I've already tried setting
hw.ciss.nop_message_heartbeat=1;ciss_force_transport=1;ciss_force_interrupt=1
without any effect (it still freezes after the same amount of time).

During the most recent POST the controller said:
Slot 2  HP Smart Array P812 Controller       (1024MB, v6.40)  11 Logical Drives
1719-Slot 2 Drive Array - A controller failure event occurred prior to this
      power-up.  (Previous lock up code = 0x13)

Any ideas on what could cause this?

Mailing list link:
http://lists.freebsd.org/pipermail/freebsd-scsi/2014-March/006292.html
>How-To-Repeat:
(at least here)
Create 11 RAID6 volumes (6 disks each) on a SmartArray P812 with a 128k stripe size, format them with ZFS, and leave the system alone; after 0.44 days it crashes.
>Fix:

>Release-Note:
>Audit-Trail:
>Unformatted: