kern/179932: ciss i/o stall problem with HP Bl Gen8 (and HP Bl Gen7 + Storage Blade)

Philipp Maechler philipp.maechler at hostpoint.ch
Mon Jun 24 17:50:01 UTC 2013


>Number:         179932
>Category:       kern
>Synopsis:       ciss i/o stall problem with HP Bl Gen8 (and HP Bl Gen7 + Storage Blade)
>Confidential:   no
>Severity:       non-critical
>Priority:       low
>Responsible:    freebsd-bugs
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Mon Jun 24 17:50:00 UTC 2013
>Closed-Date:
>Last-Modified:
>Originator:     Philipp Maechler
>Release:        9.1
>Organization:
Hostpoint AG
>Environment:
FreeBSD XX.hostpoint.ch 9.1-RELEASE-p4 FreeBSD 9.1-RELEASE-p4 #9 r251905M: Tue Jun 18 07:48:10 UTC 2013     root at XX.hostpoint.ch:/usr/obj/usr/src/sys/HOSTPOINT  amd64

Kernel Build Config Used:

include GENERIC

ident HOSTPOINT

options        IPFIREWALL
options        IPFIREWALL...
options        IPFIREWALL...
options        IPFIREWALL...
options        IPFIREWALL...
options        IPDIVERT
options        KDB  
options        DDB                 
options        KDTRACE_HOOKS        
options        DDB_CTF              
options        KDTRACE_FRAME        
makeoptions DEBUG="-g"
makeoptions    WITH_CTF=1

Hardware:
* HP BL ProLiant 465c Gen8 with 2 internal disks (smart array p220i)
* HP BL ProLiant 465c Gen7 with 2 internal and 8 external disks in HP Storage Blade D2200sb (2x smart array p410i)

Not affected by this problem are all ProLiant 465c Gen6 systems and all
ProLiant 465c Gen7 systems without the Storage Blade D2200sb. We are not
sure whether the Gen7 blades without a storage blade are unaffected
entirely or whether we simply have not triggered the problem on them yet.
In total we are running about 130 of the blade servers affected by this
problem.
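
For completeness, this is roughly how we identify the controller on a
given blade (a sketch only, purely informational):

  # list the Smart Array controllers seen by the ciss(4) driver
  pciconf -lv | grep -B2 -A2 "Smart Array"
  # list the logical drives attached through them
  camcontrol devlist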

>Description:
In normal operation our blade servers perform a very high level of disk
i/o, especially the systems with storage blades, which are used as
database servers. The overall i/o performance is okay, but from time to
time the raid controller or the FreeBSD raid controller driver stops
handling i/o requests and stalls completely. So far we have been able to
narrow the problem down to the driver or the controller. During such an
i/o stall the server is still operating, but all disk i/o operations are
queued by the kernel and are never completed by the controller or the
disks. We also tried to access the disks directly during such a stall,
avoiding any filesystem layers, but again we could only see that those
requests were queued and never finished by the disks. The only way to get
the system back into a working state is a complete reboot.
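
As a rough illustration, this is the kind of check we run during a stall
to confirm that requests are queued but never completed (a sketch only;
the wait-channel names are what we typically observe, not an exhaustive
list):

  # watch the L(q) column for the ciss-attached da devices: the queue
  # length keeps growing while the ops/s columns drop to zero and stay there
  gstat

  # processes touching the disks sit indefinitely in disk-wait channels
  # such as biowr/wdrain (MWCHAN column)
  ps axl | grep -E 'biowr|wdrain'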

So far we have done several analyses of logs and kernel statistics to
figure out where the problem could come from. We managed to exclude
almost everything except the raid controller itself and the kernel driver
(ciss). While experiencing the i/o stall we tried to write directly to
the swap partitions of different disks, but this i/o could not be
completed either. With these tests we tried to exclude any problems
caused by the filesystem.
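
For illustration, the direct write test looked roughly like this (a
sketch; the partition name da0p3 is only an example, the actual swap
device differs per system):

  # make sure the swap partition is not in use, then write to the raw
  # device, bypassing any filesystem; during a stall this dd never completes
  swapoff /dev/da0p3
  dd if=/dev/zero of=/dev/da0p3 bs=64k count=16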

It is very difficult for us to reproduce the i/o load needed to cause
this problem. So far we only know that systems experiencing higher disk
i/o are affected more often. Some of our database servers experience
extraordinarily high disk i/o during peak hours, and crashes occur more
often there. Some systems run for a couple of weeks without any problems
and some crash several times a day - but then again they run 1-2 weeks
without any problems.

We are also not sure whether we face exactly the same problem on servers
with the P220i and on servers with the P410i raid controller, but the
symptoms on the systems are very similar.

We had contact with Linux/cciss staff, and some of their developers recommended that we file a PR - here it is.

At the moment we are testing the "old SIMPLE mode", activated via loader.conf, on our Gen8 machines; earlier tests on our Gen7 storage-blade systems did not succeed - to be clear, it reduced the frequency of the stalls but did not resolve them (on the Gen7 storage-blade systems).
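
For reference, the loader.conf(5) setting used for these tests is roughly
the following (the hw.ciss.force_transport tunable comes from the ciss(4)
driver; to our understanding a value of 1 forces the simple transport
method instead of the performant one):

  # /boot/loader.conf
  # force the ciss(4) driver to use the old SIMPLE transport method
  hw.ciss.force_transport="1"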
>How-To-Repeat:
In short: heavy i/o and network operations for around 1-2 weeks on the listed kind of hardware; the more machines you have, the more often the i/o stalls happen.

With 15 Gen8 machines we experience the problem around 6-7 times per week; often 2-3 days pass completely without problems. The problem also occurs outside of heavy-load times, for example in the early morning (with no backup running).

We took the Gen7 storage-blade systems out of production so that they no longer interrupt productive environments; but on a hot-standby system we can reproduce the stall every 1-3 weeks as long as we keep scrubbing all the time (to reproduce it...).
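
The reproduction load on the hot-standby system is roughly the following
loop (a sketch only; it assumes the storage is a ZFS pool, here called
"tank", which is an example name):

  # keep the controller under constant read load by scrubbing in a loop
  while true; do
      zpool scrub tank
      # wait until the current scrub has finished before starting the next one
      while zpool status tank | grep -q "scrub in progress"; do
          sleep 60
      done
  done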

In detail:
It is very difficult for us to reproduce the i/o load needed to cause
this problem. So far we only know that systems experiencing higher disk
i/o are affected more often. Some of our database servers experience
extraordinarily high disk i/o during peak hours, and crashes occur more
often there. Some systems run for a couple of weeks without any problems
and some crash several times a day - but then again they run 1-2 weeks
without any problems.

We are also not sure whether we face exactly the same problem on servers
with the P220i and on servers with the P410i raid controller, but the
symptoms on the systems are very similar.

>Fix:


>Release-Note:
>Audit-Trail:
>Unformatted:

