ZFS HBAs + LSI chip sets (Was: ZFS hang (system #2))
John
jwd at FreeBSD.org
Tue Oct 23 01:55:46 UTC 2012
----- Dennis Glatting's Original Message -----
> On Mon, 2012-10-22 at 09:31 -0700, Freddie Cash wrote:
> > On Mon, Oct 22, 2012 at 6:47 AM, Freddie Cash <fjwcash at gmail.com> wrote:
> > > I'll double-check when I get to work, but I'm pretty sure it's 10.something.
> >
> > mpt(4) on alpha has firmware 1.5.20.0.
> >
> > mps(4) on beta has firmware 09.00.00.00, driver 14.00.00.01-fbsd.
> >
> > mps(4) on omega has firmware 10.00.02.00, driver 14.00.00.01-fbsd.
> >
> > Hope that helps.
> >
>
> Because one of the RAID1 OS disks failed (System #1), I replaced both
> disks and downgraded to stable/8. Two hours ago I submitted a job.
>
> I noticed on boot smartd issued warnings about disk firmware, which I'll
> update this coming weekend, unless the system hangs before then.
>
> I first want to see if that system will also hang under 8.3. I have
> noticed a looping "ls" of the target ZFS directory is MUCH snappier
> under 8.3 than 9.x.
>
> My CentOS 6.3 ZFS-on-Linux system (System #3) is crunching along (24
> hours now). This system under stable/9 would previously spontaneously
> reboot whenever I sent a ZFS data set to it.
>
> System #2 is hung (stable/9).
Hi Folks,
I just caught up on this thread and thought I'd toss out some info.
I have a number of systems running 9-stable (with some local patches),
none running 8.
The basic architecture is: http://people.freebsd.org/~jwd/zfsnfsserver.jpg
LSI SAS 9201-16e 6G/s 16-Port SATA+SAS Host Bus Adapter
All cards are up-to-date on firmware:
mps0: Firmware: 14.00.00.00, Driver: 14.00.00.01-fbsd
mps1: Firmware: 14.00.00.00, Driver: 14.00.00.01-fbsd
mps2: Firmware: 14.00.00.00, Driver: 14.00.00.01-fbsd
All drives are configured with geom multipath.
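For anyone unfamiliar with the setup, a minimal sketch of how such a multipath configuration is typically labeled (the DISK01/da10/da42 names are placeholders, not from my actual systems):

```shell
# Hypothetical example: the same physical disk shows up as two da(4)
# devices, one per HBA path. Label them as a single multipath device:
gmultipath label -v DISK01 /dev/da10 /dev/da42

# Verify the multipath device and the state of each path:
gmultipath status DISK01

# The pool is then built on the /dev/multipath/* nodes, e.g.:
# zpool create tank raidz2 multipath/DISK01 multipath/DISK02 ...
```

This is a setup/config fragment for a FreeBSD system with dual-pathed SAS enclosures, so the exact device names will differ per machine.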
Currently, these systems are used almost exclusively for iSCSI.
I have seen no lockups that I can track down to the driver. I have seen
one lockup which I did post about (received no feedback) where I believe
an active I/O from istgt is interrupted by an ABRT from the client, which
causes a lock-up. This one is hard to replicate and is on the to-do list.
It is worth noting that a few drives were replaced early on
due to various I/O problems and one with what might be considered a
lockup. As has been noted elsewhere, watching gstat can be informative.
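As a quick illustration of the kind of gstat invocation I mean (flags are standard gstat(8) options; the interpretation note is my own rule of thumb):

```shell
# Watch per-disk I/O, refreshed every second, physical providers only
# (-p skips partitions and labels, so only whole disks are shown):
gstat -p -I 1s

# A single drive pinned near 100 %busy with high ms/w while its pool
# siblings sit idle is a classic sign of a sick disk or cable.
```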
Also make sure cables are firmly plugged in. Seems obvious, I know.
I did recently commit a small patch to current to handle the case
where, on systems with more than 255 disks, the 255th disk is
hidden/masked by the initiator id that is statically coded into
the mps driver.
I think it would be good to document in more detail the type of
mount and the test job/stream that is running when/if you see a lockup.
I am not currently using NFS so there is an entire code-path I
am not exercising.
Servers are 12-processor, 96GB RAM. The highest CPU load I've
seen on these systems is about 800%.
All networking is 10G via Chelsio cards - configured to
use isr maxthread 6 with a defaultqlimit of 4096. I have seen
no problems in this area.
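For reference, the netisr tuning above corresponds to loader tunables along these lines (values are the ones from this setup; tune for your own NIC and workload):

```shell
# /boot/loader.conf -- netisr tuning for the 10G Chelsio interfaces
net.isr.maxthreads=6
net.isr.defaultqlimit=4096
```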
Hope this helps a bit. Happy to answer questions.
Cheers,
John
ps: With all that's been said above, it's worth noting that a correctly
configured client makes a huge difference.