Problem with a LSILogic SAS/SATA adapter on 8.2-STABLE/ZFSv28

Sat Jun 18 14:45:39 UTC 2011

On Sat, Jun 18, 2011 at 11:07:38PM +0900, Stephane LAPIE wrote:
> I have a problem with my 8.2-STABLE/ZFSv28 server. I am currently
> upgrading my disks from 1.5TB Seagate drives to 2TB Seagate drives, and
> therefore replacing devices within ZFS. (I have activated deduplication
> on a few file systems, for the record)
> 
> I think this is more related to a hardware problem (flaky memory ? flaky
> controller/driver maybe ?), but I would appreciate any input.
> 
> I experienced several kernel panics, all of which seem to point at mpt0
> mis-handling interrupts :
> www.darkbsd.org/~darksoul/kernel-panic-mpt1.txt (no target cmd ptrs)
> www.darkbsd.org/~darksoul/kernel-panic-mpt2.txt (mpt_intr index == ...)
> www.darkbsd.org/~darksoul/kernel-panic-mpt3.txt (NMI in kernel mode)
> www.darkbsd.org/~darksoul/kernel-panic-mpt4.txt (LAN CONTEXT REPLY)
> www.darkbsd.org/~darksoul/kernel-panic-mpt5.txt (LAN CONTEXT REPLY)
> www.darkbsd.org/~darksoul/kernel-panic-mpt6.txt (LAN CONTEXT REPLY)
> www.darkbsd.org/~darksoul/kernel-panic-mpt7.txt (LAN CONTEXT REPLY)
> 
> I would appreciate any pointers to what on earth "LAN CONTEXT REPLY"
> means for an LSI controller (using driver mpt(4)), as I have no idea,
> and the source was not really helpful.
> 
> The error message about an NMI and RAM parity error is what is scaring
> me the most here, and points me in the direction of flaky memory.
> 
> This is a personal machine, so I can add debug options and try stuff if
> it can help figure out what is going on. Also, any critical data is
> replicated, backed up and accounted for.

For readers, the NMI and RAM parity error message in question is
shown here:

http://www.darkbsd.org/~darksoul/kernel-panic-mpt2.txt

It is difficult to decode, however, due to the well-established problem
with the FreeBSD kernel interspersing text output.  (I imagine this gets
worse the more cores you have on your system, but that's not relevant to
this discussion.)
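
If the panic messages are saved locally, a rough way to tally which
signature shows up in which file is something like the sketch below.
The signature strings are assumptions taken from the labels in the
original report, not from the mpt(4) source, and the file names passed
on the command line are whatever you saved the logs as.  Because of the
interleaving problem a signature can be split up in the output, so a
miss here does not prove the message is absent.

```python
import re
import sys

# Assumed signature strings, copied from the labels in the report above.
SIGNATURES = [
    "no target cmd ptrs",
    "mpt_intr index ==",
    "NMI",
    "RAM parity error",
    "LAN CONTEXT REPLY",
]


def _pattern(signature):
    """Build a regex that tolerates line wraps between the words."""
    body = r"\s+".join(re.escape(word) for word in signature.split())
    # Avoid matching inside a longer word (e.g. "NMI" inside other text).
    tail = r"(?!\w)" if signature[-1].isalnum() else ""
    return re.compile(r"(?<!\w)" + body + tail, re.IGNORECASE)


def find_signatures(panic_text):
    """Return the signatures found in panic_text, in SIGNATURES order."""
    return [s for s in SIGNATURES if _pattern(s).search(panic_text)]


if __name__ == "__main__":
    for path in sys.argv[1:]:
        with open(path, errors="replace") as fh:
            print(path, find_signatures(fh.read()))
```

Run it with the saved panic files as arguments; each line of output
lists one file and the signatures matched in it.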

Anyway, to expand on the "RAM parity error" and NMI message: this
information I'm going to give you isn't specific to the LSI controller;
it's a general piece of information.  I've talked about this in the
past.  Please read it and focus on the SERR/PERR and NMI details:

http://lists.freebsd.org/pipermail/freebsd-fs/2011-March/010938.html

If you want to rule out actual system RAM issues, I would recommend
running memtest86 for about 30 minutes, and then memtest86+ for the same
amount of time.  This might sound crazy ("why can't I just run one?!"),
but you need to review the ChangeLog for memtest86 to see why.  Their
support for detecting corrected ECC errors was removed with 4.0, but in
4.0 they added multi-CPU support (which is good to have in this
situation), while memtest86 may still have support for ECC.

Neither of these utilities is as good as a hardware RAM tester (which
does cool things like applying extreme voltages to each DRAM module,
looking for soft and hard errors, etc.), but those are expensive.
Usually system memory problems will show up in memtest86/86+ pretty
quickly though.

All that said: it's possible that the NMIs you're seeing aren't being
induced by system RAM issues at all, but are somehow being generated or
caused by the LSI controller.  I wasn't under the impression that a PCIe
MSI and/or MSI-X could generate an NMI, but I could be completely wrong.

You may want to try the memtest86/86+ tests with and without the LSI
controller plugged into the system to see if there's any difference as
well.  So that's another hour of testing.

Anyway, hope this helps in some regard.

P.S. -- In the future, try to avoid cross-posting.  :-)

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                   Mountain View, CA, US |
| Making life hard for others since 1977.               PGP 4BD6C0CB |