Re: FreeBSD panics possibly caused by nfs clients

From: Mark Johnston <markj_at_freebsd.org>
Date: Fri, 09 Feb 2024 16:04:30 UTC
On Thu, Feb 08, 2024 at 03:34:52PM +0000, Matthew L. Dailey wrote:
> Good morning all,
> 
> Per Rick Macklem's suggestion, I'm posting this query here in the hopes 
> that others may have ideas.
> 
> We did do some minimal testing with ufs around this problem back in 
> August, but hadn't narrowed the issue down to hdf5 workloads yet, so 
> testing was inconclusive. We'll do further testing on this to try to 
> rule in or out zfs as a contributing factor.
> 
> I'm happy to provide more technical details about this issue, but it is 
> quite complex to explain and reproduce.

It sounds like you've so far only tested with release kernels, is that
right?  To make some progress and narrow this down a bit, we'd want to
see testing results from a debug kernel.  Unfortunately we don't ship
pre-compiled debug kernels with releases, so you'll have to compile your
own, or test a snapshot of the development branch.  To do the former,
add the following lines to /usr/src/sys/amd64/conf/GENERIC and follow
the steps here: https://docs.freebsd.org/en/books/handbook/kernelconfig/#kernelconfig-building

options DDB
options INVARIANT_SUPPORT
options INVARIANTS
options QUEUE_MACRO_DEBUG_TRASH
options WITNESS
options WITNESS_SKIPSPIN
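
For example, assuming your source tree is at /usr/src and you've added
the options above to GENERIC, the build and install steps look roughly
like this (the handbook link above has the details):

# build and install a debug GENERIC kernel, then reboot into it
cd /usr/src
make -j$(sysctl -n hw.ncpu) buildkernel KERNCONF=GENERIC
make installkernel KERNCONF=GENERIC
shutdown -r now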

Since the problem appears to be some random memory corruption, I'd also
suggest using a KASAN(9) kernel in your test VM.  If the root cause of
the crashes is some kind of use-after-free or buffer overflow, KASAN has
a good chance of catching it.  Note that both debug kernels and KASAN
kernels are significantly slower than release kernels, so you'll want
to deploy them only in your test VM.  Once you do, and a panic occurs,
share the panic message and backtrace here, and we'll have a better idea
of where to start.
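
For reference, a KASAN kernel is just GENERIC (plus the debug options
above, if you like) with one more line in the config; kasan(9) has the
details.  You'll also want crash dumps enabled so the panic message and
backtrace survive the reboot.  A minimal sketch, assuming the test VM's
swap is large enough to hold a dump (kgdb comes from the devel/gdb
package; adjust the dump number to whatever savecore writes):

# kernel config: enable the kernel address sanitizer
options KASAN

# /etc/rc.conf: dump to swap on panic, extract to /var/crash on reboot
dumpdev="AUTO"

# pull the panic message and backtrace out of the saved dump
kgdb /boot/kernel/kernel /var/crash/vmcore.0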

> Thanks in advance for any help!
> 
> Best,
> Matt
> 
> On 2/7/24 6:10 PM, Rick Macklem wrote:
>  >
>  > Well, there is certainly no obvious answer.
>  > One thing I would suggest is setting up a test
>  > server using UFS and see it if panics.
>  >
>  > To be honest, NFS is pretty simple when it comes
>  > to server side reading/writing. It basically translates
>  > NFS RPCs to VOP calls on the underlying file system.
>  > As such, issues are usually either in the network fabric on one side
>  > (everything from cables to NIC drivers and the TCP stack) or in the
>  > underlying file system on the other.
>  > Since you are seeing this with assorted NICs and FreeBSD
>  > versions, I doubt it is network fabric related, but??
>  > This is why I'd suspect the ZFS side and trying UFS would
>  > isolate the problem to ZFS.
>  >
>  > Although I know nothing about hdf5 files, the NFS code does
>  > nothing with the data (it is just a byte stream).
>  > Since ZFS normally does things like compression, it could be
>  > affected by the data contents.
>  >
>  > Another common (although less so now) problem is TSO
>  > when data of certain sizes (often just less than 64K) is handled.
>  > However, this usually results in hangs and I have never heard
>  > of memory data corruption.
>  >
>  > Bottom line..determining if it is ZFS specific would be the best
>  > first step, I think?
>  >
>  > It would be good to post this to a mailing list like freebsd-current@,
>  > since others might have some insight into this. (Although you are
>  > not running freebsd-current, that is where most developers read
>  > email.)
>  >
>  > rick
>  >
>  >
>  > On Wed, Feb 7, 2024 at 10:50 AM Matthew L. Dailey
>  > <Matthew.L.Dailey@dartmouth.edu> wrote:
>  >>
>  >>
>  >> Hi Rick,
>  >>
>  >> My name is Matt Dailey, and (among many other things), I run a FreeBSD
>  >> file server for the Thayer School of Engineering and the Department of
>  >> Computer Science here at Dartmouth College. We have run into a very odd
>  >> issue in which nfs4 Linux clients using hdf5 files are corrupting memory
>  >> and causing kernel panics on our FreeBSD server. The issue is very
>  >> complex to describe, and despite our diligent efforts, we have not been
>  >> able to replicate it in a simple scenario to report to FreeBSD
>  >> developers. In advance of filing an official bug report, I’m reaching
>  >> out in the hopes of having a discussion to get your guidance about how
>  >> best to proceed.
>  >>
>  >> The quick background is that we’ve been running a FreeBSD file server,
>  >> serving files from a zfs filesystem over kerberized nfs4 and samba for
>  >> almost 11 years, through 3 different generations of hardware and from
>  >> FreeBSD 9.1 up through 13.2. This system has historically been
>  >> wonderfully stable and robust.
>  >>
>  >> Beginning late in 2022, and then more regularly beginning in July of
>  >> 2023, we started experiencing kernel panics on our current system, then
>  >> running FreeBSD 13.0. They were seemingly random (mostly trap 12 and
>  >> trap 9) in random kernel functions, so we initially blamed hardware. We
>  >> replaced all RAM, moved to backup hardware, and even moved to an older,
>  >> retired system and the panics persisted. We have also upgraded from 13.0
>  >> to 13.2 and are currently at 13.2p5.
>  >>
>  >> After months of investigation, we finally narrowed down that these
>  >> panics were being caused by software on our Linux clients writing hdf5
>  >> files over nfs to the FreeBSD server. As near as we can tell from poring
>  >> through core dumps, something about how this nfs traffic is being
>  >> processed is corrupting kernel memory, and a panic eventually
>  >> occurs when some unsuspecting function reads the corrupted memory.
>  >> Since we have eliminated most known hdf5 workloads on our production
>  >> system, the panics have mostly ceased, and we suspect that the remaining
>  >> crashes could be from users still using hdf5.
>  >>
>  >> We have reproduced this issue with both krb5 and sys mounts, and up
>  >> through 13.2p9 and 14.0p4. All our testing has been using nfs 4.1.
>  >> Depending on conditions, panics sometimes occur within an hour or two,
>  >> or sometimes can take several days. On 14.0, the panics seem much less
>  >> prevalent, but still exist. With 13.x, it is easier to reproduce on
>  >> hardware than it is on a VM. With 14.0, it is the opposite. We have
>  >> panicked our test VM many times in the past two weeks, but have not yet
>  >> had a panic on test hardware. And, one of the panics on the VM actually
>  >> corrupted the root zpool such that it could not be imported or
>  >> recovered!
>  >>
>  >> I’m contacting you directly as you seem to be “the guy” when it comes to
>  >> all things nfs on FreeBSD. After hundreds of hours investigating, I’m
>  >> left with the feeling that a bug report describing these complex
>  >> scenarios will be extremely unappetizing to sink one’s teeth into. It
>  >> feels like a discussion would accomplish more. We would be able to
>  >> answer questions to describe this complex issue with the nuance you
>  >> need, and you might have ideas for how we could help narrow things down
>  >> further for the benefit of both FreeBSD and Dartmouth.
>  >>
>  >> We currently are serving about 400TB of data to faculty, staff,
>  >> researchers and students (hundreds of users) and are eager to get this
>  >> into the hands of folks that can help diagnose and fix this issue.
>  >>
>  >> Thanks so much for any help you can provide.
>  >>
>  >> Best,
>  >> Matt
>  >>
>  >> --
>  >> Matthew L. Dailey
>  >> Enterprise Systems Engineer
>  >> Thayer School of Engineering
>  >> Dartmouth College
>  >> +1 (603) 646-2760