[Bug 261690] NFSv4 mount on Linux client hangs during complex access patterns (gcc bootstrapping on client)

From: <bugzilla-noreply_at_freebsd.org>
Date: Tue, 08 Feb 2022 05:42:18 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=261690

--- Comment #2 from Joshua Kinard <kumba@gentoo.org> ---
This is a curiously apt find.  I've got two old MIPS systems of the same
machine type (SGI Octane), running a Linux 5.4 LTS kernel w/ custom patches and
a Gentoo userland, which mount the Gentoo "portage" tree (similar to Ports) and
several other common folders over NFSv4.2.  The NFS server is my NAS running
FreeBSD 13.0-RELEASE-p7 and the exports are on a ZFS file system.

The FreeBSD kernel on the NAS box is a custom configuration and has these
patches from Phab applied on top:
  - https://reviews.freebsd.org/D18985
  - https://reviews.freebsd.org/D29088
  - https://reviews.freebsd.org/D29315
  - https://reviews.freebsd.org/D29772
  - https://reviews.freebsd.org/D29838
  - https://reviews.freebsd.org/D30318
  - https://reviews.freebsd.org/D32724

It also has these patches from Bugzilla PRs applied (some cherry-picked before
their bugs were fixed and closed):
  - Bug #254560
  - Bug #254590
  - Bug #256280
  - Bug #260293
  - Bug #260375
  - Bug #260884

When either of those two MIPS machines builds gcc-11.2.1 snapshots, or even
sometimes recent glibc, there's a reasonable probability that it will crash
with an "Oops" (a Linux kernel fault report, akin to a panic).  The cause of
the oops is tangential to the issue as far as I can tell, because what really
happens is that cc1plus hits an invalid memory access while compiling, which
triggers the Linux/mips page faulting code on the first CPU, and while that
goes on, the second CPU tries running an interrupt handler to update process
ticks, which causes the kernel to dereference a NULL pointer and thus oops the
machine.  The machine isn't totally dead, though, which is pretty weird,
because an oops usually kills Linux.  The machine will still respond to pings
(intermittently) and SSH sessions will remain connected, but they stop
responding to commands.

For the last few weeks, I have been scratching my head at the oops data, and
nothing makes sense about it.  This bug report, however, does.  Or, at least,
it's the best find I've come across so far.  Many of the characteristics
described in the original report match my setup (FreeBSD 13 on the NFS server,
Linux 5.4.x on the client, NFSv4.x mounts, compiling gcc/cc1plus, dead-ending
in the Linux kernel scheduler path, etc.).

I am currently trying to port the MIPS kernel for these machines up to the next
Linux LTS release (5.10) to see if that changes anything.

I tried running the Perl script on the MIPS machine, and on the first run, it
triggered a page fault and threw a SIGSEGV due to memory exhaustion (2GB RAM in
each machine).  In multiple subsequent runs, though, the Perl script finished
(I think): it stopped at 226 threads and then claimed it was out of memory.
I could not get the machine to oops in the same way gcc/cc1plus does.

Thing is, I've been running this kind of a setup for well over a year.  The two
MIPS machines have been on a 5.4 kernel for at least the last six months, and
up until about three weeks ago, all seemed fine, which kinda suggests the fault
may really be on the Linux side of things.

I'll also add that, unlike the original report, both MIPS machines run the
actual gcc compile in a folder on the local disk via a bind mount.  During the
compile, there shouldn't be a whole lot of NFS chatter because, the way Portage
works, all of the needed package data gets saved to the build directory on the
local disk.  But I can't rule out that something is still slightly wacky with
periodic NFS commands between the client and the server causing an issue while
the machine is under stress compiling gcc.
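
To double-check which paths on a client are actually NFS-backed and which are
local, something like the following minimal sketch could be used.  It parses
/proc/mounts-style text (device, mount point, fstype, options) and reports the
NFS entries with their negotiated "vers=" option.  The host name and paths in
SAMPLE are made up for illustration and are not my real layout.

```python
import os


def nfs_mounts(mounts_text):
    """Return (mountpoint, fstype, version) for each NFS line in /proc/mounts-style text."""
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        _device, mountpoint, fstype, options = fields[:4]
        if fstype not in ("nfs", "nfs4"):
            continue  # local filesystems (including bind mounts of them) land here
        version = None
        for opt in options.split(","):
            if opt.startswith("vers="):
                version = opt[len("vers="):]
        found.append((mountpoint, fstype, version))
    return found


# Hypothetical sample data; the server name and directories are assumptions.
SAMPLE = """\
/dev/sda1 / ext4 rw,relatime 0 0
nas:/export/portage /usr/portage nfs4 rw,relatime,vers=4.2,proto=tcp 0 0
/dev/sda1 /var/tmp/portage ext4 rw,relatime 0 0
"""

if __name__ == "__main__":
    if os.path.exists("/proc/mounts"):
        with open("/proc/mounts") as fh:
            text = fh.read()
    else:
        text = SAMPLE
    for entry in nfs_mounts(text):
        print(entry)
```

A bind mount of a local directory shows up with the local filesystem type, so
it is filtered out here; only the mounts that really talk to the server remain.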

I will have to go back through the recent 5.4 stable releases and look for any
commits to the Linux NFSv4.x client code that could explain things.  But I
figured I'd describe my scenario here as well in case it offers any clues.
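
As a rough sketch of that commit search, the helper below builds a "git log"
command limited to fs/nfs, which is where the NFS client lives in the Linux
tree.  The two tags are placeholders: I haven't pinned down which 5.4.y point
releases bracket the regression, so substitute the real ones.

```python
import shlex


def nfs_client_log_cmd(old_tag, new_tag, paths=("fs/nfs",)):
    """Build a git command listing commits between two tags that touch the given paths."""
    return ["git", "log", "--oneline", f"{old_tag}..{new_tag}", "--", *paths]


if __name__ == "__main__":
    # Placeholder tags (assumption): replace with the actual stable releases.
    cmd = nfs_client_log_cmd("v5.4.OLD", "v5.4.NEW")
    # Print the command for review; run it by hand inside a linux-stable checkout.
    print(shlex.join(cmd))
```

Adding "net/sunrpc" to paths would widen the search to the RPC layer if the
NFS client directory alone turns up nothing.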
