[Bug 260664] FreeBSD randomly freeze or crash with nfs mount after some days or a month.

From: <bugzilla-noreply_at_freebsd.org>
Date: Fri, 24 Dec 2021 14:41:54 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=260664

            Bug ID: 260664
           Summary: FreeBSD randomly freeze or crash with nfs mount after
                    some days or a month.
           Product: Base System
           Version: 13.0-RELEASE
          Hardware: amd64
                OS: Any
            Status: New
          Severity: Affects Some People
          Priority: ---
         Component: kern
          Assignee: bugs@FreeBSD.org
          Reporter: ypnow@163.com

Hello everyone, I have a server running FreeBSD 13.0. The server randomly
freezes after running for some days or up to a month.

Here are the symptoms:
1. SSH connections fail: the ssh command gets no response at all.
2. Many services become unreachable. A simple service such as a static nginx
page opens briefly, but after refreshing the page a few times it hangs and
stops responding. Some other services behave the same way.
3. Another server keeps an SSH session to the FreeBSD server open at all
times, with top running in it. When FreeBSD freezes, this session still works
and top keeps refreshing the system status: memory, CPU usage, ZFS ARC, swap
and clock all look normal. Every hot key in top works, but after pressing q to
quit top, any other command, such as "systat -ifstat", hangs with no output
and does not respond to Ctrl+Z or Ctrl+C.
4. Ping to the server always works.
5. redis-server on FreeBSD is fine: the redis service still responds quickly.
6. Console login fails: after typing the username and password and pressing
Enter, there is no output.

Environment:
FreeBSD 13.0, Intel Xeon (4 cores), 16 GB memory, two 2 TB disks in a ZFS
mirror, root on ZFS. The machine is new; we bought it less than half a year
ago.
The host system runs only sshguard and ipfw. It mounts an NFS share and passes
it into a jail with nullfs. The jail file system is a clone of a ZFS dataset,
and all services run in this jail.

The server has two bge network interfaces, one for LAN and one for WAN, and
the services are network-heavy.
The jail runs nginx, php-fpm, the PHP CLI server, mysql and redis-server, and
PHP does a lot of NFS reads and writes.

Attempt 1:
At first I suspected a ZFS ARC problem and set the ARC maximum to 2G, but the
ARC looks completely normal in top.
dmesg and every system and service log stop recording when the system freezes,
so there is no abnormal entry in any log. Services that do not need to read or
write files seem to keep working.

Attempt 2:
Before I configured kern.ipc.somaxconn, I could not log in or do anything at
all when the system hung. After changing somaxconn, I can still log in when
the system hangs (the worker processes freeze) and run operations that do not
touch the NFS mountpoint. From this I found that the freeze happens because
the NFS mount is dead.
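
Since the host stays usable as long as nothing touches the NFS mountpoint, one
way to confirm a dead mount is to probe it from a disposable child process, so
the shell running the check cannot hang itself. This is only a minimal Python
sketch: the function name mount_responds, the 5-second timeout and the path
/mnt/nfs are assumptions for illustration, not values from this server.

```python
import subprocess
import sys


def mount_responds(path, timeout_s=5.0):
    """Return True if stat(path) finishes in a child process within timeout_s.

    The stat runs in a separate Python process, so a mount that blocks in the
    kernel only blocks that child, not the caller.
    """
    probe = [sys.executable, "-c", "import os, sys; os.stat(sys.argv[1])", path]
    proc = subprocess.Popen(
        probe, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    try:
        # wait() raises TimeoutExpired without blocking past the deadline.
        return proc.wait(timeout=timeout_s) == 0
    except subprocess.TimeoutExpired:
        # Best effort only: a child stuck inside the kernel may not exit
        # until the mount recovers, so do not wait for it here.
        proc.kill()
        return False


if __name__ == "__main__":
    # "/mnt/nfs" is a hypothetical mountpoint; pass the real one as argv[1].
    target = sys.argv[1] if len(sys.argv) > 1 else "/mnt/nfs"
    print(target, "responds" if mount_responds(target) else "does not respond")
```

The function also returns False when the path does not exist, so a False
result means "stat failed or did not finish in time", not strictly "hung".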

I found some similar reports:
https://emby.media/community/index.php?/topic/74175-freebsd-jail-with-nfsv4-share-causes-system-to-hang/
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=251347
https://redmine.ixsystems.com/issues/2068

Attempt 3:
I moved the program that reads and writes NFS out of the jail and added intr
to the NFS mount options. After that the system crashed more often, and every
crash has a different stack trace: sometimes arc_write, sometimes nfscl,
sometimes zio_execute. I now suspect that a hardware failure is quite likely.


- uname -a

FreeBSD ppbsd 13.0-RELEASE-p4 FreeBSD 13.0-RELEASE-p4 #0: Tue Aug 24 07:33:27 UTC 2021 root@amd64-builder.daemonology.net:/usr/obj/usr/src/amd64.amd64/sys/GENERIC amd64

- loader.conf

kern.geom.label.disk_ident.enable="0"
kern.geom.label.gptid.enable="0"
cryptodev_load="YES"
zfs_load="YES"
coretemp_load="YES"
net.inet.ip.fw.default_to_accept=1
vfs.zfs.arc_max="2G"

# Increase dmesg buffer to fit longer boot output.
kern.msgbufsize="524288"

- sysctl.conf

# $FreeBSD$
#
#  This file is read when going to multi-user and its contents piped thru
#  ``sysctl'' to adjust kernel values.  ``man 5 sysctl.conf'' for details.
#

# Uncomment this to prevent users from seeing information about processes that
# are being run under another UID.
#security.bsd.see_other_uids=0
#vfs.zfs.min_auto_ashift=12
kern.ipc.somaxconn=4096

Crash log 1:

Dec 17 00:47:02 ppbsd kernel: Fatal trap 12: page fault while in kernel mode
Dec 17 00:47:02 ppbsd kernel: cpuid = 2; apic id = 04
Dec 17 00:47:02 ppbsd kernel: fault virtual address     = 0x28
Dec 17 00:47:02 ppbsd kernel: fault code                = supervisor read data, page not present
Dec 17 00:47:02 ppbsd kernel: instruction pointer       = 0x20:0xffffffff821495f8
Dec 17 00:47:02 ppbsd kernel: stack pointer             = 0x0:0xfffffe010fae48d0
Dec 17 00:47:02 ppbsd kernel: frame pointer             = 0x0:0xfffffe010fae48d0
Dec 17 00:47:02 ppbsd kernel: code segment              = base 0x0, limit 0xfffff, type 0x1b
Dec 17 00:47:02 ppbsd kernel:                   = DPL 0, pres 1, long 1, def32 0, gran 1
Dec 17 00:47:02 ppbsd kernel: processor eflags  = interrupt enabled, resume, IOPL = 0
Dec 17 00:47:02 ppbsd kernel: current process           = 0 (z_wr_int_3)
Dec 17 00:47:02 ppbsd kernel: trap number               = 12
Dec 17 00:47:02 ppbsd kernel: panic: page fault
Dec 17 00:47:02 ppbsd kernel: cpuid = 2
Dec 17 00:47:02 ppbsd kernel: time = 1639673010
Dec 17 00:47:02 ppbsd kernel: KDB: stack backtrace:
Dec 17 00:47:02 ppbsd kernel: #0 0xffffffff80c574c5 at kdb_backtrace+0x65
Dec 17 00:47:02 ppbsd kernel: #1 0xffffffff80c09ea1 at vpanic+0x181
Dec 17 00:47:02 ppbsd kernel: #2 0xffffffff80c09d13 at panic+0x43
Dec 17 00:47:02 ppbsd kernel: #3 0xffffffff8108b1b7 at trap_fatal+0x387
Dec 17 00:47:02 ppbsd kernel: #4 0xffffffff8108b20f at trap_pfault+0x4f
Dec 17 00:47:02 ppbsd kernel: #5 0xffffffff8108a86d at trap+0x27d
Dec 17 00:47:02 ppbsd kernel: #6 0xffffffff81061958 at calltrap+0x8
Dec 17 00:47:02 ppbsd kernel: #7 0xffffffff821a4d3e at dbuf_write_done+0x9e
Dec 17 00:47:02 ppbsd kernel: #8 0xffffffff82190c5c at arc_write_done+0x33c
Dec 17 00:47:02 ppbsd kernel: #9 0xffffffff822f920d at zio_done+0xd9d
Dec 17 00:47:02 ppbsd kernel: #10 0xffffffff822f2d5c at zio_execute+0x3c
Dec 17 00:47:02 ppbsd kernel: #11 0xffffffff80c6b161 at taskqueue_run_locked+0x181
Dec 17 00:47:02 ppbsd kernel: #12 0xffffffff80c6c47c at taskqueue_thread_loop+0xac
Dec 17 00:47:02 ppbsd kernel: #13 0xffffffff80bc7dde at fork_exit+0x7e
Dec 17 00:47:02 ppbsd kernel: #14 0xffffffff810629de at fork_trampoline+0xe
Dec 17 00:47:02 ppbsd kernel: Uptime: 1h44m58s
Dec 17 00:47:02 ppbsd kernel: Dumping 2511 out of 16190 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%---<<BOOT>>---
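
Every line of this log carries the same syslog timestamp, so the "time ="
field printed by the panic is the more precise record of when the crash
happened. It is a Unix epoch value in seconds. The short Python sketch below
converts it to UTC; the helper name panic_time_utc is only illustrative, and
no assumption is made here about the server's local timezone.

```python
from datetime import datetime, timezone


def panic_time_utc(epoch_seconds):
    """Convert the panic 'time = N' value (Unix seconds) to an aware UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


if __name__ == "__main__":
    # Value taken from the "time = 1639673010" line of crash log 1.
    print(panic_time_utc(1639673010).isoformat())  # 2021-12-16T16:43:30+00:00
```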

Crash log 2:

Dec 21 14:09:50 ppbsd syslogd: kernel boot file is /boot/kernel/kernel
Dec 21 14:09:50 ppbsd kernel: 
Dec 21 14:09:50 ppbsd syslogd: last message repeated 1 times
Dec 21 14:09:50 ppbsd kernel: Fatal trap 12: page fault while in kernel mode
Dec 21 14:09:50 ppbsd kernel: cpuid = 2; apic id = 04
Dec 21 14:09:50 ppbsd kernel: fault virtual address     = 0x0
Dec 21 14:09:50 ppbsd kernel: fault code                = supervisor write data, page not present
Dec 21 14:09:50 ppbsd kernel: instruction pointer       = 0x20:0xffffffff80ac9e26
Dec 21 14:09:50 ppbsd kernel: stack pointer             = 0x28:0xfffffe011bf165b0
Dec 21 14:09:50 ppbsd kernel: frame pointer             = 0x28:0xfffffe011bf165f0
Dec 21 14:09:50 ppbsd kernel: code segment              = base 0x0, limit 0xfffff, type 0x1b
Dec 21 14:09:50 ppbsd kernel:                   = DPL 0, pres 1, long 1, def32 0, gran 1
Dec 21 14:09:50 ppbsd kernel: processor eflags  = interrupt enabled, resume, IOPL = 0
Dec 21 14:09:50 ppbsd kernel: current process           = 4541 (newnfs 3)
Dec 21 14:09:50 ppbsd kernel: trap number               = 12
Dec 21 14:09:50 ppbsd kernel: panic: page fault
Dec 21 14:09:50 ppbsd kernel: cpuid = 2
Dec 21 14:09:50 ppbsd kernel: time = 1640066765
Dec 21 14:09:50 ppbsd kernel: KDB: stack backtrace:
Dec 21 14:09:50 ppbsd kernel: #0 0xffffffff80c574c5 at kdb_backtrace+0x65
Dec 21 14:09:50 ppbsd kernel: #1 0xffffffff80c09ea1 at vpanic+0x181
Dec 21 14:09:50 ppbsd kernel: #2 0xffffffff80c09d13 at panic+0x43
Dec 21 14:09:50 ppbsd kernel: #3 0xffffffff8108b1b7 at trap_fatal+0x387
Dec 21 14:09:50 ppbsd kernel: #4 0xffffffff8108b20f at trap_pfault+0x4f
Dec 21 14:09:50 ppbsd kernel: #5 0xffffffff8108a86d at trap+0x27d
Dec 21 14:09:50 ppbsd kernel: #6 0xffffffff81061958 at calltrap+0x8
Dec 21 14:09:50 ppbsd kernel: #7 0xffffffff80acc5d9 at nfscl_hasexpired+0x709
Dec 21 14:09:50 ppbsd kernel: #8 0xffffffff80add066 at nfsrpc_read+0x316
Dec 21 14:09:50 ppbsd kernel: #9 0xffffffff80aee349 at ncl_readrpc+0x89
Dec 21 14:09:50 ppbsd kernel: #10 0xffffffff80b01443 at ncl_doio+0xe3
Dec 21 14:09:50 ppbsd kernel: #11 0xffffffff80b03b32 at nfssvc_iod+0x232
Dec 21 14:09:50 ppbsd kernel: #12 0xffffffff80bc7dde at fork_exit+0x7e
Dec 21 14:09:50 ppbsd kernel: #13 0xffffffff810629de at fork_trampoline+0xe
Dec 21 14:09:50 ppbsd kernel: Uptime: 2d17h24m52s
Dec 21 14:09:50 ppbsd kernel: Dumping 3243 out of 16190 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%---<<BOOT>>---

Crash log 3:

Dec 22 16:30:19 ppbsd syslogd: kernel boot file is /boot/kernel/kernel
Dec 22 16:30:19 ppbsd kernel: 
Dec 22 16:30:19 ppbsd syslogd: last message repeated 1 times
Dec 22 16:30:19 ppbsd kernel: Fatal trap 12: page fault while in kernel mode
Dec 22 16:30:19 ppbsd kernel: cpuid = 2; apic id = 04
Dec 22 16:30:19 ppbsd kernel: fault virtual address     = 0x2000
Dec 22 16:30:19 ppbsd kernel: fault code                = supervisor read instruction, page not present
Dec 22 16:30:19 ppbsd kernel: instruction pointer       = 0x20:0x2000
Dec 22 16:30:19 ppbsd kernel: stack pointer             = 0x28:0xfffffe010f9a4998
Dec 22 16:30:19 ppbsd kernel: frame pointer             = 0x28:0xfffffe010f9a49d0
Dec 22 16:30:19 ppbsd kernel: code segment              = base 0x0, limit 0xfffff, type 0x1b
Dec 22 16:30:19 ppbsd kernel:                   = DPL 0, pres 1, long 1, def32 0, gran 1
Dec 22 16:30:19 ppbsd kernel: processor eflags  = interrupt enabled, resume, IOPL = 0
Dec 22 16:30:19 ppbsd kernel: current process           = 0 (z_wr_int_2)
Dec 22 16:30:19 ppbsd kernel: trap number               = 12
Dec 22 16:30:19 ppbsd kernel: panic: page fault
Dec 22 16:30:19 ppbsd kernel: cpuid = 2
Dec 22 16:30:19 ppbsd kernel: time = 1640161595
Dec 22 16:30:19 ppbsd kernel: KDB: stack backtrace:
Dec 22 16:30:19 ppbsd kernel: #0 0xffffffff80c574c5 at kdb_backtrace+0x65
Dec 22 16:30:19 ppbsd kernel: #1 0xffffffff80c09ea1 at vpanic+0x181
Dec 22 16:30:19 ppbsd kernel: #2 0xffffffff80c09d13 at panic+0x43
Dec 22 16:30:19 ppbsd kernel: #3 0xffffffff8108b1b7 at trap_fatal+0x387
Dec 22 16:30:19 ppbsd kernel: #4 0xffffffff8108b20f at trap_pfault+0x4f
Dec 22 16:30:19 ppbsd kernel: #5 0xffffffff8108a86d at trap+0x27d
Dec 22 16:30:19 ppbsd kernel: #6 0xffffffff81061958 at calltrap+0x8
Dec 22 16:30:19 ppbsd kernel: #7 0xffffffff822e2d5c at zio_execute+0x3c
Dec 22 16:30:19 ppbsd kernel: #8 0xffffffff80c6b161 at taskqueue_run_locked+0x181
Dec 22 16:30:19 ppbsd kernel: #9 0xffffffff80c6c47c at taskqueue_thread_loop+0xac
Dec 22 16:30:19 ppbsd kernel: #10 0xffffffff80bc7dde at fork_exit+0x7e
Dec 22 16:30:19 ppbsd kernel: #11 0xffffffff810629de at fork_trampoline+0xe
Dec 22 16:30:19 ppbsd kernel: Uptime: 23h2m59s
Dec 22 16:30:19 ppbsd kernel: Dumping 3138 out of 16190 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%---<<BOOT>>---
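
To compare the three traces above, it helps to skip the panic and trap
handling frames (kdb_backtrace through calltrap), which are identical in every
log, and look at the first frame recorded after calltrap. The Python sketch
below does that for a pasted log; the function name first_frame_after_trap is
illustrative. Note that this frame is the nearest caller recorded in the
backtrace and not necessarily the exact faulting instruction.

```python
import re

# Matches one backtrace entry such as "#7 0xffffffff821a4d3e at dbuf_write_done+0x9e"
# and captures the function name.
FRAME_RE = re.compile(r"#\d+\s+0x[0-9a-f]+\s+at\s+(\w+)\+0x[0-9a-f]+")


def first_frame_after_trap(log_text):
    """Return the function name of the frame listed right after calltrap, or None."""
    # Collapse all whitespace so entries that the mailer wrapped onto a
    # second line ("... at" / "taskqueue_run_locked+0x181") still match.
    flat = " ".join(log_text.split())
    names = FRAME_RE.findall(flat)
    if "calltrap" not in names:
        return None
    idx = names.index("calltrap") + 1
    return names[idx] if idx < len(names) else None


if __name__ == "__main__":
    sample = (
        "Dec 21 14:09:50 ppbsd kernel: #6 0xffffffff81061958 at calltrap+0x8\n"
        "Dec 21 14:09:50 ppbsd kernel: #7 0xffffffff80acc5d9 at nfscl_hasexpired+0x709\n"
    )
    print(first_frame_after_trap(sample))  # nfscl_hasexpired
```

Applied to the three logs, this gives dbuf_write_done, nfscl_hasexpired and
zio_execute, which is the "different trace every time" pattern described
earlier.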
