Kernel deadlocks on 14.3-STABLE with 100GbE card
Date: Tue, 29 Jul 2025 06:06:46 UTC
Hi!
This is now the fourth time our server has had to be hard-rebooted, with the last two reboots within a span of two hours.
The server had been in production for only a week.
#uname -aKU
FreeBSD mpop-frv62.fwdcdn.com 14.3-STABLE FreeBSD 14.3-STABLE stable/14-n271907-7a96c75098af FRV amd64 1403502 1403502
We use a 100GbE card:
mlx5_core0@pci0:33:0:0: class=0x020000 rev=0x00 hdr=0x00 vendor=0x15b3 device=0x101d subvendor=0x15d9 subdevice=0x1c32
vendor = 'Mellanox Technologies'
device = 'MT2892 Family [ConnectX-6 Dx]'
class = network
subclass = ethernet
It all starts with one of the nginx processes getting stuck in a 100% CPU loop. Then, within a matter of minutes, more processes enter this
state and the server becomes completely unresponsive and has to be rebooted. During this time any attempt to elevate privileges to root
(e.g. sudo, su) simply leaves the shell frozen in an unkillable state. Luckily, we were able to capture kernel stack traces of one
such nginx process. These are samples collected manually at irregular intervals of a few seconds (see the collection loop sketched after the samples):
PID TID COMM TDNAME KSTACK
88996 105477 nginx - sbflush+0x48 tcp_disconnect+0x63 tcp_usr_disconnect+0x77 soclose+0x75 _fdrop+0x11 closef+0x24a closefp_impl+0x58 amd64_syscall+0x117 fast_syscall_common+0xf8
# procstat -kk 88996
PID TID COMM TDNAME KSTACK
88996 105477 nginx - sbflush+0x48 tcp_disconnect+0x63 tcp_usr_disconnect+0x77 soclose+0x75 _fdrop+0x11 closef+0x24a closefp_impl+0x58 amd64_syscall+0x117 fast_syscall_common+0xf8
# procstat -kk 88996
PID TID COMM TDNAME KSTACK
88996 105477 nginx - sbflush+0x48 tcp_disconnect+0x63 tcp_usr_disconnect+0x77 soclose+0x75 _fdrop+0x11 closef+0x24a closefp_impl+0x58 amd64_syscall+0x117 fast_syscall_common+0xf8
# procstat -kk 88996
PID TID COMM TDNAME KSTACK
88996 105477 nginx - sbflush+0x48 tcp_disconnect+0x63 tcp_usr_disconnect+0x77 soclose+0x75 _fdrop+0x11 closef+0x24a closefp_impl+0x58 amd64_syscall+0x117 fast_syscall_common+0xf8
# procstat -kk 88996
PID TID COMM TDNAME KSTACK
88996 105477 nginx - tcp_disconnect+0x63 tcp_usr_disconnect+0x77 soclose+0x75 _fdrop+0x11 closef+0x24a closefp_impl+0x58 amd64_syscall+0x117 fast_syscall_common+0xf8
# procstat -kk 88996
PID TID COMM TDNAME KSTACK
88996 105477 nginx - sbflush+0x48 tcp_disconnect+0x63 tcp_usr_disconnect+0x77 soclose+0x75 _fdrop+0x11 closef+0x24a closefp_impl+0x58 amd64_syscall+0x117 fast_syscall_common+0xf8
# procstat -kk 88996
PID TID COMM TDNAME KSTACK
88996 105477 nginx - sbflush+0x48 tcp_disconnect+0x63 tcp_usr_disconnect+0x77 soclose+0x75 _fdrop+0x11 closef+0x24a closefp_impl+0x58 amd64_syscall+0x117 fast_syscall_common+0xf8
# procstat -kk 88996
PID TID COMM TDNAME KSTACK
88996 105477 nginx - sbflush+0x48 tcp_disconnect+0x63 tcp_usr_disconnect+0x77 soclose+0x75 _fdrop+0x11 closef+0x24a closefp_impl+0x58 amd64_syscall+0x117 fast_syscall_common+0xf8
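The samples above were gathered by hand; a trivial loop like the one below reproduces the same collection (the PID 88996 is hard-coded for this particular incident):
#!/bin/sh
# Sample the kernel stack of the spinning nginx worker every few seconds.
PID=88996
while true; do
    date -u
    procstat -kk "$PID"
    sleep 3
done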
It seems that the thread exits sbflush() but then enters it again, repeating this in an endless loop.
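If it would help, we can also count sbflush() entries from that thread with DTrace to confirm the re-entry rate. A rough sketch we have not run yet, assuming the fbt provider is available on this kernel and sbflush() is not inlined:
# Count sbflush() calls made by the stuck nginx process, printed every 5 seconds.
dtrace -n 'fbt::sbflush:entry /pid == 88996/ { @calls = count(); } tick-5s { printa(@calls); clear(@calls); }'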
# top
last pid: 55529; load averages: 4.05, 4.15, 4.84 up 3+23:02:37 18:12:59
3908 threads: 52 running, 2485 sleeping, 1357 waiting, 14 lock
CPU: 0.1% user, 0.0% nice, 4.3% system, 0.1% interrupt, 95.5% idle
Mem: 15G Active, 63G Inact, 154M Laundry, 575G Wired, 104K Buf, 95G Free
ARC: 512G Total, 219G MFU, 275G MRU, 822K Anon, 5059M Header, 13G Other
446G Compressed, 926G Uncompressed, 2.08:1 Ratio
PID USERNAME PRI NICE SIZE RES STATE C TIME WCPU COMMAND
88996 10 -60 0 292M 70M CPU46 46 17:13 99.49% nginx: worker process (nginx)
88995 10 23 0 292M 73M CPU33 33 4:13 99.49% nginx: worker process (nginx)
# netstat -m
3301610/61105/3362715 mbufs in use (current/cache/total)
3297555/20133/3317688/49014858 mbuf clusters in use (current/cache/total/max)
3294483/19963 mbuf+clusters out of packet secondary zone in use (current/cache)
1449/28019/29468/24507429 4k (page size) jumbo clusters in use (current/cache/total/max)
0/0/0/7261460 9k jumbo clusters in use (current/cache/total/max)
0/0/0/4084571 16k jumbo clusters in use (current/cache/total/max)
7426308K/167618K/7593926K bytes allocated to network (current/cache/total)
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
0/0/0 requests for jumbo clusters denied (4k/9k/16k)
0 sendfile syscalls
0 sendfile syscalls completed without I/O request
0 requests for I/O initiated by sendfile
0 pages read by sendfile as part of a request
0 pages were valid at time of a sendfile request
0 pages were valid and substituted to bogus page
0 pages were requested for read ahead by applications
0 pages were read ahead by sendfile
0 times sendfile encountered an already busy page
0 requests for sfbufs denied
0 requests for sfbufs delayed
# cat /etc/sysctl.conf.local
security.jail.allow_raw_sockets=1
security.jail.sysvipc_allowed=1
security.jail.socket_unixiproute_only=0
security.jail.chflags_allowed=1
kern.threads.max_threads_per_proc=10000
kern.ipc.somaxconn=30000
kern.ipc.soacceptqueue=30000
kern.corefile=/tmp/%N-%P.core
kern.sugid_coredump=1
vm.swap_enabled=0
vm.v_free_target=10485760
net.inet.icmp.icmplim=20000
net.inet.tcp.delayed_ack=0
net.inet.tcp.nolocaltimewait=1
net.inet.tcp.fast_finwait2_recycle=1
net.inet.tcp.finwait2_timeout=3000
net.inet.tcp.msl=7500
net.inet.ip.portrange.randomized=0
net.inet.ip.portrange.first=1000
net.inet.udp.maxdgram=131072
net.inet.udp.recvspace=1048576
net.inet.tcp.sendbuf_max=67108864
net.inet.tcp.recvbuf_max=67108864
net.inet.tcp.recvspace=131072 # (default 65536)
net.inet.tcp.sendbuf_inc=65536 # (default 8192)
net.inet.tcp.sendspace=131072 # (default 32768)
net.inet.tcp.mssdflt=1460
net.inet.tcp.minmss=536
net.inet.tcp.syncache.rexmtlimit=0
### See https://reviews.freebsd.org/D20980 https://lists.freebsd.org/pipermail/freebsd-net/2019-July/053892.html
net.inet.tcp.ts_offset_per_conn=0
## PF now has a maximum table entry count (default 65535), see https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=235076
## otherwise you can get an error like: pfctl: Unknown error: -1.
net.pf.request_maxcount=1000000
kern.ipc.maxsockbuf=157286400 # (wscale 12)
# cat /boot/loader.conf
aesni_load="YES"
cryptodev_load="YES"
zfs_load="YES"
ipmi_load="YES"
if_lagg_load="YES"
cpuctl_load="YES"
amdtemp_load="YES"
kern.geom.label.disk_ident.enable="0"
kern.geom.label.gptid.enable="1"
net.inet.tcp.hostcache.cachelimit=32768
net.link.ifqmaxlen="2048"
##net.isr.maxthreads="-1"
net.isr.defaultqlimit="8192"
#net.isr.maxqlimit="40960"
kern.msgbufsize="262144"
vfs.zfs.arc_max="512G"
vfs.zfs.arc_min="486G"
We would appreciate your help or any suggestions on how to work around this issue.
We are open to any requests regarding additional data to be collected or kernel options to be tuned.
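On the next occurrence, before rebooting, we plan to snapshot at least the following while the machine still responds (output paths are just examples):
# Kernel stacks of all threads plus mbuf/socket zone state.
procstat -kk -a > /var/tmp/kstacks.$(date +%s).txt
netstat -m > /var/tmp/netstat-m.$(date +%s).txt
vmstat -z | grep -E 'mbuf|socket' > /var/tmp/vmstat-z.$(date +%s).txt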