[Bug 293382] Dead lock and kernel crash around closefp_impl

From: <bugzilla-noreply_at_freebsd.org>
Date: Tue, 14 Apr 2026 09:59:12 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=293382

--- Comment #69 from Paul <devgs@ukr.net> ---
Hi!

Unfortunately, it is still crashing. Given the input from other people above,
could it be that we're looking for the issue in the wrong place? What if the
issue is actually related to some race with file descriptors, and not to
kqueue at all? File descriptors have a known `feature` of being reused. What if
there is some concurrent (not protected by a lock) `close`+`fd-allocation`
messing things up? Or just a simple ABA problem, where an FD has the same value
but represents different entities in A and B-->A. Modern CPUs may just be
faster, or just have higher IPC, making the hidden issue reproducible.
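
To make the reuse point concrete, here is a minimal user-space sketch. It is
only an illustration of descriptor-number reuse, not a reproduction of the
kernel race; the helper name `reopen_after_close` is made up for this example,
and it assumes a single-threaded process where nothing else opens or closes
descriptors in between.

```python
import os

def reopen_after_close(path=os.devnull):
    """Open a descriptor, close it, open again; return both numbers.

    Hypothetical helper for illustration only. The allocator hands out
    the lowest free number, so with no other descriptor activity in
    between the second open is expected to return the number that was
    just closed.
    """
    first = os.open(path, os.O_RDONLY)
    os.close(first)
    # Anything that still remembers `first` now holds a stale number:
    # it is about to name a different open file.
    second = os.open(path, os.O_RDONLY)
    os.close(second)
    return first, second

first, second = reopen_after_close()
print("first:", first, "second:", second, "same number:", first == second)
```

Any bookkeeping keyed only by the integer cannot tell the two opens apart,
which is the A / B-->A situation described above.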

It is still hard to believe this is a hardware issue, given that it should then
have occurred in a LOT more places, if true. Like in user space. We use hash
tables and vectors that do rebalancing (and growth) occasionally, leading to
multi-megabyte `memcpy` copies. Then, there are other OSes, like Linux, which is
objectively more popular and statistically has more of this hardware running
in the wild. I was unable to google anything related to Linux, apart from RDSEED.

Lately, I'm asking myself: why is this place, the kqueue, somehow special? We
know that even small ECC failures inevitably lead to system crashes (we have
experienced bad memory in the past): statistically, a bit flip will
inevitably lead to a butterfly effect somewhere, especially in the kernel.
And here it's definitely more than just a single bit flip. Still, it manifests
itself in basically the same place again and again. Is this because that's
the only place where the kernel does a memcpy of this large a number of bytes?

Sorry for the rant. And also, sorry for being such a parrot, but... 538163 −
268851 = 263 * 1024. Here's the info:

Unread portion of the kernel message buffer:
panic: Assertion kn->kn_kq == kq failed at /usr/src/sys/kern/kern_event.c:1743
cpuid = 11
time = 1776154383
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe069b3f0820
vpanic() at vpanic+0x136/frame 0xfffffe069b3f0950
panic() at panic+0x43/frame 0xfffffe069b3f09b0
kqueue_register() at kqueue_register+0x6be/frame 0xfffffe069b3f0a30
kqueue_kevent() at kqueue_kevent+0xc9/frame 0xfffffe069b3f0c90
kern_kevent_fp() at kern_kevent_fp+0x9b/frame 0xfffffe069b3f0ce0
kern_kevent() at kern_kevent+0x82/frame 0xfffffe069b3f0d40
kern_kevent_generic() at kern_kevent_generic+0x70/frame 0xfffffe069b3f0da0
sys_kevent() at sys_kevent+0x61/frame 0xfffffe069b3f0e00
amd64_syscall() at amd64_syscall+0x169/frame 0xfffffe069b3f0f30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe069b3f0f30
--- syscall (560, FreeBSD ELF64, kevent), rip = 0x82dabd3ea, rsp = 0x85da7aad8,
rbp = 0x85da7abc0 ---
KDB: enter: panic

(kgdb) bt
#0  __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:57
#1  doadump (textdump=0) at /usr/src/sys/kern/kern_shutdown.c:399
#2  0xffffffff804bd798 in db_fncall_generic (nargs=0, args=0xfffffe069b3f0240,
addr=<optimized out>, rv=<optimized out>) at /usr/src/sys/ddb/db_command.c:631
#3  db_fncall (dummy1=<optimized out>, dummy2=<optimized out>,
dummy3=<optimized out>, dummy4=<optimized out>) at
/usr/src/sys/ddb/db_command.c:679
#4  0xffffffff804bd21d in db_command (last_cmdp=<optimized out>,
cmd_table=<optimized out>, dopager=false) at /usr/src/sys/ddb/db_command.c:508
#5  0xffffffff804bd366 in db_command_script
(command=command@entry=0xffffffff81bd7722 <db_recursion_data+18> "call
doadump") at /usr/src/sys/ddb/db_command.c:573
#6  0xffffffff804c3148 in db_script_exec
(scriptname=scriptname@entry=0xfffffe069b3f0410 "kdb.enter.panic",
warnifnotfound=warnifnotfound@entry=0) at /usr/src/sys/ddb/db_script.c:301
#7  0xffffffff804c3042 in db_script_kdbenter (eventname=<optimized out>) at
/usr/src/sys/ddb/db_script.c:323
#8  0xffffffff804c08d1 in db_trap (type=<optimized out>, code=<optimized out>)
at /usr/src/sys/ddb/db_main.c:266
#9  0xffffffff80c2c86f in kdb_trap (type=type@entry=3, code=code@entry=0,
tf=tf@entry=0xfffffe069b3f0760) at /usr/src/sys/kern/subr_kdb.c:790
#10 0xffffffff8113a8ed in trap (frame=<optimized out>) at
/usr/src/sys/amd64/amd64/trap.c:697
#11 <signal handler called>
#12 kdb_enter (why=<optimized out>, msg=<optimized out>) at
/usr/src/sys/kern/subr_kdb.c:556
#13 0xffffffff80bd97eb in vpanic (fmt=0xffffffff812fcb51 "Assertion %s failed
at %s:%d", ap=ap@entry=0xfffffe069b3f0990) at
/usr/src/sys/kern/kern_shutdown.c:962
#14 0xffffffff80bd9653 in panic (fmt=0xffffffff81da22b0 <cnputs_mtx>
"\026\217\"\201\377\377\377\377") at /usr/src/sys/kern/kern_shutdown.c:887
#15 0xffffffff80b78d1e in kqueue_register (kq=kq@entry=0xff0100010b64ae00,
kev=kev@entry=0xfffffe069b3f0a40, td=td@entry=0xff010002d36ca000,
mflag=mflag@entry=2) at /usr/src/sys/kern/kern_event.c:1743
#16 0xffffffff80b79f89 in kqueue_kevent (kq=kq@entry=0xff0100010b64ae00,
td=td@entry=0xff010002d36ca000, nchanges=nchanges@entry=1,
nevents=nevents@entry=0, k_ops=k_ops@entry=0xfffffe069b3f0de0,
timeout=timeout@entry=0x0) at /usr/src/sys/kern/kern_event.c:1509
#17 0xffffffff80b79e4b in kern_kevent_fp (td=td@entry=0xff010002d36ca000,
fp=<optimized out>, nchanges=nchanges@entry=1, nevents=nevents@entry=0,
k_ops=k_ops@entry=0xfffffe069b3f0de0, timeout=timeout@entry=0x0) at
/usr/src/sys/kern/kern_event.c:1540
#18 0xffffffff80b79d62 in kern_kevent (td=td@entry=0xff010002d36ca000,
fd=<optimized out>, nchanges=1, nevents=0,
k_ops=k_ops@entry=0xfffffe069b3f0de0, timeout=timeout@entry=0x0) at
/usr/src/sys/kern/kern_event.c:1480
#19 0xffffffff80b79a60 in kern_kevent_generic (td=0xff010002d36ca000,
uap=uap@entry=0xfffffe069b3f0db0, k_ops=k_ops@entry=0xfffffe069b3f0de0,
struct_name=0xffffffff8131134d "kevent") at /usr/src/sys/kern/kern_event.c:1336
#20 0xffffffff80b79951 in sys_kevent (td=0xffffffff81da22b0 <cnputs_mtx>,
uap=<optimized out>) at /usr/src/sys/kern/kern_event.c:1309
#21 0xffffffff8113b729 in syscallenter (td=0xff010002d36ca000) at
/usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:193
#22 amd64_syscall (td=0xff010002d36ca000, traced=0) at
/usr/src/sys/amd64/amd64/trap.c:1267
#23 <signal handler called>
#24 0x000000082dabd3ea in ?? ()
Backtrace stopped: Cannot access memory at address 0x85da7aad8
(kgdb) fr 15
#15 0xffffffff80b78d1e in kqueue_register (kq=kq@entry=0xff0100010b64ae00,
kev=kev@entry=0xfffffe069b3f0a40, td=td@entry=0xff010002d36ca000,
mflag=mflag@entry=2) at /usr/src/sys/kern/kern_event.c:1743
1743                                    MPASS(kn->kn_kq == kq);
(kgdb) p *kq
$1 = {
  kq_lock = {
    lock_object = {
      lo_name = 0xffffffff8134f841 "kqueue",
      lo_flags = 21168128,
      lo_data = 0,
      lo_witness = 0xff0100804bd8db80
    },
    mtx_lock = 18374967966785380352
  },
  kq_refcnt = 1,
  kq_list = {
    tqe_next = 0xff010002070f7400,
    tqe_prev = 0xff010001ed1d8128
  },
  kq_head = {
    tqh_first = 0xff0100383fb26000,
    tqh_last = 0xff0100383fb26018
  },
  kq_count = 1,
  kq_sel = {
    si_tdlist = {
      tqh_first = 0x0,
      tqh_last = 0x0
    },
    si_note = {
      kl_list = {
        slh_first = 0x0
      },
      kl_lock = 0xffffffff80b7ad20 <knlist_mtx_lock>,
      kl_unlock = 0xffffffff80b7ad40 <knlist_mtx_unlock>,
      kl_assert_lock = 0xffffffff80b7ad60 <knlist_mtx_assert_lock>,
      kl_lockarg = 0xff0100010b64ae00,
      kl_autodestroy = 0
    },
    si_mtx = 0x0
  },
  kq_sigio = 0x0,
  kq_fdp = 0xfffffe0698e57c90,
  kq_state = 0,
  kq_knlistsize = 661248,
  kq_knlist = 0xfffffe0a430de000,
  kq_knhashmask = 0,
  kq_knhash = 0x0,
  kq_task = {
    ta_link = {
      stqe_next = 0x0
    },
    ta_pending = 0,
    ta_priority = 0 '\000',
    ta_flags = 0 '\000',
    ta_func = 0xffffffff80b7d500 <kqueue_task>,
    ta_context = 0xff0100010b64ae00
  },
  kq_cred = 0xff010001f3643d80,
  kq_forksrc = 0x0
}
(kgdb) p *((struct eknote*)kn)
No struct type named eknote.
(kgdb) p *kn
$2 = {
  kn_link = {
    sle_next = 0x0
  },
  kn_selnext = {
    sle_next = 0x0
  },
  kn_knlist = 0xff01006ea9f9d838,
  kn_tqe = {
    tqe_next = 0xffffffffffffffff,
    tqe_prev = 0xffffffffffffffff
  },
  kn_kq = 0xff010002c5175c00,
  kn_kevent = {
    ident = 538163,
    filter = -1,
    flags = 32,
    fflags = 0,
    data = 0,
    udata = 0x32b8c7fc9400,
    ext = {0, 0, 0, 0}
  },
  kn_hook = 0x0,
  kn_hookid = 0,
  kn_status = 0,
  kn_influx = 0,
  kn_sfflags = 0,
  kn_sdata = 0,
  kn_ptr = {
    p_fp = 0xff01006da1fcbb90,
    p_proc = 0xff01006da1fcbb90,
    p_aio = 0xff01006da1fcbb90,
    p_lio = 0xff01006da1fcbb90,
    p_prison = 0xff01006da1fcbb90,
    p_v = 0xff01006da1fcbb90
  },
  kn_fop = 0xffffffff814e7120 <soread_filtops>
}
(kgdb) fr 18
#18 0xffffffff80b79d62 in kern_kevent (td=td@entry=0xff010002d36ca000,
fd=<optimized out>, nchanges=1, nevents=0,
k_ops=k_ops@entry=0xfffffe069b3f0de0, timeout=timeout@entry=0x0) at
/usr/src/sys/kern/kern_event.c:1480
1480            error = kern_kevent_fp(td, fp, nchanges, nevents, k_ops,
timeout);
(kgdb) p fd
$3 = <optimized out>
(kgdb) up
#19 0xffffffff80b79a60 in kern_kevent_generic (td=0xff010002d36ca000,
uap=uap@entry=0xfffffe069b3f0db0, k_ops=k_ops@entry=0xfffffe069b3f0de0,
struct_name=0xffffffff8131134d "kevent") at /usr/src/sys/kern/kern_event.c:1336
1336            error = kern_kevent(td, uap->fd, uap->nchanges, uap->nevents,
(kgdb) p *uap
$4 = {
  fd = 17,
  changelist = 0x85da7aae0,
  nchanges = 1,
  eventlist = 0x0,
  nevents = 0,
  timeout = 0x0
}
(kgdb) fr 15
#15 0xffffffff80b78d1e in kqueue_register (kq=kq@entry=0xff0100010b64ae00,
kev=kev@entry=0xfffffe069b3f0a40, td=td@entry=0xff010002d36ca000,
mflag=mflag@entry=2) at /usr/src/sys/kern/kern_event.c:1743
1743                                    MPASS(kn->kn_kq == kq);
(kgdb) p *kev
$5 = {
  ident = 268851,
  filter = -1,
  flags = 2,
  fflags = 0,
  data = 0,
  udata = 0x0,
  ext = {0, 0, 0, 0}
}
(kgdb) p kn->kn_kq
$6 = (struct kqueue *) 0xff010002c5175c00
(kgdb) p kq
$7 = (struct kqueue *) 0xff0100010b64ae00
(kgdb) p *kn->kn_kq
$8 = {
  kq_lock = {
    lock_object = {
      lo_name = 0xffffffff8134f841 "kqueue",
      lo_flags = 21168128,
      lo_data = 0,
      lo_witness = 0xff0100804bd8db80
    },
    mtx_lock = 0
  },
  kq_refcnt = 1,
  kq_list = {
    tqe_next = 0x0,
    tqe_prev = 0xff010002070f7428
  },
  kq_head = {
    tqh_first = 0x0,
    tqh_last = 0xff010002c5175c38
  },
  kq_count = 0,
  kq_sel = {
    si_tdlist = {
      tqh_first = 0x0,
      tqh_last = 0x0
    },
    si_note = {
      kl_list = {
        slh_first = 0x0
      },
      kl_lock = 0xffffffff80b7ad20 <knlist_mtx_lock>,
      kl_unlock = 0xffffffff80b7ad40 <knlist_mtx_unlock>,
      kl_assert_lock = 0xffffffff80b7ad60 <knlist_mtx_assert_lock>,
      kl_lockarg = 0xff010002c5175c00,
      kl_autodestroy = 0
    },
    si_mtx = 0x0
  },
  kq_sigio = 0x0,
  kq_fdp = 0xfffffe0698e57c90,
  kq_state = 2,
  kq_knlistsize = 661248,
  kq_knlist = 0xfffffe0a769cc000,
  kq_knhashmask = 0,
  kq_knhash = 0x0,
  kq_task = {
    ta_link = {
      stqe_next = 0x0
    },
    ta_pending = 0,
    ta_priority = 0 '\000',
    ta_flags = 0 '\000',
    ta_func = 0xffffffff80b7d500 <kqueue_task>,
    ta_context = 0xff010002c5175c00
  },
  kq_cred = 0xff010001f3643d80,
  kq_forksrc = 0x0
}
(kgdb) p kq->kq_knlist[kev->ident]
$9 = {
  slh_first = 0xff01005e2a3fe5a0
}
(kgdb) p kn
$12 = (struct knote *) 0xff01005e2a3fe5a0
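
The ident arithmetic mentioned before the dump can be checked against the
values printed above. This is a rough sketch: the slot size of 8 bytes (one
pointer per `kq_knlist` entry, as the single `slh_first` member suggests) and
the 4096-byte page size are assumptions for amd64, not something read from the
dump, and `slot_distance` is a hypothetical helper name.

```python
KEV_IDENT = 268851     # kev->ident, the descriptor being registered ($5)
KN_IDENT = 538163      # kn->kn_kevent.ident, found in that slot ($2)
KNLIST_SIZE = 661248   # kq->kq_knlistsize ($1)
SLOT_BYTES = 8         # ASSUMPTION: one pointer per kq_knlist slot on amd64
PAGE_BYTES = 4096      # ASSUMPTION: base page size on amd64

def slot_distance(a, b, slot_bytes=SLOT_BYTES):
    """Return (distance in slots, distance in bytes) between two idents."""
    delta = abs(a - b)
    return delta, delta * slot_bytes

delta, byte_gap = slot_distance(KEV_IDENT, KN_IDENT)

# Distance in slots, and how it splits into multiples of 1024.
print("slots apart:", delta, divmod(delta, 1024))
# Distance in bytes, and how it splits into pages under the assumptions above.
print("bytes apart:", byte_gap, divmod(byte_gap, PAGE_BYTES))
# Both idents are valid indexes into a list of this size.
print("in range:", KEV_IDENT < KNLIST_SIZE, KN_IDENT < KNLIST_SIZE)
```

Under those assumptions, the slot where the knote was found and the slot where
it would belong are a whole number of pages apart in the array.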


Please tell us if you need anything else.

By the way, as the RAM sizes were already mentioned above, it would be
appropriate to share ours. That specific server has 8 x 64 GB = 512 GB RAM. We
have servers that do a lot more I/O and memory manipulation, including
`memcpy` of more than 10 MiB, sometimes in the hundreds of MiB, and that have
12 x 64 GB = 768 GB RAM, without exhibiting any issues whatsoever. All of them
have the same CPU.
