Re: _mtx_lock_sleep: recursed on non-recursive mutex CAM device lock @ /..../sys/cam/nvme/nvme_da.c:469
Date: Wed, 22 May 2024 18:53:04 UTC
First order: it looks like we're trying to schedule a trim, but the
malloc for it fails. Since it's a malloc failure, we then wind up
trying to automatically reschedule this I/O, which recurses into the
driver with the periph lock still held, and boop. Can you reproduce
this? If so, can you test this patch?

diff --git a/sys/cam/nvme/nvme_da.c b/sys/cam/nvme/nvme_da.c
index 3f6cf8702870..357e612200e9 100644
--- a/sys/cam/nvme/nvme_da.c
+++ b/sys/cam/nvme/nvme_da.c
@@ -1077,7 +1077,9 @@ ndastart(struct cam_periph *periph, union ccb *start_ccb)
 		trim = malloc(sizeof(*trim), M_NVMEDA, M_ZERO | M_NOWAIT);
 		if (trim == NULL) {
+			cam_periph_unlock(periph);
 			biofinish(bp, NULL, ENOMEM);
+			cam_periph_lock(periph);
 			xpt_release_ccb(start_ccb);
 			ndaschedule(periph);
 			return;

(the mailer may mangle it, so I've also attached it in case people
want to comment on this).

The logic here is that we have special handling in the ENOMEM case
that recursively calls the start routine, which calls the scheduler,
which expects to be able to take the periph lock. But it can't,
because the lock is already held. It also invokes a pacing protocol
that slows things down even more.

What I'm not sure about now is whether we only need to release
start_ccb, or whether we need to call ndaschedule here too. It seems
like we might not want to (but it's a safe nop if not).

I've cc'd mav@ to see if he has opinions on what's going on.
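To spell out the locking, here is the same error path with the patch
applied and annotated. This is only a fragment of ndastart() (the rest
of the function is elided), and the comments are my reading of the
backtrace below:

	trim = malloc(sizeof(*trim), M_NVMEDA, M_ZERO | M_NOWAIT);
	if (trim == NULL) {
		/*
		 * Completing the bio with ENOMEM while still holding the
		 * periph lock re-enters this driver: g_io_deliver() applies
		 * its ENOMEM recovery and re-issues the I/O via
		 * g_io_request() -> g_disk_start() -> ndastrategy(), and
		 * ndastrategy() tries to take the periph lock again.  That
		 * lock is not recursive, so drop it around the completion.
		 */
		cam_periph_unlock(periph);
		biofinish(bp, NULL, ENOMEM);
		cam_periph_lock(periph);
		xpt_release_ccb(start_ccb);
		/* Possibly redundant, but a safe nop if so (see above). */
		ndaschedule(periph);
		return;
	}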
Warner

On Wed, May 22, 2024 at 11:22 AM Alexander Leidinger
<Alexander@leidinger.net> wrote:

> Hi,
>
> I got the panic in $Subject. Anyone an idea?
>
> Complete crashlog available at https://wiki.leidinger.net/core.txt.6
> (1.2 MB)
>
> Short version:
> ---snip---
> [11417] KDB: stack backtrace:
> [11417] db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe043133f830
> [11417] vpanic() at vpanic+0x13f/frame 0xfffffe043133f960
> [11417] panic() at panic+0x43/frame 0xfffffe043133f9c0
> [11417] __mtx_lock_sleep() at __mtx_lock_sleep+0x491/frame 0xfffffe043133fa50
> [11417] __mtx_lock_flags() at __mtx_lock_flags+0x9c/frame 0xfffffe043133fa70
> [11417] ndastrategy() at ndastrategy+0x3c/frame 0xfffffe043133faa0
> [11417] g_disk_start() at g_disk_start+0x569/frame 0xfffffe043133fb00
> [11417] g_io_request() at g_io_request+0x2b6/frame 0xfffffe043133fb30
> [11417] g_io_deliver() at g_io_deliver+0x1cc/frame 0xfffffe043133fb80
> [11417] g_disk_done() at g_disk_done+0xee/frame 0xfffffe043133fbc0
> [11417] ndastart() at ndastart+0x4a3/frame 0xfffffe043133fc20
> [11417] xpt_run_allocq() at xpt_run_allocq+0xa5/frame 0xfffffe043133fc70
> [11417] ndastrategy() at ndastrategy+0x6d/frame 0xfffffe043133fca0
> [11417] g_disk_start() at g_disk_start+0x569/frame 0xfffffe043133fd00
> [11417] g_io_request() at g_io_request+0x2b6/frame 0xfffffe043133fd30
> [11417] g_io_request() at g_io_request+0x2b6/frame 0xfffffe043133fd60
> [11417] g_io_request() at g_io_request+0x2b6/frame 0xfffffe043133fd90
> [11417] vdev_geom_io_start() at vdev_geom_io_start+0x257/frame 0xfffffe043133fdc0
> [11417] zio_vdev_io_start() at zio_vdev_io_start+0x321/frame 0xfffffe043133fe10
> [11417] zio_execute() at zio_execute+0x78/frame 0xfffffe043133fe40
> [11417] taskqueue_run_locked() at taskqueue_run_locked+0x1c7/frame 0xfffffe043133fec0
> [11417] taskqueue_thread_loop() at taskqueue_thread_loop+0xd3/frame 0xfffffe043133fef0
> ---snip---
>
> This is with a world from 2024-05-17-084543.
>
> Bye,
> Alexander.
>
> --
> http://www.Leidinger.net Alexander@Leidinger.net: PGP 0x8F31830F9F2772BF
> http://www.FreeBSD.org  netchild@FreeBSD.org  : PGP 0x8F31830F9F2772BF
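For reference, the panic string in $Subject is the kernel's generic
complaint when a thread acquires a mutex it already owns and the mutex
was not created with MTX_RECURSE; the message itself ("non-recursive
mutex CAM device lock") tells us the periph's underlying device mutex
is a plain MTX_DEF lock. A minimal, hypothetical sketch of that
failure mode (invented names, not code from this thread):

	#include <sys/param.h>
	#include <sys/lock.h>
	#include <sys/mutex.h>

	static struct mtx demo_mtx;

	static void
	demo_recurse(void)
	{
		/* MTX_DEF without MTX_RECURSE: recursion is fatal. */
		mtx_init(&demo_mtx, "demo lock", NULL, MTX_DEF);

		mtx_lock(&demo_mtx);

		/*
		 * Second acquisition by the same thread.  This mirrors the
		 * re-entered ndastrategy() at nvme_da.c:469 taking the
		 * periph lock that ndastart() still holds, and it panics:
		 * "_mtx_lock_sleep: recursed on non-recursive mutex ..."
		 */
		mtx_lock(&demo_mtx);
	}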