kern/72041: Deadlock when disk is destroyed while user process
brian at midstream.com
Thu Sep 23 11:30:27 PDT 2004
>Synopsis: Deadlock when disk is destroyed while user process closes
>Arrival-Date: Thu Sep 23 18:30:27 GMT 2004
>Originator: Brian Eng
>Environment:
FreeBSD lexington.midstream.com 5.2.1-RELEASE FreeBSD 5.2.1-RELEASE #9: Thu Sep 2 14:23:04 PDT 2004 brian at lexington.midstream.com:/usr/src/sys/i386/compile/BRIAN i386
>Description:
The deadlock is between the GEOM code and the CAM code. It occurred when a Fibre Channel cable was removed while a user process was still accessing a disk through it.
The system is set up to do a 'camcontrol rescan' whenever the HBA driver indicates that the storage devices in the system may have changed. 'camcontrol rescan' triggers a succession of SCSI commands that are driven by the cambio/camisr() software interrupt. When the cable was unplugged, cambio called disk_destroy() on the disks that were now lost. disk_destroy() in turn led to an attempt to acquire the GEOM topology lock in the g_event thread.
Meanwhile, the user app (dd) received an I/O error and closed the device. This led to a call to g_dev_close(), which acquired the topology lock and then called down to daclose(), which sent a SCSI SYNC_CACHE command and waited for the command to complete.
The SYNC_CACHE command completes, but cambio never reports the completion to the syscall, because cambio is itself blocked waiting for the lock that the syscall is holding.
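The cycle described above can be sketched as a small wait-for graph. This is only a model of my reading of the report, not kernel code: the enum labels, reported_state(), and in_deadlock() are hypothetical names made up for the illustration, and the edge from cambio to g_event assumes that disk_destroy() waits for the queued event to finish.

```c
#include <assert.h>
#include <stdio.h>

/* Labels for the three parties in the report; a model, not kernel code. */
enum actor { DD_CLOSE, CAMBIO, G_EVENT, NACTORS };
#define RUNNABLE (-1)

static const char *const actor_name[NACTORS] = {
    "dd close() syscall", "cambio swi", "g_event thread"
};

/* Fill waits_for[] with the blocked-on relationships from the report. */
static void reported_state(int waits_for[NACTORS])
{
    /* Holds the topology lock, waits for SYNC_CACHE completion. */
    waits_for[DD_CLOSE] = CAMBIO;
    /* Assumption: disk_destroy() waits for the queued event to finish. */
    waits_for[CAMBIO] = G_EVENT;
    /* Needs the topology lock that the syscall is holding. */
    waits_for[G_EVENT] = DD_CLOSE;
}

/* Return 1 if `start` lies on a wait-for cycle (deadlocked), else 0. */
static int in_deadlock(const int waits_for[NACTORS], int start)
{
    assert(start >= 0 && start < NACTORS);
    int cur = start;
    /* Each actor waits on at most one other, so NACTORS steps suffice. */
    for (int step = 0; step < NACTORS; step++) {
        cur = waits_for[cur];
        if (cur == RUNNABLE)
            return 0;
        if (cur == start)
            return 1;
    }
    return 0;
}

/* Print each actor and what it is blocked on. */
static void print_state(const int waits_for[NACTORS])
{
    for (int a = 0; a < NACTORS; a++)
        printf("%-20s -> %s\n", actor_name[a],
               waits_for[a] == RUNNABLE ? "(runnable)"
                                        : actor_name[waits_for[a]]);
}
```

Calling reported_state() on an int array of NACTORS entries and then in_deadlock() for any of the three actors reports a cycle. Marking CAMBIO as RUNNABLE breaks the cycle in this model.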
>How-To-Repeat:
Run 'camcontrol rescan' either continuously or upon driver notification of changes. Start a number of processes (I was using 'dd') reading from a removable disk, then remove the disk while the processes are running.
There may also be a similar scenario involving disk_create.
>Fix:
One perspective on this is that cambio inverts the layering: normally GEOM code calls CAM code, but in the 'camcontrol rescan' case CAM code calls GEOM code, so the locks are taken in the opposite order. Perhaps disk_destroy could just queue the request to g_event and not wait for completion.
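To illustrate the suggestion, here is a minimal userland sketch using POSIX threads. It is not a patch and uses no FreeBSD kernel interfaces: the two mutexes, the flags, and all three functions are hypothetical stand-ins. The interrupt-side thread queues the destroy request and returns without waiting, so it remains free to report the I/O completion. The closing thread can then drop the topology lock, and the event thread performs the destroy afterwards.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* All names below are hypothetical stand-ins, not FreeBSD kernel APIs. */
static pthread_mutex_t topology = PTHREAD_MUTEX_INITIALIZER; /* topology lock */
static pthread_mutex_t state = PTHREAD_MUTEX_INITIALIZER;    /* guards flags */
static pthread_cond_t changed = PTHREAD_COND_INITIALIZER;
static bool destroy_queued, disk_destroyed, sync_done;

/* Models g_event: wait for a queued destroy, run it under the topology lock. */
static void *event_thread(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&state);
    while (!destroy_queued)
        pthread_cond_wait(&changed, &state);
    pthread_mutex_unlock(&state);

    pthread_mutex_lock(&topology);   /* blocks until the closer lets go */
    pthread_mutex_lock(&state);
    disk_destroyed = true;
    pthread_mutex_unlock(&state);
    pthread_mutex_unlock(&topology);
    return NULL;
}

/* Models cambio: queue the destroy WITHOUT waiting, then complete the I/O. */
static void *interrupt_thread(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&state);
    destroy_queued = true;           /* fire and forget */
    sync_done = true;                /* reachable because nothing blocked us */
    pthread_cond_broadcast(&changed);
    pthread_mutex_unlock(&state);
    return NULL;
}

/* Models the close path: hold the topology lock while waiting for the I/O.
 * Returns 1 if both the I/O completion and the destroy happened. */
static int run_deferred_destroy_demo(void)
{
    pthread_t ev, intr;
    int rc;

    destroy_queued = disk_destroyed = sync_done = false;
    pthread_mutex_lock(&topology);
    rc = pthread_create(&ev, NULL, event_thread, NULL);
    assert(rc == 0);
    rc = pthread_create(&intr, NULL, interrupt_thread, NULL);
    assert(rc == 0);
    (void)rc;

    pthread_mutex_lock(&state);
    while (!sync_done)               /* wait for the "SYNC_CACHE" completion */
        pthread_cond_wait(&changed, &state);
    pthread_mutex_unlock(&state);
    pthread_mutex_unlock(&topology); /* only now can the destroy proceed */

    pthread_join(intr, NULL);
    pthread_join(ev, NULL);
    return sync_done && disk_destroyed;
}
```

If interrupt_thread() instead waited for disk_destroyed before setting sync_done, this model would hang in the same way as the reported deadlock. Whether callers of disk_destroy can tolerate the destroy finishing later is a separate question that this sketch does not address.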