usb/46176: Background information

Henrik Gulbrandsen henrik at gulbra.net
Sun Jan 27 10:44:36 PST 2008


As promised, here is some extra background information to help people
understand my latest patch attempt for usb/46176, just to make you a
little less nervous when you look at it. I will post a link to this
message at the usual place ( http://www.gulbra.net/freebsd-usb/ ), so
you can find it when it's needed. Now, back to our regular program:

On Sat, 2008-01-26 at 20:40 -0700, M. Warner Losh wrote:
> And my time this month is very limited, as will next month's because
> work is being insane.

I appreciate the fact that you still take the time to comment, since I
would normally get a long email backlog under similar circumstances.

Anyway, whenever you or someone else gets the time to think about this
again, it might be a good idea to have a look at the scsi_da.c patch.
This email will give some extra motivation for it. My memory is already
starting to fade, but according to my handwritten notes, the chain of
events is as follows:

The umass detach correctly triggers the following sequence:

    umass_cam_detach_sim(umass_softc *sc)
        xpt_bus_deregister(cam_sim_path(sc->umass_sim))
        cam_sim_free(ss->umass_sim, TRUE)

...and the only serious bug remaining is that a later unmount will try
to close the da device via daclose, which will access the SIM that has
already been released. Actually, it calls cam_periph_lock(), which will
try to lock a mutex located in the SIM struct.

Now, as you may recall, the SCSI subsystem has actually been designed to
handle disappearing devices, presumably for things like hot swapping, so
the xpt_bus_deregister() call has already invoked this call chain:

    xpt_bus_deregister(cam_sim_path(sc->umass_sim))
        xpt_async(AC_LOST_DEVICE, &bus_path, NULL)
            daasync(...)
                cam_periph_async(...)
                    cam_periph_invalidate(...)
                        daoninvalidate(...)
                            disk_gone(softc->disk) [in geom_disk.c]
                            xpt_print(periph->path, "lost device\n")

The xpt_print(...) gives us some feedback to the /var/log/messages file,
so we know where we are in the process. Under normal circumstances, when
there is no mounted file system to interfere with things, the call stack
would return to cam_periph_invalidate() and do the following:

    cam_periph_invalidate(...)
        camperiphfree(periph)
            dacleanup(periph) [called as periph->periph_dtor(periph) ]
                xpt_print(periph->path, "removing device entry\n");
                disk_destroy(softc->disk) [in geom_disk.c]
                    ...shimmering GEOM magic...

At this point, the disk is gone, and the periph struct will be released.

Unfortunately, in our case the file system still holds a reference to
the periph, which keeps it from being freed in cam_periph_invalidate().
Instead, we end up with an existing device that holds an invalid pointer
to the already released SIM struct, ultimately leading to system panic.

My patch for this problem simply adds an extra explicit daclose() call,
immediately after the "lost device" printing listed above, and sets the
periph pointer in our disk struct to NULL. Some of you may be worried
about that disk_destroy() call. At least, it suddenly made me worried,
so I just double-checked it. It's true that having a destroyed disk is
potentially dangerous, since the file system will normally access this
during the unmount. However, it turns out that GEOM is already prepared
for this case and has a special check in g_disk_access(), so the second
call to daclose() will never actually be made. The alternative option
would have been for GEOM to recognize the extra disk reference and keep
the struct alive until daclose(). In that case, the first statement in
that function would already be checking our periph and return ENXIO when
it's NULL, so we would still be safe. 

>From a system design perspective, it would be cleaner if the daclose()
call originated from the same layer as the original daopen(), but under
the circumstances I don't think this is too bad. It's hard to avoid...

Finally, take the above description with a grain of salt, since I had to
reconstruct some parts of it that were not found in my notes. Also, when
I tested the disk_destroy() stuff a few minutes ago, I realized that the
msdosfs_vfsops.c patch is as vital as the scsi_da.c one, which probably
means that the -o sync option is equally vital, since the msdosfs fix is
intended to be restricted to that. It looks like GEOM has another ENXIO
problem in this area, but I'm not planning to investigate that today...

/Henrik




More information about the freebsd-usb mailing list