Exactly that commit (was Re: Latest -current 100% hang at the late boot stage)
avg at FreeBSD.org
Thu Jun 23 12:51:40 UTC 2011
on 22/06/2011 23:09 Kenneth D. Merry said the following:
> The GEOM event thread is stuck sleeping in the mtx_sleep() call above. So
> that tells me that one of several things is going on:
> - There is a path in the cd(4) driver where it can call cam_periph_hold()
> but not cam_periph_unhold().
> - There is another thread in the system that has called cam_periph_hold(),
> and has gotten stuck before it can call cam_periph_unhold().
> - The hold/unhold logic is broken, and there is a case where a thread
> waiting for the lock can miss the wakeup. After looking at the code, I
> don't think this is the case, but I may have missed something.
> So it is probably one of the first two cases. From the dmesg, I only see
> cd1 listed, not cd0. So it is possible that cd0 is stuck in the probe code
> somewhere, and the geom code just gets stuck trying to open it when the
> probe hasn't completed.
> Seeing the stack trace for the taskq thread that is running on CPU 0
> (process 100014) might be enlightening, it's hard to say. That may or may
> not show the issue.
> It's possible that this issue is directly related to the commit in
> question; perhaps there is an error being returned that wasn't returned
> before and it isn't being handled right in the cd(4) driver. (The cd(4)
> driver wasn't touched in the commit.)
> It's also possible that the commit in question just changed the timing and
> your system is hitting a race that was there previously.
I have a suspicion that this is actually the case.
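
To make the first two possibilities concrete, here is a minimal userland sketch of the hold/unhold pattern using pthreads. It is only an analogue and not the actual CAM code: struct periph, periph_init(), periph_hold_timed() and periph_unhold() are hypothetical names, and the timeout is there only so that the demonstration returns instead of sleeping forever the way the mtx_sleep() call does in the report.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical stand-in for a peripheral with a single "held" flag. */
struct periph {
	pthread_mutex_t	mtx;
	pthread_cond_t	cv;
	bool		held;
};

void
periph_init(struct periph *p)
{
	pthread_mutex_init(&p->mtx, NULL);
	pthread_cond_init(&p->cv, NULL);
	p->held = false;
}

/*
 * Take the hold, waiting at most timeout_ms for the current holder to
 * release it.  Returns 0 on success or ETIMEDOUT if the hold was never
 * released.  The flag is re-checked under the mutex after every wakeup,
 * so a wakeup cannot be missed between the check and the sleep.
 */
int
periph_hold_timed(struct periph *p, long timeout_ms)
{
	struct timespec deadline;
	int error = 0;

	clock_gettime(CLOCK_REALTIME, &deadline);
	deadline.tv_sec += timeout_ms / 1000;
	deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
	if (deadline.tv_nsec >= 1000000000L) {
		deadline.tv_sec++;
		deadline.tv_nsec -= 1000000000L;
	}

	pthread_mutex_lock(&p->mtx);
	while (p->held && error == 0)
		error = pthread_cond_timedwait(&p->cv, &p->mtx, &deadline);
	if (!p->held) {
		p->held = true;
		error = 0;
	}
	pthread_mutex_unlock(&p->mtx);
	return (error);
}

/* Release the hold and wake up every waiter. */
void
periph_unhold(struct periph *p)
{
	pthread_mutex_lock(&p->mtx);
	assert(p->held);	/* unhold without a matching hold is a bug */
	p->held = false;
	pthread_cond_broadcast(&p->cv);
	pthread_mutex_unlock(&p->mtx);
}
```

If a holder never calls periph_unhold(), either because a code path forgets it or because that thread is itself stuck, a second periph_hold_timed() call fails with ETIMEDOUT. Without the timeout the waiter would sleep indefinitely, which is what the stuck GEOM event thread looks like.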
More than once I've seen the kernel boot under qemu get stuck non-deterministically
in the cd driver. Other people have also bumped into this.
E.g., here's one of the reports that I googled up; it's not exactly the same as
what ache has reported, but somewhat similar: