g_mirror_access() dropping geom topology_lock [Was: Kernel crash trying to import a ZFS pool with log device]

Alexander Motin mav at FreeBSD.org
Fri Mar 21 10:43:47 UTC 2014


On 21.03.2014 11:37, Andriy Gapon wrote:
> on 20/03/2014 23:13 Andreas Longwitz said the following:
> [snip]
>> But if I now run "zpool export" and "zpool import" the kernel crashes:
>> ...
>> vdev_geom_open_by_path:554[1]: Found provider by name /dev/label/C325BL31.
>> vdev_geom_attach:102[1]: Attaching to label/C325BL31.
>> g_access(0xffffff0096b23a00(label/C325BL31), 1, 0, 1)
>> open delta:[r1w0e1] old:[r0w0e0] provider:[r0w0e0]
>> 0xffffff0002c0d400(label/C325BL31)
>> g_access(0xffffff0002ba7300(da2), 1, 0, 2)
>> open delta:[r1w0e2] old:[r0w0e0] provider:[r0w0e0] 0xffffff0002a23800(da2)
>> g_disk_access(da2, 1, 0, 2)
>> vdev_geom_attach:123[1]: Created geom and consumer for label/C325BL31.
>> vdev_geom_read_config:248[1]: Reading config from label/C325BL31...
>> vdev_geom_open_by_path:569[1]: guid match for provider /dev/label/C325BL31.
>> vdev_geom_open_by_path:554[1]: Found provider by name /dev/label/C330CJHW.
>> vdev_geom_attach:102[1]: Attaching to label/C330CJHW.
>> g_access(0xffffff00969a5280(label/C330CJHW), 1, 0, 1)
>> open delta:[r1w0e1] old:[r0w0e0] provider:[r0w0e0]
>> 0xffffff0002c02100(label/C330CJHW)
>> g_access(0xffffff0002b96280(da3), 1, 0, 2)
>> open delta:[r1w0e2] old:[r0w0e0] provider:[r0w0e0] 0xffffff0002a23b00(da3)
>> g_disk_access(da3, 1, 0, 2)
>> vdev_geom_attach:143[1]: Created consumer for label/C330CJHW.
>> vdev_geom_read_config:248[1]: Reading config from label/C330CJHW...
>> vdev_geom_open_by_path:569[1]: guid match for provider /dev/label/C330CJHW.
>> vdev_geom_open_by_path:554[1]: Found provider by name /dev/mirror/gm0p3.
>> vdev_geom_attach:102[1]: Attaching to mirror/gm0p3.
>> g_access(0xffffff0096b24180(mirror/gm0p3), 1, 0, 1)
>> open delta:[r1w0e1] old:[r0w0e0] provider:[r0w0e0]
>> 0xffffff00969c7e00(mirror/gm0p3)
>> g_part_access(mirror/gm0p3,1,0,1)
>> g_access(0xffffff0096c0f800(mirror/gm0), 1, 0, 1)
>> open delta:[r1w0e1] old:[r8w8e16] provider:[r8w8e16]
>> 0xffffff00969c7d00(mirror/gm0)
>> GEOM_MIRROR[2]: Access request for mirror/gm0: r1w0e1.
>
> The following part of the log is very informative.
> Thank you for capturing it.
>
>> vdev_geom_attach:143[1]: Created consumer for mirror/gm0p3.
>> vdev_geom_read_config:248[1]: Reading config from mirror/gm0p3...
>> vdev_geom_open_by_path:569[1]: guid match for provider /dev/mirror/gm0p3.
>
> I read the above as thread A entering vdev_geom_open_by_path and trying to
> "taste" /dev/mirror/gm0p3.
>
>> g_post_event_x(0xffffffff80b16830, 0xffffff0096b24180, 2, 0)
>> vdev_geom_detach:163[1]: Closing access to mirror/gm0p3.
>> g_access(0xffffff0096b24180(mirror/gm0p3), -1, 0, -1)
>> open delta:[r-1w0e-1] old:[r1w0e1] provider:[r1w0e1]
>> 0xffffff00969c7e00(mirror/gm0p3)
>
>
> Simultaneously thread B is closing access /dev/mirror/gm0p3.
> It is not clear from the quoted log when and how this thread B got access to the
> device in the first place.
>
>> g_part_access(mirror/gm0p3,-1,0,-1)
>> g_access(0xffffff0096c0f800(mirror/gm0), -1, 0, -1)
>> open delta:[r-1w0e-1] old:[r9w8e17] provider:[r9w8e17]
>> 0xffffff00969c7d00(mirror/gm0)
>> GEOM_MIRROR[2]: Access request for mirror/gm0: r-1w0e-1.
>> vdev_geom_open_by_path:554[1]: Found provider by name /dev/mirror/gm0p3.
>> vdev_geom_attach:102[1]: Attaching to mirror/gm0p3.
>> vdev_geom_attach:128[1]: Found consumer for mirror/gm0p3.
>> g_access(0xffffff0096b24180(mirror/gm0p3), 1, 0, 1)
>> open delta:[r1w0e1] old:[r1w0e1] provider:[r1w0e1]
>> 0xffffff00969c7e00(mirror/gm0p3)
>
> Thread A is accessing the device.
>
>> g_part_access(mirror/gm0p3,1,0,1)
>> g_access(0xffffff0096c0f800(mirror/gm0), 1, 0, 1)
>> Open delta:[r1w0e1] old:[r9w8e17] provider:[r9w8e17]
>> 0xffffff00969c7d00(mirror/gm0)
>> GEOM_MIRROR[2]: Access request for mirror/gm0: r1w0e1.
>> vdev_geom_detach:167[1]: Destroyed consumer to mirror/gm0p3.
>> g_detach(0xffffff0096b24180)
>> g_destroy_consumer(0xffffff0096b24180)
>
> Thread B is destroying a special ZFS "taster" consumer.
>
>> vdev_geom_attach:147[1]: Used existing consumer for mirror/gm0p3.
>
> Thread A is trying to re-use the taster consumer, which has just been destroyed.
>
>> vdev_geom_read_config:248[1]:
>>
>> Fatal trap 12: page fault while in kernel mode
>
> Boom!
>
> I see two issues here.
> First, the ZFS tasting code could be made more robust.  If it never tried to
> re-use the consumer and always created a new one, then most likely this crash
> could be avoided.  But there is no bug in the code.  The code is correct and it
> it uses GEOM topology lock to avoid any concurrency issues.
>
> But GEOM mirror code breaks a contract on which the ZFS code relies.
> g_access() must be called with the topology lock hold.
> I extend this requirement to a requirement that access method of any GEOM
> provider must operate under the topology lock and must never drop it.
> In other words, if a caller must acquire g_topology_lock before calling
> g_access, then in return it must have a guarantee that the GEOM topology stays
> unchanged across the call to g_access().
> g_mirror_access() breaks the above contract.
>
> So, the code in vdev_geom_attach() obtains g_topology_lock, then it finds an
> existing valid consumer and calls g_access() on it.  It reasonably expects that
> the consumer remains valid, but because g_mirror_access() drops and requires the
> topology lock, there is a chance that the topology can change and the consumer
> may become invalid.
>
> I am not very familiar with gmirror code, so I am not sure how to fix the
> problem from that end.

I can confirm this. I know about this problem for some time already. The 
same issue as shown in GMIRROR is also present in GRAID. AFAIR the 
problem is in keeping lock order between GEOM topology lock and class' 
own lock.

The only "excuse" is that it is not very reasonable to have ZFS on top 
of GMIRROR or GRAID.

-- 
Alexander Motin


More information about the freebsd-fs mailing list