[PATCH] fix ZFS prefetch code
Martin Matuska
martin at matuska.org
Wed Aug 10 18:05:13 UTC 2011
The attached patch is a solution for kern/157728.
The new zfs send/recv code doesn't play well with prefetching temporary
clones.
In Illumos, the prefetch code in zfs_ioc_dataset_list_next() is
different from our code:
/*
* Pre-fetch the datasets. dmu_objset_prefetch() always returns 0
* but is not declared void because its called by dmu_objset_find().
*/
if (zc->zc_cookie == 0) {
uint64_t cookie = 0;
int len = sizeof (zc->zc_name) - (p - zc->zc_name);
while (dmu_dir_list_next(os, len, p, NULL, &cookie) == 0)
(void) dmu_objset_prefetch(p NULL);
}
Compared to our code:
/*
* Pre-fetch the datasets. dmu_objset_prefetch() always returns 0
* but is not declared void because its called by dmu_objset_find().
*/
if (zc->zc_cookie == 0) {
uint64_t cookie = 0;
int len = sizeof (zc->zc_name) - (p - zc->zc_name);
while (dmu_dir_list_next(os, len, p, NULL, &cookie) == 0)
(void) dmu_objset_prefetch(zc->zc_name, NULL);
}
I have done some debugging and Pawel's change is right, with "p" it
doesn't work, dmu_objset_prefetch() rejects all entries on
dmu_objset_hold() and everything is just skipped because names are
incorrect) -> Solaris and Illumos do not prefetch datasets.
How does this race happen?
At the end of dmu_recv_existing_end() we destroy the temporary clone
without checking any result code:
(void) dsl_dataset_destroy(drc->drc_real_ds, dmu_recv_tag, B_FALSE);
dsl_dataset_destroy() fails if there is an extra hold on the dataset -
and dmu_objset_prefetch() is the function that places this hold.
Because the prefetch code in Solaris doesn't work (is unfixed), the
don't have this race.
Now we have this code and comments in dmu_objset.c
(dmu_objset_snapshot_one()):
/*
* If the objset starts with a '%', then ignore it unless it was
* explicitly named (ie, not recursive). These hidden datasets
* are always inconsistent, and by not opening them here, we can
* avoid a race with dsl_dir_destroy_check().
*/
cp = strrchr(name, '/');
if (cp && cp[1] == '%' && sn->recursive)
return (0);
(void) strcpy(sn->failed, name);
/*
* Check permissions if we are doing a recursive snapshot. The
* permission checks for the starting dataset have already been
* performed in zfs_secpolicy_snapshot()
*/
if (sn->recursive && (err = zfs_secpolicy_snapshot_perms(name,
CRED())))
return (err);
err = dmu_objset_hold(name, sn, &os);
if (err != 0)
return (err);
/*
* If the objset is in an inconsistent state (eg, in the process
* of being destroyed), don't snapshot it. As with %hidden
* datasets, we return EBUSY if this name was explicitly
* requested (ie, not recursive), and otherwise ignore it.
*/
if (os->os_dsl_dataset->ds_phys->ds_flags & DS_FLAG_INCONSISTENT) {
dmu_objset_rele(os, sn);
return (sn->recursive ? 0 : EBUSY);
}
This is exactly the same race I am talking about.
So when these datasets count as allways inconsistent, we have to do the
same in dmu_objset_prefetch() and I suggest we don't prefetch
inconsistent datasets.
I am submitting this to Illumos as well.
Please review attached patch.
--
Martin Matuska
FreeBSD committer
http://blog.vx.sk
-------------- next part --------------
Index: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_objset.c
===================================================================
--- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_objset.c (revision 224760)
+++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_objset.c (working copy)
@@ -1760,10 +1760,29 @@
dmu_objset_prefetch(const char *name, void *arg)
{
dsl_dataset_t *ds;
+ char *cp;
+ /*
+ * If the objset starts with a '%', then ignore it.
+ * These hidden datasets are always inconsistent and by not opening
+ * them here, we can avoid a race with dsl_dir_destroy_check().
+ */
+ cp = strrchr(name, '/');
+ if (cp && cp[1] == '%')
+ return (0);
+
if (dsl_dataset_hold(name, FTAG, &ds))
return (0);
+ /*
+ * If the objset is in an inconsistent state (eg, in the process
+ * of being destroyed), don't prefetch it.
+ */
+ if (ds->ds_phys->ds_flags & DS_FLAG_INCONSISTENT) {
+ dsl_dataset_rele(ds, FTAG);
+ return (0);
+ }
+
if (!BP_IS_HOLE(&ds->ds_phys->ds_bp)) {
mutex_enter(&ds->ds_opening_lock);
if (ds->ds_objset == NULL) {
More information about the zfs-devel
mailing list