[PATCH] fix ZFS prefetch code

Martin Matuska martin at matuska.org
Wed Aug 10 18:05:13 UTC 2011


The attached patch is a solution for kern/157728.

The new zfs send/recv code doesn't play well with prefetching temporary
clones.
In Illumos, the prefetch code in zfs_ioc_dataset_list_next() is
different from our code:

        /*
         * Pre-fetch the datasets.  dmu_objset_prefetch() always returns 0
         * but is not declared void because its called by dmu_objset_find().
         */
        if (zc->zc_cookie == 0) {
                uint64_t cookie = 0;
                int len = sizeof (zc->zc_name) - (p - zc->zc_name);

                while (dmu_dir_list_next(os, len, p, NULL, &cookie) == 0)
                        (void) dmu_objset_prefetch(p NULL);
        }

Compared to our code:
        /*
         * Pre-fetch the datasets.  dmu_objset_prefetch() always returns 0
         * but is not declared void because its called by dmu_objset_find().
         */
        if (zc->zc_cookie == 0) {
                uint64_t cookie = 0;
                int len = sizeof (zc->zc_name) - (p - zc->zc_name);

                while (dmu_dir_list_next(os, len, p, NULL, &cookie) == 0)
                        (void) dmu_objset_prefetch(zc->zc_name, NULL);
        }

I have done some debugging and Pawel's change is right, with "p" it
doesn't work, dmu_objset_prefetch() rejects all entries on
dmu_objset_hold() and everything is just skipped because names are
incorrect) -> Solaris and Illumos do not prefetch datasets.

How does this race happen?
At the end of dmu_recv_existing_end() we destroy the temporary clone
without checking any result code:

(void) dsl_dataset_destroy(drc->drc_real_ds, dmu_recv_tag, B_FALSE);

dsl_dataset_destroy() fails if there is an extra hold on the dataset -
and dmu_objset_prefetch() is the function that places this hold.
Because the prefetch code in Solaris doesn't work (is unfixed), the
don't have this race.

Now we have this code and comments in dmu_objset.c
(dmu_objset_snapshot_one()):
        /*
         * If the objset starts with a '%', then ignore it unless it was
         * explicitly named (ie, not recursive).  These hidden datasets
         * are always inconsistent, and by not opening them here, we can
         * avoid a race with dsl_dir_destroy_check().
         */
        cp = strrchr(name, '/');
        if (cp && cp[1] == '%' && sn->recursive)
                return (0);

        (void) strcpy(sn->failed, name);

        /*
         * Check permissions if we are doing a recursive snapshot.  The
         * permission checks for the starting dataset have already been
         * performed in zfs_secpolicy_snapshot()
         */
        if (sn->recursive && (err = zfs_secpolicy_snapshot_perms(name,
CRED())))
                return (err);

        err = dmu_objset_hold(name, sn, &os);
        if (err != 0)
                return (err);

        /*
         * If the objset is in an inconsistent state (eg, in the process
         * of being destroyed), don't snapshot it.  As with %hidden
         * datasets, we return EBUSY if this name was explicitly
         * requested (ie, not recursive), and otherwise ignore it.
         */
        if (os->os_dsl_dataset->ds_phys->ds_flags & DS_FLAG_INCONSISTENT) {
                dmu_objset_rele(os, sn);
                return (sn->recursive ? 0 : EBUSY);
        }

This is exactly the same race I am talking about.
So when these datasets count as allways inconsistent, we have to do the
same in dmu_objset_prefetch() and I suggest we don't prefetch
inconsistent datasets.
I am submitting this to Illumos as well.

Please review attached patch.

-- 
Martin Matuska
FreeBSD committer
http://blog.vx.sk

-------------- next part --------------
Index: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_objset.c
===================================================================
--- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_objset.c	(revision 224760)
+++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_objset.c	(working copy)
@@ -1760,10 +1760,29 @@
 dmu_objset_prefetch(const char *name, void *arg)
 {
 	dsl_dataset_t *ds;
+	char *cp;
 
+	/*
+	 * If the objset starts with a '%', then ignore it.
+	 * These hidden datasets are always inconsistent and by not opening
+	 * them here, we can avoid a race with dsl_dir_destroy_check().
+	 */
+	cp = strrchr(name, '/');
+	if (cp && cp[1] == '%')
+		return (0);
+
 	if (dsl_dataset_hold(name, FTAG, &ds))
 		return (0);
 
+	/*
+	 * If the objset is in an inconsistent state (eg, in the process
+	 * of being destroyed), don't prefetch it.
+	 */
+	if (ds->ds_phys->ds_flags & DS_FLAG_INCONSISTENT) {
+		dsl_dataset_rele(ds, FTAG);
+		return (0);
+	}
+
 	if (!BP_IS_HOLE(&ds->ds_phys->ds_bp)) {
 		mutex_enter(&ds->ds_opening_lock);
 		if (ds->ds_objset == NULL) {


More information about the zfs-devel mailing list