[ZFS][PATCH] Read-only root file system prevents SPA event processing

Justin T. Gibbs gibbs at scsiguy.com
Fri Feb 18 21:34:27 UTC 2011


ZFS blocks certain types of asynchronous event processing when the
root file system is mounted read-only.  This can prevent ZFS from
closing leaf vdev devices when they go away, from re-probing vdevs when
they experience an error, and from performing certain re-silver actions.

See the lengthy change comments in the attached patch (against a fairly
recent -current) for a full description of the bug and how I fixed it.

Review and comments welcome.

--
Justin
-------------- next part --------------
Change 478953 by justing at justing-ns1 on 2011/02/18 09:54:56

	Allow ZFS asynchronous event handling to proceed even if the
	root file system is mounted read-only.  This restriction appears
	to have been put in place to avoid errors with updating the
	configuration cache file.  However:
	
	  o The majority of asynchronous event handling does not involve
	    configuration cache file updates.
	  o The configuration cache file need not be on the root file system,
	    so the check was not complete.
	  o Other classes of errors (e.g. file system full) can also prevent
	    a successful update yet do not prevent asynchronous event
	    processing.
	  o Configurations such as NanoBSD never have a read-write root,
	    so ZFS event processing is permanently disabled in these
	    systems.
	  o Failure to handle asynchronous events promptly can extend the
	    window of time that a pool is in a critical state.
	
	At worst, a missed configuration cache update will force the
	operator to perform a manual "zfs import" (note -f is not required)
	to inform the system about a newly created pool.  To minimize the
	likelihood of this rare occurrence, configuration cache write failures
	now emit FMA events (via devctl) so the operator can take corrective
	action, and the write is retried every 5 minutes.  The retry interval,
	in seconds, is tunable via the sysctl "vfs.zfs.ccw_retry_interval".
	
	As a side effect of reporting configuration cache events, other
	sysevents, such as re-silver start/stop, are now also reported via
	devctl.
	
	sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c:
		o As is done in zfs_fm.c, provide a manual declaration for
		  devctl_notify().  Both declarations could be combined
		  into spa_impl.h, but the declaration is fault management
		  related, not spa specific.  sys/fm/fs/zfs.h would be ideal
		  if it weren't so public and reserved for FMA string
		  definitions.  I'm open to suggestions on how to improve
		  this nit while minimizing our divergence from Solaris.
		o Use devctl_notify() to implement sysevent support in
		  spa_event_notify().  The subsystem is EC_ZFS so that
		  these events can never collide with those emitted in
		  zfs_fm.c.
		o Add the sysctl "vfs.zfs.ccw_retry_interval".  The value
		  defaults to 5 minutes and is used to rate limit, on a
		  per-pool basis, configuration cache file write attempts.
		o Modify spa_async_dispatch to honor configuration cache
		  write limiting.  If other events are pending, a configuration
		  cache write will be attempted at the same time, so the
		  rate limiting only applies when the asynchronous dispatch
		  system is otherwise idle.  Async events should be rare
		  (e.g. device arrival/departure) and configuration cache
		  writes rarer, so a more complicated system to strictly
		  honor the retry limit seems unwarranted.
		o Remove check in spa_async_dispatch() for the root file
		  system being read-write.
	
	sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa_config.c:
		Instead of silently ignoring configuration cache write
		failures, report them via a new FMA event as well as
		to the console.  The current zfs_ereport_post() doesn't
		allow arbitrary name=value pairs to be appended to the
		report, so the configuration cache file name is only
		available on the console output.  This limitation should
		be addressed in a future update.
	
	sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/spa_impl.h:
		Add a uint64_t to the spa data structure to track the
		time (via LBOLT) of the last configuration cache file
		write failure.  This is referenced in spa_async_dispatch()
		to effect the rate limiting.
	
	sys/cddl/contrib/opensolaris/uts/common/sys/fm/fs/zfs.h:
		Add FM_EREPORT_ZFS_CONFIG_CACHE_WRITE as an ereport class.

Affected files ...

... //depot/SpectraBSD/head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c#5 edit
... //depot/SpectraBSD/head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa_config.c#3 edit
... //depot/SpectraBSD/head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/spa_impl.h#5 edit
... //depot/SpectraBSD/head/sys/cddl/contrib/opensolaris/uts/common/sys/fm/fs/zfs.h#3 edit

Differences ...

==== //depot/SpectraBSD/head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c#5 (text) ====

@@ -62,13 +62,29 @@
 #include "zfs_prop.h"
 #include "zfs_comutil.h"
 
+#ifdef _KERNEL
+/* Including sys/bus.h is just too hard, so I declare what I need here. */
+extern void devctl_notify(const char *__system, const char *__subsystem,
+    const char *__type, const char *__data);
+#endif
+
 /* Check hostid on import? */
 static int check_hostid = 1;
 
+/*
+ * The interval at which failed configuration cache file writes
+ * should be retried.
+ */
+static int zfs_ccw_retry_interval = 300;
+
 SYSCTL_DECL(_vfs_zfs);
 TUNABLE_INT("vfs.zfs.check_hostid", &check_hostid);
 SYSCTL_INT(_vfs_zfs, OID_AUTO, check_hostid, CTLFLAG_RW, &check_hostid, 0,
     "Check hostid on import?");
+TUNABLE_INT("vfs.zfs.ccw_retry_interval", &zfs_ccw_retry_interval);
+SYSCTL_INT(_vfs_zfs, OID_AUTO, ccw_retry_interval, CTLFLAG_RW,
+    &zfs_ccw_retry_interval, 0,
+    "Configuration cache file write, retry after failure, interval (seconds)");
 
 enum zti_modes {
 	zti_mode_fixed,			/* value is # of threads (min 1) */
@@ -3819,13 +3835,33 @@
 	mutex_exit(&spa->spa_async_lock);
 }
 
+static int
+spa_async_tasks_pending(spa_t *spa)
+{
+	u_int non_config_tasks;
+	u_int config_task;
+	uint64_t time_since_ccw_failure;
+	boolean_t config_task_suspended;
+
+	non_config_tasks = spa->spa_async_tasks & ~SPA_ASYNC_CONFIG_UPDATE;
+	config_task = spa->spa_async_tasks & SPA_ASYNC_CONFIG_UPDATE;
+	if (time_since_ccw_failure == 0)
+		config_task_suspended = B_FALSE;
+	else
+		config_task_suspended = (LBOLT - spa->spa_ccw_fail_time)
+		    < (zfs_ccw_retry_interval * hz);
+
+	return (non_config_tasks || (config_task && !config_task_suspended));
+}
+
 static void
 spa_async_dispatch(spa_t *spa)
 {
 	mutex_enter(&spa->spa_async_lock);
-	if (spa->spa_async_tasks && !spa->spa_async_suspended &&
+	if (spa_async_tasks_pending(spa) &&
+	    !spa->spa_async_suspended &&
 	    spa->spa_async_thread == NULL &&
-	    rootdir != NULL && !vn_is_readonly(rootdir))
+	    rootdir != NULL)
 		spa->spa_async_thread = thread_create(NULL, 0,
 		    spa_async_thread, spa, 0, &p0, TS_RUN, maxclsyspri);
 	mutex_exit(&spa->spa_async_lock);
@@ -4479,52 +4515,40 @@
 void
 spa_event_notify(spa_t *spa, vdev_t *vd, const char *name)
 {
-#if 0
 #ifdef _KERNEL
-	sysevent_t		*ev;
-	sysevent_attr_list_t	*attr = NULL;
-	sysevent_value_t	value;
-	sysevent_id_t		eid;
+	char buf[1024];
+	struct sbuf sb;
+	struct timespec ts;
+	int error;
+
+	nanotime(&ts);
 
-	ev = sysevent_alloc(EC_ZFS, (char *)name, SUNW_KERN_PUB "zfs",
-	    SE_SLEEP);
+	sbuf_new(&sb, buf, sizeof(buf), SBUF_FIXEDLEN);
+	sbuf_printf(&sb, "time=%ju.%ld", (uintmax_t)ts.tv_sec, ts.tv_nsec);
 
-	value.value_type = SE_DATA_TYPE_STRING;
-	value.value.sv_string = spa_name(spa);
-	if (sysevent_add_attr(&attr, ZFS_EV_POOL_NAME, &value, SE_SLEEP) != 0)
-		goto done;
+	/*
+	 * Serialize sysevent and ereport generation.
+	 */
+	mutex_enter(&spa->spa_errlist_lock);
 
-	value.value_type = SE_DATA_TYPE_UINT64;
-	value.value.sv_uint64 = spa_guid(spa);
-	if (sysevent_add_attr(&attr, ZFS_EV_POOL_GUID, &value, SE_SLEEP) != 0)
-		goto done;
+	sbuf_printf(&sb, " %s=%s", ZFS_EV_POOL_NAME, spa_name(spa));
+	sbuf_printf(&sb, " %s=%ju", ZFS_EV_POOL_GUID, spa_guid(spa));
 
 	if (vd) {
-		value.value_type = SE_DATA_TYPE_UINT64;
-		value.value.sv_uint64 = vd->vdev_guid;
-		if (sysevent_add_attr(&attr, ZFS_EV_VDEV_GUID, &value,
-		    SE_SLEEP) != 0)
-			goto done;
+		sbuf_printf(&sb, " %s=%ju", ZFS_EV_VDEV_GUID, vd->vdev_guid);
 
 		if (vd->vdev_path) {
-			value.value_type = SE_DATA_TYPE_STRING;
-			value.value.sv_string = vd->vdev_path;
-			if (sysevent_add_attr(&attr, ZFS_EV_VDEV_PATH,
-			    &value, SE_SLEEP) != 0)
-				goto done;
+			sbuf_printf(&sb, " %s=%s", ZFS_EV_VDEV_PATH,
+			    vd->vdev_path);
 		}
 	}
 
-	if (sysevent_attach_attributes(ev, attr) != 0)
-		goto done;
-	attr = NULL;
+	mutex_exit(&spa->spa_errlist_lock);
 
-	(void) log_sysevent(ev, SE_SLEEP, &eid);
-
-done:
-	if (attr)
-		sysevent_free_attr(attr);
-	sysevent_free(ev);
-#endif
+	error = sbuf_finish(&sb);
+	devctl_notify("ZFS", EC_ZFS, name, sbuf_data(&sb));
+	if (error != 0)
+		printf("ZFS WARNING: sysevent sbuf overflowed.\n");
+	sbuf_delete(&sb);
 #endif
 }

==== //depot/SpectraBSD/head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa_config.c#3 (text) ====

@@ -25,6 +25,7 @@
  */
 
 #include <sys/zfs_context.h>
+#include <sys/fm/fs/zfs.h>
 #include <sys/spa.h>
 #include <sys/spa_impl.h>
 #include <sys/nvpair.h>
@@ -151,20 +152,21 @@
 		kobj_close_file(file);
 }
 
-static void
+static int
 spa_config_write(spa_config_dirent_t *dp, nvlist_t *nvl)
 {
 	int oflags = FWRITE | FTRUNC | FCREAT | FOFFMAX;
 	char *buf, *temp;
 	size_t buflen;
 	vnode_t *vp;
+	int err;
 
 	/*
 	 * If the nvlist is empty (NULL), then remove the old cachefile.
 	 */
 	if (nvl == NULL) {
-		(void) vn_remove(dp->scd_path, UIO_SYSSPACE, RMFILE);
-		return;
+		err = vn_remove(dp->scd_path, UIO_SYSSPACE, RMFILE);
+		return (err);
 	}
 
 	/*
@@ -185,11 +187,12 @@
 	 */
 	(void) snprintf(temp, MAXPATHLEN, "%s.tmp", dp->scd_path);
 
-	if (vn_open(temp, UIO_SYSSPACE, oflags, 0644, &vp, CRCREAT, 0) == 0) {
-		if (vn_rdwr(UIO_WRITE, vp, buf, buflen, 0, UIO_SYSSPACE,
-		    0, RLIM64_INFINITY, kcred, NULL) == 0 &&
-		    VOP_FSYNC(vp, FSYNC, kcred, NULL) == 0) {
-			(void) vn_rename(temp, dp->scd_path, UIO_SYSSPACE);
+	err = vn_open(temp, UIO_SYSSPACE, oflags, 0644, &vp, CRCREAT, 0);
+	if (err == 0) {
+		if ((err = vn_rdwr(UIO_WRITE, vp, buf, buflen, 0, UIO_SYSSPACE,
+		    0, RLIM64_INFINITY, kcred, NULL)) == 0 &&
+		    (err = VOP_FSYNC(vp, FSYNC, kcred, NULL)) == 0) {
+			err = vn_rename(temp, dp->scd_path, UIO_SYSSPACE);
 		}
 		(void) VOP_CLOSE(vp, oflags, 1, 0, kcred, NULL);
 	}
@@ -198,6 +201,7 @@
 
 	kmem_free(buf, buflen);
 	kmem_free(temp, MAXPATHLEN);
+	return (err);
 }
 
 /*
@@ -209,6 +213,8 @@
 {
 	spa_config_dirent_t *dp, *tdp;
 	nvlist_t *nvl;
+	boolean_t ccw_failure;
+	int error;
 
 	ASSERT(MUTEX_HELD(&spa_namespace_lock));
 
@@ -220,6 +226,7 @@
 	 * cachefile is changed, the new one is pushed onto this list, allowing
 	 * us to update previous cachefiles that no longer contain this pool.
 	 */
+	ccw_failure = B_FALSE;
 	for (dp = list_head(&target->spa_config_list); dp != NULL;
 	    dp = list_next(&target->spa_config_list, dp)) {
 		spa_t *spa = NULL;
@@ -252,10 +259,35 @@
 			mutex_exit(&spa->spa_props_lock);
 		}
 
-		spa_config_write(dp, nvl);
+		error = spa_config_write(dp, nvl);
+		if (error != 0) {
+	
+			printf("ZFS ERROR: Update of cache file %s failed: "
+			    "Errno %d\n", dp->scd_path, error);
+			ccw_failure = B_TRUE;
+		}
+
 		nvlist_free(nvl);
 	}
 
+	if (ccw_failure) {
+		/*
+		 * Keep trying so that configuration data is 
+		 * written if/when any temporary filesystem
+		 * resource issues are resolved.
+		 */
+		target->spa_ccw_fail_time = LBOLT;
+		spa_async_request(target, SPA_ASYNC_CONFIG_UPDATE);
+		zfs_ereport_post(FM_EREPORT_ZFS_CONFIG_CACHE_WRITE,
+		    target, NULL, NULL, 0, 0);
+	} else {
+		/*
+		 * Do not rate limit future attempts to update
+		 * the config cache.
+		 */
+		target->spa_ccw_fail_time = 0;
+	}
+
 	/*
 	 * Remove any config entries older than the current one.
 	 */

==== //depot/SpectraBSD/head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/spa_impl.h#5 (text) ====

@@ -172,6 +172,7 @@
 	int		spa_minref;		/* num refs when first opened */
 	int		spa_mode;		/* FREAD | FWRITE */
 	spa_log_state_t spa_log_state;		/* log state */
+	uint64_t	spa_ccw_fail_time;	/* Conf cache write fail time */
 	/*
 	 * spa_refcnt & spa_config_lock must be the last elements
 	 * because refcount_t changes size based on compilation options.

==== //depot/SpectraBSD/head/sys/cddl/contrib/opensolaris/uts/common/sys/fm/fs/zfs.h#3 (text) ====

@@ -46,6 +46,7 @@
 #define	FM_EREPORT_ZFS_IO_FAILURE		"io_failure"
 #define	FM_EREPORT_ZFS_PROBE_FAILURE		"probe_failure"
 #define	FM_EREPORT_ZFS_LOG_REPLAY		"log_replay"
+#define	FM_EREPORT_ZFS_CONFIG_CACHE_WRITE	"config_cache_write"
 
 #define	FM_EREPORT_PAYLOAD_ZFS_POOL		"pool"
 #define	FM_EREPORT_PAYLOAD_ZFS_POOL_FAILMODE	"pool_failmode"


More information about the freebsd-fs mailing list