Fwd: [zfs-discuss] ZFS working group and feature flags proposal [SEC=UNCLASSIFIED]

Wilkinson, Alex alex.wilkinson at dsto.defence.gov.au
Thu May 26 02:03:11 UTC 2011


FYI

----- Forwarded message from Matthew Ahrens <mahrens at delphix.com> -----

Date: Wed, 25 May 2011 12:02:04 -0700
From: Matthew Ahrens <mahrens at delphix.com>
To: zfs-discuss at opensolaris.org, developer at lists.illumos.org
Subject: [zfs-discuss] ZFS working group and feature flags proposal
List-Id: <zfs-discuss.opensolaris.org>

The community of developers working on ZFS continues to grow, as does
the diversity of companies betting big on ZFS.  We wanted a forum for
these developers to coordinate their efforts and exchange ideas, so
the ZFS working group was formed.
The working group encourages new membership.  In order to maintain the
group's focus on ZFS development, candidates should demonstrate
significant and ongoing contribution to ZFS.

The first product of the working group is the design for a ZFS on-disk
versioning method that will allow for distributed development of ZFS
on-disk format changes without further explicit coordination. This
method eliminates the problem of two developers both allocating
version number 31 to mean their own feature.

This "feature flags" versioning allows unknown versions to be
identified, and in many cases the ZFS pool or filesystem can be
accessed read-only even in the presence of unknown on-disk features.
My proposal covers versioning of the SPA/zpool, ZPL/zfs, send stream,
and allocation of compression and checksum identifiers (enum values).

We plan to implement the feature flags this summer, and aim to
integrate them into illumos.  I welcome feedback on my proposal, and I'd
especially like to hear from people doing ZFS development -- what are
you working on?  Does this meet your needs?  If we implement it, will
you use it?

Thanks,
--matt

ZFS Feature Flags proposal, version 1.0, May 25th 2011

===============================
ON-DISK FORMAT CHANGES
===============================

for SPA/zpool versioning:
	new pool version = SPA_VERSION_FEATURES = 1000
	ZAP objects in MOS, pointed to by DMU_POOL_DIRECTORY_OBJECT = 1
		"features_for_read" -> { feature name -> nonzero if in use }
		"features_for_write" -> { feature name -> nonzero if in use }
		"feature_descriptions" -> { feature name -> description }
	Note that a pool can't be opened "write-only", so the
	features_for_read are always required.  A given feature should
	be stored in either features_for_read or features_for_write, not
	both.
	Note that if a feature is "promoted" from a company-private
	feature to part of a larger distribution (eg. illumos), this can
	be handled in a variety of ways, all of which can be implemented
	with code added at that time, without changing the on-disk
	format.
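The refcount convention for these ZAP objects (a feature entry is present once enabled, and its value is nonzero iff the feature is in use) can be sketched with a toy in-memory map.  This is purely illustrative; `feature_map_t` and these helpers are invented names, not the MOS/ZAP interfaces:

```c
#include <stdint.h>
#include <string.h>

/*
 * Toy in-memory stand-in for a features_for_* ZAP object.  An entry
 * with value 0 means the feature is enabled but not yet in use; a
 * nonzero value means it is in use.  Not the real MOS/ZAP API.
 */
#define	MAX_FEATURES	16

typedef struct feature_map {
	const char *fm_names[MAX_FEATURES];
	uint64_t fm_refcounts[MAX_FEATURES];
	int fm_count;
} feature_map_t;

/* Return a pointer to the feature's refcount, or NULL if absent. */
static uint64_t *
feature_lookup(feature_map_t *fm, const char *name)
{
	for (int i = 0; i < fm->fm_count; i++) {
		if (strcmp(fm->fm_names[i], name) == 0)
			return (&fm->fm_refcounts[i]);
	}
	return (NULL);
}

/* Enable a feature (refcount 0) if not already present. */
static void
feature_enable(feature_map_t *fm, const char *name)
{
	if (feature_lookup(fm, name) == NULL && fm->fm_count < MAX_FEATURES) {
		fm->fm_names[fm->fm_count] = name;
		fm->fm_refcounts[fm->fm_count] = 0;
		fm->fm_count++;
	}
}
```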

for ZPL/zfs versioning:
	new zpl version = ZPL_VERSION_FEATURES = 1000
	same 3 ZAP objects as above, but pointed to by MASTER_NODE_OBJ = 1
		"features_for_read" -> { feature name -> nonzero if in use }
		"features_for_write" -> { feature name -> nonzero if in use }
		"feature_descriptions" -> { feature name -> description }
	Note that the namespace for ZPL features is separate from SPA
	features (like version numbers), so the same feature name can be
	used for both (eg. for related SPA and ZPL features), but
	compatibility-wise this is not treated specially.

for compression:
	must be at pool version SPA_VERSION_FEATURES
	ZAP object in MOS, pointed to by POOL_DIR_OBJ:
		"compression_algos" -> { algo name -> enum value }
	Existing enum values (0-14) must stay the same, but new
	algorithms may have different enum values in different pools.
	Note that this simply defines the enum value.  If a new algorithm
	is in use, there must also be a corresponding feature in
	features_for_read with a nonzero value.  For simplicity, all
	algorithms, including legacy algorithms with fixed values (lzjb,
	gzip, etc) should be stored here (pending evaluation of
	prototype code -- this may be more trouble than it's worth).

for checksum:
	must be at pool version SPA_VERSION_FEATURES
	ZAP object in MOS, pointed to by POOL_DIR_OBJ:
		"checksum_algos" -> { algo name -> enum value }
	All notes for compression_algos apply here too.

Must also store copy of what's needed to read the MOS in label nvlist:
	"features_for_read" -> { feature name -> nonzero if in use }
	"compression_algos" -> { algo name -> enum value }
	"checksum_algos" -> { algo name -> enum value }

	ZPL information is never needed.
	It's fine to store complete copies of these objects in the label.
	However, space in the label is limited.  It's only *required* to
	store information needed to read the MOS so we can get to the
	definitive version of this information.  Eg, if a new
	compression algo is introduced but never used in the MOS, it
	need not be added to the label.  Legacy algos with fixed values
	may be omitted from the label nvlist (eg. lzjb, fletcher4).

	The values in the nvlist features_for_read map may be different
	from the values in the MOS features_for_read.  However, they
	must be the same when interpreted as a boolean (ie, the nvlist
	value != 0 iff the MOS value != 0).  This is so that the nvlist
	map need not be updated whenever the "reference count" on a
	feature changes, only when it changes to/from zero.
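The update rule above reduces to a boolean transition check.  A minimal sketch, with a hypothetical helper name not taken from the proposal:

```c
#include <stdint.h>

/*
 * The label's copy of features_for_read only needs rewriting when a
 * feature's MOS reference count crosses zero, because the label value
 * only matters when interpreted as a boolean (in use / not in use).
 */
static int
label_needs_update(uint64_t old_refcount, uint64_t new_refcount)
{
	return ((old_refcount == 0) != (new_refcount == 0));
}
```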

for send stream:
	new feature flag DRR_FLAG_FEATURES = 1<<16
	BEGIN record has nvlist payload
	nvlist has:
		"features" -> { feature name -> unspecified }
		"types" -> { type name -> enum value }
	types are record types; existing ones are reserved.  New types
	should have a corresponding feature, so the presence of an
	unknown type is not an error: records of an unknown type can be
	safely ignored.  So if a new record type can not be safely
	ignored, a corresponding new feature must be added.

all name formats (feature name, algo name, type name):
	<reverse-dns>:<short-name> eg. com.delphix:raidz4

all ALL_CAPS_STRING_DEFINITIONS will be #defined to the lowercase string, eg:
#define FEATURES_FOR_READ "features_for_read"


===============================
BEHAVIOR CHANGES
===============================

zpool upgrade
	zpool upgrade (no arguments)
		If the pool is at SPA_VERSION_FEATURES, but this
		software supports features which are not listed in the
		features_for_* MOS objects, the pool should be listed as
		available to upgrade.  It's recommended that the short
		name of the available features be listed.
	
	zpool upgrade -v
		After the list of static versions, each supported
		feature should be listed.
		
	zpool upgrade -a | <pool>
		The pool or pools will have their features_for_* MOS
		objects updated to list all features supported by this
		software.  Ideally, the value of the newly-added ZAP
		entries will be 0, indicating that the feature is
		enabled but not yet in use.
	
	zpool upgrade -V <version> -a | <pool>
		The <version> may specify a feature, rather than a
		version number, if the version is already at
		SPA_VERSION_FEATURES.  The feature may be specified by
		its short or full name.  The pool or pools will have
		their features_for_* MOS object updated to list the
		specified feature, and any other features required by
		the specified one.
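Enabling "any other features required by the specified one" amounts to walking a dependency relation before adding the requested feature.  A sketch under invented names (the dependency table, feature names, and `enable_feature` are all illustrative, not part of the proposal):

```c
#include <string.h>

/* Illustrative feature definition with at most one dependency. */
typedef struct feature_def {
	const char *fd_name;
	const char *fd_depends;		/* one required feature, or NULL */
} feature_def_t;

static const feature_def_t feature_table[] = {
	{ "com.example:base", NULL },
	{ "com.example:fancy", "com.example:base" },
};

/*
 * Append 'name' to the 'enabled' array, dependency first, and return
 * the new count.  Unknown names are ignored.
 */
static int
enable_feature(const char *name, const char *enabled[], int n)
{
	for (size_t i = 0;
	    i < sizeof (feature_table) / sizeof (feature_table[0]); i++) {
		if (strcmp(feature_table[i].fd_name, name) != 0)
			continue;
		if (feature_table[i].fd_depends != NULL)
			n = enable_feature(feature_table[i].fd_depends,
			    enabled, n);
		enabled[n++] = feature_table[i].fd_name;
		return (n);
	}
	return (n);
}
```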

pool open ("zpool import" and implicit import from zpool.cache)
	If pool is at SPA_VERSION_FEATURES, we must check for feature
	compatibility.  First we will look through entries in the label
	nvlist's features_for_read.  If there is a feature listed there
	which we don't understand, and it has a nonzero value, then we
	can not open the pool.

	Each vendor may decide how much information they want to print
	about the unsupported feature.  It may be a catch-all ("Pool
	could not be opened because it uses an unsupported feature."),
	or it may be the most verbose message ("Pool could not be opened
	because it uses the following unsupported features: <long
	feature name> <feature description> ...").  Or features from
	known vs foreign vendors may be treated differently (eg. print
	this vendor's feature descriptions, but not unknown vendors').
	Note that if a feature in the label is not supported, we can't
	access the feature description, so at best we can print the full
	feature name.

	After checking the label's features_for_read, we know we can
	read the MOS, so we will continue opening it and then check the
	MOS's features_for_read.  Note that we will need to load the
	label's checksum_algos and compression_algos before reading any
	blocks.

	If the pool is being opened for writing, then features_for_write
	must also be checked.  (Note, currently grub and zdb open the
	pool read-only, and the kernel module opens the pool for
	writing.  In the future it would be great to allow the kernel
	module to open the pool read-only.)
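The compatibility check at open time can be sketched as follows.  The structures and function names (`feature_entry_t`, `can_open_pool`) are invented for illustration, not the actual SPA interfaces:

```c
#include <string.h>
#include <stdint.h>

/* Illustrative stand-in for one entry of a features_for_* object. */
typedef struct feature_entry {
	const char *fe_name;	/* full name, eg. "com.delphix:raidz4" */
	uint64_t fe_refcount;	/* nonzero means the feature is in use */
} feature_entry_t;

static int
feature_is_supported(const char *name, const char *supported[], int nsup)
{
	for (int i = 0; i < nsup; i++) {
		if (strcmp(name, supported[i]) == 0)
			return (1);
	}
	return (0);
}

/*
 * Returns 1 if the pool can be opened.  For a read-only open, only
 * features_for_read matters; a writable open must additionally support
 * every in-use feature in features_for_write.  Features with a zero
 * refcount (enabled but not in use) never block an open.
 */
static int
can_open_pool(const feature_entry_t *rd, int nrd,
    const feature_entry_t *wr, int nwr,
    const char *supported[], int nsup, int writable)
{
	for (int i = 0; i < nrd; i++) {
		if (rd[i].fe_refcount != 0 &&
		    !feature_is_supported(rd[i].fe_name, supported, nsup))
			return (0);
	}
	if (writable) {
		for (int i = 0; i < nwr; i++) {
			if (wr[i].fe_refcount != 0 &&
			    !feature_is_supported(wr[i].fe_name, supported, nsup))
				return (0);
		}
	}
	return (1);
}
```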

zfs upgrade
	Treat this similarly to zpool upgrade, using the filesystem's
	MASTER_NODE's features_for_* objects.

filesystem mount
	Treat this similarly to pool open, using the filesystem's
	MASTER_NODE's features_for_* objects.

zfs receive
	If any unknown features are in the stream's BEGIN record's
	nvlist's "features" entry, then the stream can not be received.
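The receive-side feature gate might look like the following sketch.  `stream_features_ok` and the feature names in the test are illustrative, not the actual zfs receive code; the key point is that the check runs before any records are processed, which is what makes skipping unknown record types safe later:

```c
#include <string.h>

/*
 * Return 1 if every feature named in the BEGIN record's "features"
 * nvlist is supported by this software; 0 means the stream can not be
 * received.
 */
static int
stream_features_ok(const char *stream_features[], int nfeatures,
    const char *supported[], int nsup)
{
	for (int i = 0; i < nfeatures; i++) {
		int known = 0;
		for (int j = 0; j < nsup; j++) {
			if (strcmp(stream_features[i], supported[j]) == 0)
				known = 1;
		}
		if (!known)
			return (0);
	}
	return (1);
}
```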

===============================
IMPLEMENTATION NOTES
===============================

	Legacy checksum algorithms are to be stored as follows:
	"com.sun:label" -> 3
	"com.sun:gang_header" -> 4
	"com.sun:zilog" -> 5
	"com.sun:fletcher2" -> 6
	"com.sun:fletcher4" -> 7
	"com.sun:sha256" -> 8
	"com.sun:zilog2" -> 9

	Legacy compression algorithms are to be stored as follows:
	"com.sun:lzjb" -> 3
	"com.sun:empty" -> 4
	"com.sun:gzip-1" -> 5
	"com.sun:gzip-2" -> 6
	"com.sun:gzip-3" -> 7
	"com.sun:gzip-4" -> 8
	"com.sun:gzip-5" -> 9
	"com.sun:gzip-6" -> 10
	"com.sun:gzip-7" -> 11
	"com.sun:gzip-8" -> 12
	"com.sun:gzip-9" -> 13
	"com.sun:zle" -> 14

	Legacy send record types are to be stored as follows:
	"com.sun:begin" -> 0
	"com.sun:object" -> 1
	"com.sun:freeobjects" -> 2
	"com.sun:write" -> 3
	"com.sun:free" -> 4
	"com.sun:end" -> 5
	"com.sun:write_byref" -> 6
	"com.sun:spill" -> 7

	The indirection tables for checksum algorithm, compression
	algorithm, and send stream record type can be implemented as
	follows:

	enum zio_checksum {
		ZIO_CHECKSUM_INHERIT = 0,
		ZIO_CHECKSUM_ON,
		ZIO_CHECKSUM_OFF,
		ZIO_CHECKSUM_LABEL,
		ZIO_CHECKSUM_GANG_HEADER,
		ZIO_CHECKSUM_ZILOG,
		ZIO_CHECKSUM_FLETCHER_2,
		ZIO_CHECKSUM_FLETCHER_4,
		ZIO_CHECKSUM_SHA256,
		ZIO_CHECKSUM_ZILOG2,
		...
		ZIO_CHECKSUM_FUNCTIONS
	};

	const char *zio_checksum_names[] = {
		/* Order must match enum zio_checksum! */
		"inherit",
		"on",
		"off",
		"com.sun:label",
		"com.sun:gang_header",
		"com.sun:zilog",
		"com.sun:fletcher2",
		"com.sun:fletcher4",
		"com.sun:sha256",
		"com.sun:zilog2",
		...
	};

	/*
	 * inherit, on, and off are not stored on disk, so
	 * pre-initialize them here.  Note that 8 bits are used for the
	 * checksum algorithm in the blkptr_t, so there are 256 possible
	 * values.
	 */
	uint8_t checksum_to_index[ZIO_CHECKSUM_FUNCTIONS] = {0, 1, 2};
	enum zio_checksum index_to_checksum[256] = {0, 1, 2};

	void add_checksum_algo(const char *algo_name, uint8_t value) {
		enum zio_checksum i;
		for (i = 0; i < ZIO_CHECKSUM_FUNCTIONS; i++) {
			if (strcmp(algo_name, zio_checksum_names[i]) == 0) {
				checksum_to_index[i] = value;
				index_to_checksum[value] = i;
				return;
			}
		}
		/* Ignore any unknown algorithms. */
	}

	#define BP_GET_CHECKSUM(bp) index_to_checksum[BF64_GET(...)]
	#define BP_SET_CHECKSUM(bp, x) BF64_SET(..., checksum_to_index[x])



_______________________________________________
zfs-discuss mailing list
zfs-discuss at opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


----- End forwarded message -----

