Re: zfs support in makefs

From: Allan Jude <allanjude@freebsd.org>
Date: Wed, 18 May 2022 22:31:36 UTC
On 5/18/2022 3:03 PM, Mark Johnston wrote:
> Hi,
> 
> For the past little while I've been working on ZFS support in makefs(8).
> At this point I'm able to create a bootable FreeBSD VM image, using the
> standard FreeBSD ZFS layout, and run through the regression test suite
> in bhyve.  I've also been able to create and boot an EC2 AMI.
> 
> Some background is below for anyone interested, and I would greatly
> appreciate feedback on the interface, described further below.
> 
> The initial diff is here: https://reviews.freebsd.org/D35248
> 
> Comments here or in the review are welcome.
> 
> === Background ===
> 
> The goal is to enable creation of ZFS-based VM images, in particular by
> release(7).  Currently one can implement this by creating a pool on a
> file-backed memory disk and populating it with "make installworld", but
> this has a few drawbacks:
> 
> 1. The resulting images are not reproducible.  That is, if one creates
>     two ZFS images with identical contents, the images themselves will not
>     be byte-identical.  For instance, each pool gets a randomly generated
>     GUID, as does each vdev, and there are other sources of
>     non-determinism besides.
> 2. Creating a zpool requires root privileges by default and can't be done
>     at all in a jail.
> 3. Populating the image is a resource-intensive operation: the kernel
>     will cache the output files until the pool is exported, etc.
> 
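> For reference, the manual approach looks roughly like this (sizes and
> paths are illustrative):
> 
> $ truncate -s 20g zfs.img
> $ mdconfig -a -t vnode -f zfs.img
> md0
> $ zpool create -R /mnt test md0
> $ make -C /usr/src installworld DESTDIR=/mnt/test
> $ zpool export test
> $ mdconfig -d -u md0
> 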
> For UFS images we use makefs to solve these problems, so I wanted to try
> and take the same approach for ZFS.  I assume that the appeal of using
> ZFS as the root filesystem for VMs is obvious.
> 
> I initially implemented ZFS support in makefs using libzpool.so, which
> is effectively a copy of the OpenZFS kernel code compiled for userspace.
> It is mostly used for testing and debugging.  This worked and was
> relatively simple to implement, but it only solved problem 2.  Bending
> libzpool to satisfy my requirements seemed difficult, and the result
> would require continuous maintenance as OpenZFS evolves and its internal
> interfaces change.  I spent some time hacking libzpool to limit its
> memory and CPU usage and gave up; while it was functional, the result
> was painfully slow.
> 
> I then looked at the bits used by the loader to load files off of a boot
> volume, and implemented the creation of ZFS images from scratch, i.e.,
> without reusing OpenZFS code.  This required more effort but I believe
> it'll be easier to maintain in the long run, and it solves all three
> problems above.
> 
> The implementation is mostly derived from an old ZFS on-disk format
> specification (http://www.giis.co.in/Zfs_ondiskformat.pdf), various blog
> posts, and lots of time spent staring at zdb output.  I reused some code
> from the boot loader: the nvlist implementation, since the one in
> sys/contrib doesn't have some required features, and zfsimpl.h, which
> contains C structs describing various on-disk data structures.
> 
> ZFS in general is pretty complex so this effort required some
> specialization to the problem at hand.  In particular, makefs
> - always creates a pool with a single disk vdev, with all data written
>    in a single transaction group; there are no snapshots, no
>    RAID-Z/dRAID, no redundant block copies, no ZIL, no encryption, no
>    gang blocks, no zvols, etc.,
> - does not implement compression,
> - doesn't preserve holes in files,
> - always creates pools at version 5000, i.e., all feature flags are off
>    and have to be enabled separately,
> - does not try to do any clever metaslab placement or sizing, on the
>    basis that the pool will likely be expanded upon first boot anyway,
> - doesn't use spill blocks and is not particularly clever when it comes
>    to choosing block sizes, creating some avoidable internal
>    fragmentation (though it doesn't seem too bad relative to OpenZFS
>    without compression, maybe 10% overhead in some unscientific tests).
> 
> Some of these can be addressed (especially compression and sparse file
> support), but I wanted to get some feedback before spending more time on
> this.  Really this thing is just intended to do the minimum necessary to
> provide ZFS-based VM images.
> 
> === Interface ===
> 
> Creating a pool with a single dataset is easy:
> 
> $ makefs -t zfs -s 10g -o poolname=test ./zfs.img /path/to/input
> 
> Upon importing such a pool, you'll get a dataset named "test" mounted at
> /test containing everything under /path/to/input.
> 
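> For instance, one way to attach and inspect such an image (md unit and
> paths illustrative):
> 
> $ mdconfig -a -t vnode -f zfs.img
> md0
> $ zpool import -R /mnt test
> $ ls /mnt/test
> 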
> It's possible to set properties on the root dataset:
> 
> $ makefs -t zfs -s 10g -o poolname=test -o fs=test:setuid=off:atime=on ./zfs.img /path/to/input
> 
> It's also possible to create additional datasets:
> 
> $ makefs -t zfs -s 10g -o poolname=test -o fs=test/ds1:mountpoint=/test/dir1 ./zfs.img /path/to/input
> 
> The parameter syntax is
> "-o fs=<dataset name>[:<prop1>=<val1>[:<prop2>=<val2>[:...]]]".  Only a
> few properties are supported, at least for now.
> 
> Dataset mountpoints behave the same as they would if created with the
> standard ZFS tools.  So by default the root dataset's mountpoint is
> /test, test/ds1's mountpoint is /test/ds1, etc.  If a dataset overrides
> its default mountpoint, its children inherit that mountpoint.
> 
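> As a hypothetical illustration of that rule, given
> 
> -o fs=test/ds1:mountpoint=/test/data -o fs=test/ds1/child
> 
> test/ds1 is mounted at /test/data, and test/ds1/child inherits the
> override, ending up at /test/data/child rather than /test/ds1/child.
> 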
> makefs builds the output filesystem using a single input directory tree.
> Thus, makefs -t zfs requires that at least one of the datasets'
> mountpoints map to /path/to/input; that is, there must be a "root" mount
> point.
> 
> The -o rootpath parameter defines this root mount point.  By default it's
> "/<poolname>".  All datasets in the pool must have their mountpoints
> under this path, and one dataset's mountpoint must be equal to this
> path.  To build bootable images, one sets -o rootpath=/.
> 
> Putting it all together, one can build an image using the standard layout
> with an invocation like this:
> 
> makefs -t zfs -o poolname=zroot -s 20g -o rootpath=/ -o bootfs=zroot/ROOT/default \
>      -o fs=zroot:canmount=off:mountpoint=none \
>      -o fs=zroot/ROOT:mountpoint=none \
>      -o fs=zroot/ROOT/default:mountpoint=/ \
>      -o fs=zroot/tmp:mountpoint=/tmp:exec=on:setuid=off \
>      -o fs=zroot/usr:mountpoint=/usr:canmount=off \
>      -o fs=zroot/usr/home \
>      -o fs=zroot/usr/ports:setuid=off \
>      -o fs=zroot/usr/src \
>      -o fs=zroot/usr/obj \
>      -o fs=zroot/var:mountpoint=/var:canmount=off \
>      -o fs=zroot/var/audit:setuid=off:exec=off \
>      -o fs=zroot/var/crash:setuid=off:exec=off \
>      -o fs=zroot/var/log:setuid=off:exec=off \
>      -o fs=zroot/var/mail:atime=on \
>      -o fs=zroot/var/tmp:setuid=off \
>      ${HOME}/tmp/zfs.img ${HOME}/tmp/world
> 
> I'll admit this is somewhat clunky, but it doesn't seem worse than what
> we have to do otherwise; see poudriere-image for example:
> https://github.com/freebsd/poudriere/blob/master/src/share/poudriere/image_zfs.sh#L79
> 
> What do folks think of this interface?  Is there anything missing, or
> anything that doesn't make sense?
> 

Really nice work on this. It sounds like you've covered everything.

The only thing that jumps out at me is that, precisely because the image 
is designed to be resized upwards, we may want to look at being a bit 
clever with the metaslabs. Normally ZFS creates ~200 of them per vdev, 
so a 10g image will get really small metaslabs (10g / 200 is only about 
50m). All metaslabs in a vdev are the same size, so if the pool is later 
grown to a few TB, it will end up with tens of thousands of tiny 
metaslabs. It might make more sense to set a floor on the metaslab size, 
or to offer the user some control when creating the pool. There is a 
sysctl that controls this in the normal code paths 
(vfs.zfs.vdev.default_ms_count, I believe); maybe we can easily allow 
setting that the way zdb does with its -o flag.
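
A hypothetical knob for this could look something like the following 
(the mssize option name is invented here, purely for illustration):

$ makefs -t zfs -s 10g -o poolname=test -o mssize=512m ./zfs.img /path/to/input

With a 512m floor, that 10g image grown to 4t later would have ~8000 
metaslabs, instead of the ~80000 implied by the default ~50m ones.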

-- 
Allan Jude