[PATCH] Syncer rewrite

Thu Apr 15 10:10:24 UTC 2010

With fundamental help from Giovanni Trematerra and Peter Holm, I
rewrote the syncer, following plans and discussions that took place
over the last 2 years and that started with Jeff's effort during BSDCan
2008 (for a more complete reference, see:
http://people.freebsd.org/~jeff/bsdcanbuf.pdf ).

To summarize, the current syncer suffers from the following problems:
- Poor scalability: a single thread has to serve all the mounted
filesystems
- Poor flexibility: the current syncer is only used to sync dirty
buffers to disk and nothing else, so it caters only to buffer-cache
based filesystems
- Complex design: in order to DTRT, the syncer needs the help of a
syncer vnode and introduces some complex locking patterns.
Additionally, as a partial mitigation, a separate queue for the
!MPSAFE filesystems might be added
- Poor performance: this is actually more FS specific than anything.
UFS (but I'm not sure it is the only one), after having synced the
dirty vnodes, does a VFS_SYNC() that actually re-syncs all the
referenced vnodes. That means dirty vnodes will be synced twice in
the same timeframe.

The rewrite aims to address all these problems.
The main idea is to offer a simple and opaque layer that interacts
directly with the VFS and that any filesystem may override in order to
offer its own implementation of the syncer. Right now, the layer
lives within the VFS_* methods and the mount structure. More precisely,
it offers 5 virtual functions (VFS_SYNCER_INIT, VFS_SYNCER_DESTROY,
VFS_SYNCER_ATTACH, VFS_SYNCER_DETACH, VFS_SYNCER_SPEEDUP) and an
opaque, private pointer for storing syncer-specific data.
This means that, for a given filesystem, the syncer design is no
longer tied to the specific thread/process model used now.
Also, this design may be easily extended in order to support more
features, if needed.
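
To make the shape of the layer more concrete, here is a minimal
userland sketch of five methods plus an opaque per-mount pointer. It is
not taken from the patch: the struct and field names (syncer_ops,
mnt_syncer_ops, mnt_syncer_data, toy_syncer) and the argument lists of
the VFS_SYNCER_*() macros are assumptions made only for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

struct mount;

/* Hypothetical ops table: one slot per VFS_SYNCER_* method. */
struct syncer_ops {
	int	(*so_init)(struct mount *);
	void	(*so_destroy)(struct mount *);
	int	(*so_attach)(struct mount *);
	void	(*so_detach)(struct mount *);
	int	(*so_speedup)(struct mount *);
};

/* Toy stand-in for the mount structure (field names are assumptions). */
struct mount {
	const struct syncer_ops	*mnt_syncer_ops;
	void			*mnt_syncer_data;	/* opaque to the VFS */
};

/* Dispatch macros; the argument lists are assumed, not the patch's. */
#define	VFS_SYNCER_INIT(mp)	((mp)->mnt_syncer_ops->so_init(mp))
#define	VFS_SYNCER_DESTROY(mp)	((mp)->mnt_syncer_ops->so_destroy(mp))
#define	VFS_SYNCER_ATTACH(mp)	((mp)->mnt_syncer_ops->so_attach(mp))
#define	VFS_SYNCER_DETACH(mp)	((mp)->mnt_syncer_ops->so_detach(mp))
#define	VFS_SYNCER_SPEEDUP(mp)	((mp)->mnt_syncer_ops->so_speedup(mp))

/* Private state of one toy syncer; the VFS never looks inside. */
struct toy_syncer {
	int	ts_running;
	int	ts_speedups;
};

static int
toy_init(struct mount *mp)
{
	struct toy_syncer *ts = calloc(1, sizeof(*ts));

	if (ts == NULL)
		return (ENOMEM);
	mp->mnt_syncer_data = ts;
	return (0);
}

static void
toy_destroy(struct mount *mp)
{
	free(mp->mnt_syncer_data);
	mp->mnt_syncer_data = NULL;
}

static int
toy_attach(struct mount *mp)
{
	struct toy_syncer *ts = mp->mnt_syncer_data;

	assert(ts != NULL);	/* init must have run first */
	ts->ts_running = 1;
	return (0);
}

static void
toy_detach(struct mount *mp)
{
	struct toy_syncer *ts = mp->mnt_syncer_data;

	ts->ts_running = 0;
}

/* Returns 1 if the request was recorded, 0 if the syncer is stopped. */
static int
toy_speedup(struct mount *mp)
{
	struct toy_syncer *ts = mp->mnt_syncer_data;

	if (!ts->ts_running)
		return (0);
	ts->ts_speedups++;
	return (1);
}

static const struct syncer_ops toy_syncer_ops = {
	toy_init, toy_destroy, toy_attach, toy_detach, toy_speedup
};
```

The point of the sketch is only that callers go through the macros and
never touch the private state, so a filesystem is free to back these
methods with a thread, a taskqueue, or nothing at all.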

The syncer as we have it now becomes the 'standard one' but
switches to a different model. It becomes per-mount and thus gets
rid of the syncer vnode. This also greatly simplifies the
locking within the syncer because each thread is now responsible only
for its own mount.
Filesystems specify their own syncer in the vfsops or they receive, by
default, the buffer cache "standard" syncer. Current filesystems not
using the buffer cache, however, may use the VFS_EOPNOTSUPP
specification in order to avoid defining a filesystem syncer
altogether.
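
The three choices a filesystem has (own syncer, default syncer, no
syncer) can be sketched in userland as follows. This is a toy model
with a single method slot instead of five; syncer_fill_defaults(),
std_syncer_attach(), TOY_VFS_EOPNOTSUPP and the struct layouts are
hypothetical names, and the patch may well resolve the defaults
differently.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct mount {
	int	mnt_dummy;	/* placeholder, unused by the sketch */
};

typedef int syncer_attach_t(struct mount *);

/* Toy vfsops with one syncer slot; the real table would have five. */
struct vfsops {
	syncer_attach_t	*vfs_syncer_attach;
};

/* Stand-in for the buffer cache "standard" syncer. */
static int
std_syncer_attach(struct mount *mp)
{
	(void)mp;
	return (0);
}

/* Stand-in for opting out: the method exists but does nothing. */
static int
toy_eopnotsupp(struct mount *mp)
{
	(void)mp;
	return (EOPNOTSUPP);
}
#define	TOY_VFS_EOPNOTSUPP	toy_eopnotsupp

/* A filesystem-specific syncer, returning a distinct value for the demo. */
static int
custom_syncer_attach(struct mount *mp)
{
	(void)mp;
	return (42);
}

/* Any slot left NULL by the filesystem gets the standard syncer. */
static void
syncer_fill_defaults(struct vfsops *ops)
{
	assert(ops != NULL);
	if (ops->vfs_syncer_attach == NULL)
		ops->vfs_syncer_attach = std_syncer_attach;
}
```

A buffer-cache filesystem would leave the slot unset, a filesystem with
its own flushing model would fill it in, and a filesystem with nothing
to sync would store the not-supported stub.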

The patch has been tested intensively by trema and pho on many
different workloads and filesystems:
http://www.freebsd.org/~attilio/syncer_beta_0.diff

Sparse notes:
- The performance problem, even if the patch doesn't currently
address it, may now be easily solved by skipping the sync in
ffs_fsync() for the MNT_LAZY case and having ffs_sync() take care of
it.
- The standard syncer may be further improved by getting rid of the
bufobj. It should actually handle a list of vnodes rather than a list
of bufobjs. However, such optimizations may be done after the patch
is ready to enter the tree.
- The mount interlock now protects the bo_flag & BO_ONWORKLST and the
synclist iterator, thus there is no need to hold the bufobj lock when
accessing them. However, the specifics for checking whether a bufobj
is dirty are still protected by the bufobj lock, thus the insertion
path still needs it too.
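
As an illustration of the first note, the toy model below counts how
many times each vnode is written during one syncer pass, with and
without the MNT_LAZY shortcut. Nothing here is FFS code: toy_fsync()
and toy_sync() only stand in for ffs_fsync() and ffs_sync(), the
skip_lazy switch is hypothetical, and toy_sync() flushing every vnode
it walks is an assumption that mirrors the double-sync behaviour
described earlier.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative values only; the sketch does not depend on them. */
enum { TOY_MNT_WAIT = 1, TOY_MNT_LAZY = 3 };

struct toy_vnode {
	int	dirty;
	int	writes;		/* how many times it was flushed */
};

static void
toy_flush(struct toy_vnode *vp)
{
	vp->writes++;
	vp->dirty = 0;
}

/* Per-vnode path, standing in for ffs_fsync(). */
static void
toy_fsync(struct toy_vnode *vp, int waitfor, int skip_lazy)
{
	if (skip_lazy && waitfor == TOY_MNT_LAZY)
		return;		/* leave the work to toy_sync() */
	if (vp->dirty)
		toy_flush(vp);
}

/* Per-mount path, standing in for ffs_sync(): flushes every vnode. */
static void
toy_sync(struct toy_vnode *vps, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		toy_flush(&vps[i]);
}

/* One syncer pass; returns the total number of writes issued. */
static int
toy_syncer_pass(struct toy_vnode *vps, size_t n, int skip_lazy)
{
	size_t i;
	int total = 0;

	assert(vps != NULL || n == 0);
	for (i = 0; i < n; i++)
		if (vps[i].dirty)
			toy_fsync(&vps[i], TOY_MNT_LAZY, skip_lazy);
	toy_sync(vps, n);
	for (i = 0; i < n; i++)
		total += vps[i].writes;
	return (total);
}
```

With three dirty vnodes, the pass issues six writes when the per-vnode
path also flushes and three when it defers to the per-mount path.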

The things I would most like to receive comments on are mostly linked
to the default syncer:
- I didn't use any form of thread consolidation for the threads
automatically assigned by the default syncer. We may have different
opinions and good arguments on it.
- We might want to think about the !SMP case. Maybe
we don't want the multi-thread approach for that case? Should we
revert to the current approach for !SMP?
- Right now VFS_SYNCER_INIT() and VFS_SYNCER_DESTROY() are used
not only for flexibility but also out of necessity by the default
syncer. Some consumers may want to fill in the workitem queues before
the syncer starts (VFS_SYNCER_ATTACH()) and you may not want to
lose such queued vnodes. This approach is good and also makes it
possible to support mount state updates simply, without losing
information, but it has the disadvantage of allocating structures for
filesystems that may forever be RO.
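
The split between INIT/DESTROY and ATTACH/DETACH in the last point can
be pictured with the following userland sketch. It assumes a
fixed-size queue and hypothetical toy_syncer_*() helpers, none of which
come from the patch; it only shows that items queued before attach, or
between a detach and a later attach, are kept because the queue lives
from init to destroy.

```c
#include <assert.h>
#include <stdlib.h>

#define	TOY_QUEUE_MAX	16

struct toy_syncer {
	int	items[TOY_QUEUE_MAX];	/* queued work item ids */
	int	nitems;
	int	running;		/* set between attach and detach */
};

/* INIT: allocate the queue, even if the mount never goes RW. */
static struct toy_syncer *
toy_syncer_init(void)
{
	return (calloc(1, sizeof(struct toy_syncer)));
}

/* DESTROY: only here is the queue (and anything left in it) dropped. */
static void
toy_syncer_destroy(struct toy_syncer *ts)
{
	free(ts);
}

/* Queueing works any time after init, attached or not. */
static int
toy_syncer_enqueue(struct toy_syncer *ts, int id)
{
	if (ts->nitems == TOY_QUEUE_MAX)
		return (-1);
	ts->items[ts->nitems++] = id;
	return (0);
}

static void
toy_syncer_attach(struct toy_syncer *ts)
{
	ts->running = 1;
}

static void
toy_syncer_detach(struct toy_syncer *ts)
{
	ts->running = 0;
}

/* One pass: returns how many items were processed (0 if stopped). */
static int
toy_syncer_run(struct toy_syncer *ts)
{
	int n;

	if (!ts->running)
		return (0);
	n = ts->nitems;
	ts->nitems = 0;
	return (n);
}
```

The cost mentioned above is visible too: toy_syncer_init() allocates
the queue whether or not the syncer is ever attached.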

More testing, reviews and comments are undoubtedly welcome at this point.

Thanks,
Attilio

-- 
Peace can only be achieved by understanding - A. Einstein