Per-mount syncer threads and fanout for pagedaemon cleaning
truckman at FreeBSD.org
Tue Dec 27 07:11:46 UTC 2011
On 26 Dec, Venkatesh Srinivas wrote:
> I've been playing with two things in DragonFly that might be of interest here.
> Thing #1 :=
> First, per-mountpoint syncer threads. Currently there is a single thread,
> 'syncer', which periodically calls fsync() on dirty vnodes from every mount,
> along with calling vfs_sync() on each filesystem itself (via syncer vnodes).
> My patch modifies this to create syncer threads for mounts that request it.
> For these mounts, vnodes are synced from their mount-specific thread rather
> than the global syncer.
> The idea is that periodic fsync/sync operations from one filesystem should not
> stall or delay synchronization for other ones.
> The patch was fairly simple:
> There's certainly more that could be done in this direction -- the current patch
> does preserve a global syncer ('syncer0') for unflagged filesystems and for
> running the rushjobs logic from speedup_syncer. And the current patch preserves
> the notion of syncer vnodes, which are entirely overkill when there are
> per-mount sync threads. But it's a start, and something very similar could
> apply to FreeBSD.
I used to think that something like this was a good idea, but the first
issue that I thought of was that this could cause excessive seeking if
multiple threads were attempting to sync vnodes for multiple partitions
on the same physical device at the same time.
What might be better is one thread per physical device, possibly doing a
simple elevator sort based on partition number and inode number on the
items in each worklist bucket. It might still be possible to retain the
advantages of this with a one thread per mount point implementation by
adding interlocks (or even just start time offsets) so that they don't
all try to run at once and fight over the head actuator. Implementing
one thread per mount point does have the advantage of making it easy to
observe which mount points are "busy".
The next complication is that all of the different ways that we have to
slice, dice, and combine storage devices (various forms of RAID, ZFS
pools, etc.) make the concept of a device a lot more complicated. How
do we optimize? Should we even try?
One of the things that you didn't mention about syncer vnodes is all of
the nastiness that goes on inside ffs_sync() and friends every time the
syncer gets to the syncer vnode. That causes a big burst of I/O each pass.
> Thing #2 :=
> Currently when pagedaemon decides to launder a dirty page, it initiates I/O
> for the launder from within its own thread context. While the I/O is generally
> asynchronous, the call path to get there from pagedaemon is deep and fraught
> with stall points: (for vnode_pager, possible stalls annotated)
> pagedaemon scans ->
> vm_pageout_clean -> [block on vm_object locks,
> page busy]
> vm_pageout_flush ->
> vnode_pager_putpages ->
> vnode_generic_putpages ->
> <fs>_write -> [block on FS locks]
> b(,a,d)write -> [wait on runningbufspace]
> <fs>_strategy ->
> Oh my...
> While any part of this path is stalled, pagedaemon is not continuing to do its
> job; this could be a problem -- so long as it is not laundering pages, we are
> not resolving any page shortages.
> Given Thing #1, we have per-mountpoint service threads; I think it'd be worth
> pushing out the deeper parts of this callpath into those threads. The idea is
> that pagedaemon would select and cluster pages as it does now, but would use
> the syncer threads to walk through the pager and FS layer. An added benefit
> of using the syncer threads is that contention between fsync/vfs_sync on an
> FS and pageout on that same FS would be excluded. The pagedaemon would not
> wait for the I/O to initiate before continuing to scan more candidates.
> I've not found an ideal place to break up this callchain, but either between
> vm_pageout_clean / vm_pageout_flush, or at the entry to the vnode_pager would
> be good places. In experiments, I've sent the vm_pageout_flush calls off to
> a convenient taskqueue, seems to work okay. But sending them to per-mount
> threads would be better.
The current implementation definitely has the flaws that you mention. I
remember system deadlocks in years past. Your idea for a fix looks
promising.