missed clustering for small block sizes in cluster_wbuild()

Bruce Evans brde at optusnet.com.au
Tue Jun 11 21:25:34 UTC 2013


On Tue, 11 Jun 2013, Konstantin Belousov wrote:

> On Fri, Jun 07, 2013 at 05:28:11AM +1000, Bruce Evans wrote:
>> I think this is best fixed by removing the check above and checking
>> here.  Then back out the changes.  I don't know this code well enough
>> to write the backing out easily.
>
> Could you test this, please ?

It works in limited testing.

I got a panic from not adapting for changed locking when merging it to
my version, and debugging this showed a problem.

> diff --git a/sys/kern/vfs_cluster.c b/sys/kern/vfs_cluster.c
> index b280317..9e1528e 100644
> --- a/sys/kern/vfs_cluster.c
> +++ b/sys/kern/vfs_cluster.c
> ...
> @@ -904,14 +904,10 @@ cluster_wbuild(struct vnode *vp, long size, daddr_t start_lbn, int len,
>
> 				/*
> 				 * Check that the combined cluster
> -				 * would make sense with regard to pages
> -				 * and would not be too large
> +				 * would make sense with regard to pages.
> 				 */

The comment needs more changes.  There is no check "with regard to
pages" now.  The old comment was poorly worded.  The code never made
an extra check that the cluster "would make sense with regard to pages"
here (that check was always later).  What it did was use the page count
in the largeness check.

> -				if ((tbp->b_bcount != size) ||
> -				  ((bp->b_blkno + (dbsize * i)) !=
> -				    tbp->b_blkno) ||
> -				  ((tbp->b_npages + bp->b_npages) >
> -				    (vp->v_mount->mnt_iosize_max / PAGE_SIZE))) {
> +				if (tbp->b_bcount != size ||
> +				    bp->b_blkno + dbsize * i != tbp->b_blkno) {
> 					BUF_UNLOCK(tbp);
> 					break;
> 				}

Contrary to what I said before, the caller doesn't always limit the
cluster size.  Only the cluster_write() caller does that.  The
vfs_bio_awrite() caller doesn't.  Now it is fairly common to allocate
1 page too many and have to back it out.  This happens even when
everything is aligned.  I observed the following:
- there were a lot of contiguous dirty buffers, and this loop happily built
   up a cluster with 17 pages, though mnt_iosize_max was only 16 pages.
   Perhaps the extra page is necessary if the part of the buffer to be
   written starts at a nonzero offset, but there was no offset in the case
   that I observed (can there be one, and if so, is it limited to an offset
   within the first page?).  The general case needs 16 extra 4K pages to
   write a 64K-block (when the offset of the area to be written is 64K-512).
- ...

> @@ -964,6 +960,22 @@ cluster_wbuild(struct vnode *vp, long size, daddr_t start_lbn, int len,
> 						bp->b_pages[bp->b_npages] = m;
> 						bp->b_npages++;
> 					}
> +					if (bp->b_npages > vp->v_mount->
> +					    mnt_iosize_max / PAGE_SIZE) {

- ...Then this detects that the 17th page is 1 too many and cleans up.

> +						KASSERT(i != 0, ("XXX"));
> +						j++;
> +						for (jj = 0; jj < j; jj++) {
> +							vm_page_io_finish(tbp->
> +							    b_pages[jj]);
> +						}
> +						vm_object_pip_subtract(
> +						    m->object, j);
> +						bqrelse(tbp);
> +						bp->b_npages -= j;
> +						VM_OBJECT_WUNLOCK(tbp->
> +						    b_bufobj->bo_object);
> +						goto finishcluster;
> +					}
> 				}
> 				VM_OBJECT_WUNLOCK(tbp->b_bufobj->bo_object);
> 			}

I think it would work and fix other bugs to check (tbp->b_bcount +
bp->b_bcount <= vp->v_mount->mnt_iosize_max) up front.  Drivers should
be able to handle an i/o size of b_bcount however many pages that
takes.  There must be a limit on b_pages, but it seems to be
non-critical, and the limit on b_bcount implies a page limit somewhere
around (mnt_iosize_max / PAGE_SIZE), rounded in some way and possibly
increased by 1 or doubled to account for offsets.  If mnt_iosize_max
is not a multiple of PAGE_SIZE, then the limit using pages doesn't
even allow covering mnt_iosize_max, since the rounding down discards
a nonzero remainder.
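
To make the rounding and offset arithmetic concrete, here is a
throwaway user-space sketch (the iosize_max value, the offset and the
4K PAGE_SIZE are assumptions for illustration only):

#include <stdio.h>

#define PAGE_SIZE	4096
#define howmany(x, y)	(((x) + ((y) - 1)) / (y))

int
main(void)
{
	long iosize_max = 64 * 1024 + 512;	/* not a multiple of PAGE_SIZE */
	long bcount = 64 * 1024;		/* transfer limited by b_bcount */
	long off = PAGE_SIZE - 512;		/* nonzero offset into the first page */

	/* The page-based limit rounds down, so it cannot cover iosize_max. */
	printf("page limit %ld covers only %ld of %ld bytes\n",
	    iosize_max / PAGE_SIZE, iosize_max / PAGE_SIZE * PAGE_SIZE,
	    iosize_max);

	/* A misaligned b_bcount-sized transfer needs 1 extra page. */
	printf("%ld bytes at page offset %ld span %ld pages (aligned: %ld)\n",
	    bcount, off, howmany(off + bcount, PAGE_SIZE),
	    howmany(bcount, PAGE_SIZE));
	return (0);
}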

I found this bug while debugging a recent PR about bad performance and
hangs under write pressure.  I only have 1 other clearly correct fix
for the bad performance.  msdosfs is missing read clustering for
read-before-write.  I didn't notice that this was necessary when I
implemented clustering for msdosfs a few years ago.  I thought that the
following patch was a complete fix, but have found more performance
problems in clustering:

@ diff -u2 msdosfs_vnops.c~ msdosfs_vnops.c
@ --- msdosfs_vnops.c~	Thu Feb  5 19:11:37 2004
@ +++ msdosfs_vnops.c	Wed Jun 12 04:01:19 2013
@ @@ -740,5 +756,19 @@
@  			 * The block we need to write into exists, so read it in.
@  			 */
@ -			error = bread(thisvp, bn, pmp->pm_bpcluster, cred, &bp);
@ +			if ((ioflag >> IO_SEQSHIFT) != 0 &&

This was cloned from the ffs version.  All file systems should call a
common function instead of duplicating the cluster_read/bread decision.
Similarly for write clustering, except there are more decisions.  But
ffs and ext2fs do this in UFS_BALLOC() and ext2fs_balloc() (?) where
not all the info that might be needed is available.
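
Something like the following is what I mean by a common function.  It
is only a sketch: the name and argument list are made up, and the two
calls just mirror the cluster_read()/bread() calls in the ffs code and
in the patch quoted below.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/buf.h>
#include <sys/mount.h>
#include <sys/ucred.h>
#include <sys/vnode.h>

/*
 * Hypothetical shared cluster_read()-vs-bread() decision, so that each
 * file system doesn't duplicate it.  Sketch only.
 */
static int
fs_bread_clustered(struct vnode *vp, u_quad_t filesize, daddr_t lbn,
    long size, long totread, int seqcount, struct buf **bpp)
{

	if (seqcount != 0 && (vp->v_mount->mnt_flag & MNT_NOCLUSTERR) == 0)
		return (cluster_read(vp, filesize, lbn, size, NOCRED,
		    totread, seqcount, bpp));
	return (bread(vp, lbn, size, NOCRED, bpp));
}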

I repeated the (ioflag >> IO_SEQSHIFT) calculation instead of copying to
a variable like ffs does, to localise this patch and to avoid copying
ffs's mounds of style bugs in the declaration of the variable.

@ +			    (vp->v_mount->mnt_flag & MNT_NOCLUSTERR) == 0) {
@ +				error = cluster_read(vp, dep->de_FileSize, bn,
@ +				    pmp->pm_bpcluster, NOCRED,
@ +#if 0
@ +				    (uio->uio_offset & pmp->pm_crbomask) +
@ +				    uio->uio_resid,

This part was copied from msdosfs_read().  msdosfs_read() uses the uio of
some reader.  Here the uio for read-before-write is for the writer.  It
isn't clear that either should be used here.  UFS_BALLOC() is not passed
the full uio info needed to make this calculation, and it uses the fixed
size MAXBSIZE.  That is wrong in a different way.

This parameter is used to reduce latency.  It asks for a small cluster
of the specified size followed by read ahead of full clusters, with the
amount of read ahead controlled by vfs.read_max.  This gives clusters
of non-uniform sizes and offsets, especially when the reader uses small
blocks.  scottl recently added vfs.read_min which can be tuned to prevent
this.

I don't like many things in this area.  vfs.read_min works OK, but is
another hack.  The default should probably be to optimize for throughput
instead of latency (the reverse of the current one, but the current one
is historical so it shouldn't be changed).  The units of vfs.read_max
and vfs.read_min are fs block sizes.  This is quite broken when the
cluster size is varied and sometimes small.  E.g., with the old default
read_max of 8 and a block size of 512, the read-ahead is limited to a
whole 4K.  The current default is 64, which is still too small with
small block sizes.  But if you increase vfs.read_max a lot, then the
read-ahead becomes almost infinite when the block size is large.  In
my version, the units for vfs.read_max are 512-blocks (default 256, for
the old limit of 128K read-ahead with ffs's old default 16K-blocks).
The current default of 64 seems excessive with ffs's current default of
32K-blocks (2048K read-ahead).  My units are mostly better, but I
just noticed that they have a different too-delicate interaction with
application and kernel block sizes...
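
The difference in units is easy to see with a throwaway calculation
(user-space, numbers only; 64 and 256 are the defaults mentioned above):

#include <stdio.h>

int
main(void)
{
	long bsizes[] = { 512, 4096, 16384, 32768 };
	long read_max_fsblocks = 64;	/* current units: fs blocks */
	long read_max_512 = 256;	/* my units: 512-blocks */
	size_t i;

	for (i = 0; i < sizeof(bsizes) / sizeof(bsizes[0]); i++)
		printf("bsize %5ld: %7ld bytes in fs-block units, "
		    "%7ld bytes in 512-block units\n",
		    bsizes[i], read_max_fsblocks * bsizes[i],
		    read_max_512 * 512);
	return (0);
}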

@ +#else
@ +				    MAXPHYS,

The above gave sub-maximal clustering.  So does ffs's MAXBSIZE when
it is smaller than mnt_iosize_max.  In ~5.2, mnt_iosize_max is physical
and is usually DFLTPHYS == MAXBSIZE, so ffs's choice usually gives
maximal clusters.  However, in -current, mnt_iosize_max is virtual
and is usually MAXPHYS == 2 * MAXBSIZE.  So MAXPHYS is probably correct
here.

...  Then I noticed another problem.  MAXPHYS is twice mnt_iosize_max,
so the cluster size is only mnt_iosize_max = DFLTPHYS = 64K.  This
apparently acts badly with vfs.read_max = 256 512-blocks.  I think
it breaks read-ahead.  Throughput drops by a factor of 4 for
read-before-write relative to direct writes (not counting the factor
of 2 for the doubled i/o from the reads), although all the i/o sizes
are 64K.
Increasing vfs.read_max by just 16 fixes this.  The throughput drop
is then only 10-20% (there must be some drop for the extra seeks).
I'm not sure if extra read-ahead is good or bad here.  More read-ahead
in read-before-write reduces seeks, but it may also break drives'
caching and sequential heuristics.  My drives are old and have small
caches and are very sensitive to the i/o pattern for read-before-write.
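
For what the numbers are worth (this is only the arithmetic; why the
extra 16 helps is a guess):

#include <stdio.h>

int
main(void)
{
	long cluster = 64 * 1024;	/* mnt_iosize_max = DFLTPHYS here */

	/* 256 512-blocks of read-ahead is exactly two maximal clusters. */
	printf("256 * 512 = %ld bytes = %ld x 64K clusters\n",
	    256L * 512, 256L * 512 / cluster);
	/* 272 512-blocks leaves 8K of slack beyond the second cluster. */
	printf("272 * 512 = %ld bytes = 2 x 64K clusters + %ld bytes\n",
	    272L * 512, 272L * 512 - 2 * cluster);
	return (0);
}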

@ +#endif
@ +				    ioflag >> IO_SEQSHIFT, &bp);
@ +			} else {
@ +				error = bread(vp, bn, pmp->pm_bpcluster,
@ +				    NOCRED, &bp);
@ +			}
@  			if (error) {
@  				brelse(bp);

To complete getting mostly-full clusters for writing large files to msdosfs,
I hacked the block size heuristic some more to give larger blocks:

@ diff -u2 msdosfs_vfsops.c~ msdosfs_vfsops.c
@ --- msdosfs_vfsops.c~	Sun Jun 20 14:20:03 2004
@ +++ msdosfs_vfsops.c	Wed Jun 12 04:32:52 2013
@ @@ -519,7 +547,11 @@
@ 
@  	if (FAT12(pmp))
@ -		pmp->pm_fatblocksize = 3 * pmp->pm_BytesPerSec;
@ +		pmp->pm_fatblocksize = 3 * DEV_BSIZE;
@ +	else if (FAT16(pmp))
@ +		pmp->pm_fatblocksize = PAGE_SIZE;
@  	else
@ -		pmp->pm_fatblocksize = MSDOSFS_DFLTBSIZE;
@ +		pmp->pm_fatblocksize = DFLTPHYS;
@ +	pmp->pm_fatblocksize = roundup(pmp->pm_fatblocksize,
@ +	    pmp->pm_BytesPerSec);
@ 
@  	pmp->pm_fatblocksec = pmp->pm_fatblocksize / DEV_BSIZE;

I changed my version long ago to use 3*DEV_BSIZE for FAT12 and
PAGE_SIZE in all other cases.  3*pmp->pm_BytesPerSec is bogus since
IIRC a small file system doesn't need even 3 DEV_BSIZE sectors, and
when the sector size is larger than DEV_BSIZE it won't need 3
sectors.  MSDOSFS_DFLTBSIZE is 4096.  This and PAGE_SIZE are really
too small for huge FATs.  The FAT i/o size should really depend on
the size of the FAT, not on its type, but small sizes are more
robust and more efficient for sparse writes.  A larger size also
requires fewer buffers.  4K is not too bad, but 512 would be really
bad.
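
For concreteness, the patched heuristic boils down to the following
(user-space rendering; roundup() is spelled out and PAGE_SIZE is
assumed to be 4K):

#include <stdio.h>

#define DEV_BSIZE	512
#define PAGE_SIZE	4096			/* assumed */
#define DFLTPHYS	(64 * 1024)
#define roundup(x, y)	((((x) + ((y) - 1)) / (y)) * (y))

/* FAT block size as chosen by the msdosfs_vfsops.c hack above. */
static long
fatblocksize(int fat12, int fat16, long bytes_per_sec)
{
	long size;

	if (fat12)
		size = 3 * DEV_BSIZE;
	else if (fat16)
		size = PAGE_SIZE;
	else
		size = DFLTPHYS;
	return (roundup(size, bytes_per_sec));
}

int
main(void)
{
	printf("FAT12, 512-byte sectors:  %ld\n", fatblocksize(1, 0, 512));
	printf("FAT16, 2048-byte sectors: %ld\n", fatblocksize(0, 1, 2048));
	printf("FAT32, 512-byte sectors:  %ld\n", fatblocksize(0, 0, 512));
	return (0);
}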

I just remembered why I like small blocks.  They are more robust, and
clustering makes them efficient.  But clustering of the FAT isn't done.
The FAT blocks are normally written with bdwrite() but without
B_CLUSTEROK.  I think some clustering still occurs since !B_CLUSTEROK
is not honored.  The FAT blocks are read using bread().  I think this
is followed by breadn(), giving the old limited read-ahead, which isn't
nearly enough with 4K-blocks.

DFLTPHYS or MAXPHYS wasn't usable as a default until geom made the max
i/o size virtual, since it didn't work for devices with a lower
limit.  Neither did MAXBSIZE.  Even 4K might be larger than the
device limit.

Bruce

