missed clustering for small block sizes in cluster_wbuild()

Thu Jun 6 20:01:30 UTC 2013

Would someone please fix this fairly small bug is cluster_wbuild().

When the fs block size is 512 bytes and everything is contiguous,
cluster_write() prepares a complete cluster but cluster_build()
splits it into two with a runt of size PAGE_SIZE - 512 bytes for
the second part.

>From vfs_cluster.c:

@ 		/*
@ 		 * From this location in the file, scan forward to see
@ 		 * if there are buffers with adjacent data that need to
@ 		 * be written as well.
@ 		 */
@ 		for (i = 0; i < len; ++i, ++start_lbn) {
@ 			if (i != 0) { /* If not the first buffer */
@ 				s = splbio();
@ 				/*
@ 				 * If the adjacent data is not even in core it
@ 				 * can't need to be written.
@ 				 */
@ 				VI_LOCK(vp);
@ 				if ((tbp = gbincore(vp, start_lbn)) == NULL ||
@ 				    (tbp->b_vflags & BV_BKGRDINPROG)) {
@ 					VI_UNLOCK(vp);
@ 					splx(s);
@ 					break;
@ 				}
@ 
@ 				/*
@ 				 * If it IS in core, but has different
@ 				 * characteristics, or is locked (which
@ 				 * means it could be undergoing a background
@ 				 * I/O or be in a weird state), then don't
@ 				 * cluster with it.
@ 				 */
@ 				if (BUF_LOCK(tbp,
@ 				    LK_EXCLUSIVE | LK_NOWAIT | LK_INTERLOCK,
@ 				    VI_MTX(vp))) {
@ 					splx(s);
@ 					break;
@ 				}
@ 
@ 				if ((tbp->b_flags & (B_VMIO | B_CLUSTEROK |
@ 				    B_INVAL | B_DELWRI | B_NEEDCOMMIT))
@ 				    != (B_DELWRI | B_CLUSTEROK |
@ 				    (bp->b_flags & (B_VMIO | B_NEEDCOMMIT))) ||
@ 				    tbp->b_wcred != bp->b_wcred) {
@ 					BUF_UNLOCK(tbp);
@ 					splx(s);
@ 					break;
@ 				}
@ 
@ 				/*
@ 				 * Check that the combined cluster
@ 				 * would make sense with regard to pages
@ 				 * and would not be too large
@ 				 */
@ 				if ((tbp->b_bcount != size) ||
@ 				  ((bp->b_blkno + (dbsize * i)) !=
@ 				    tbp->b_blkno) ||
@ 				  ((tbp->b_npages + bp->b_npages) >
@ 				    (vp->v_mount->mnt_iosize_max / PAGE_SIZE))) {

This page count check is wrong.  Usually mnt_iosize_max is 128K and PAGE_SIZE
is 4K.  This gives a limit of 32 pages, which is normally enough to hold
256 512-blocks (very sparsely and wastefully mapped, but that is another,
much larger bug).  The pages up to the 31st are built up right and hold
248 512-blocks.  Then bp->b_npages is 31 and tbp->npages = 1.  The sum is
32 so the 32nd page is started right.  The 249th 512-block is allocated
in it.  Then for the next 512-block, bp->b_npages is 32 so the sum is 33
so we break out of the loop prematurely.  The buffer containing 249
512-blocks is written out.  cluster_wbuild() continues and builds and
writes out a runt buffer containing the remaining 7 512-blocks.

For a quick fix, I just removed this check.  cluster_write() should
never prepare a cluster too large to write out in 1 step. It uses
mnt_stat.f_iosize instead mnt_iosize_max for the limit, and that
should not be larger.  But there may be a special case where the
buffer contents is not aligned.  Then an extra page might be needed.

@ 					BUF_UNLOCK(tbp);
@ 					splx(s);
@ 					break;
@ 				}
@ 				/*
@ 				 * Ok, it's passed all the tests,
@ 				 * so remove it from the free list
@ 				 * and mark it busy. We will use it.
@ 				 */
@ 				bremfree(tbp);
@ 				tbp->b_flags &= ~B_DONE;
@ 				splx(s);
@ 			} /* end of code for non-first buffers only */
@ 			/* check for latent dependencies to be handled */
@ 			if ((LIST_FIRST(&tbp->b_dep)) != NULL) {
@ 				tbp->b_iocmd = BIO_WRITE;
@ 				buf_start(tbp);
@ 			}
@ 			/*
@ 			 * If the IO is via the VM then we do some
@ 			 * special VM hackery (yuck).  Since the buffer's
@ 			 * block size may not be page-aligned it is possible
@ 			 * for a page to be shared between two buffers.  We
@ 			 * have to get rid of the duplication when building
@ 			 * the cluster.
@ 			 */
@ 			if (tbp->b_flags & B_VMIO) {
@ 				vm_page_t m;
@ 
@ 				if (i != 0) { /* if not first buffer */
@ 					for (j = 0; j < tbp->b_npages; j += 1) {
@ 						m = tbp->b_pages[j];
@ 						if (m->flags & PG_BUSY) {
@ 							bqrelse(tbp);
@ 							goto finishcluster;
@ 						}
@ 					}
@ 				}
@ 				if (tbp->b_object != NULL)
@ 					VM_OBJECT_LOCK(tbp->b_object);
@ 				vm_page_lock_queues();
@ 				for (j = 0; j < tbp->b_npages; j += 1) {
@ 					m = tbp->b_pages[j];
@ 					vm_page_io_start(m);
@ 					vm_object_pip_add(m->object, 1);
@ 					if ((bp->b_npages == 0) ||
@ 					  (bp->b_pages[bp->b_npages - 1] != m)) {
@ 						bp->b_pages[bp->b_npages] = m;
@ 						bp->b_npages++;

The number of new pages needed is the number of increments here, which I
think is 1 less than bp->b_pages in most cases where the runt buffer is
built.

I think this is best fixed be fixed by removing the check above and
checking here.  Then back out of the changes.  I don't know this code
well enough to write the backing out easily.

@ 					}
@ 				}
@ 				vm_page_unlock_queues();
@ 				if (tbp->b_object != NULL)
@ 					VM_OBJECT_UNLOCK(tbp->b_object);
@ 			}
@ 			bp->b_bcount += size;
@ 			bp->b_bufsize += size;
@ 
@ 			s = splbio();
@ 			bundirty(tbp);
@ 			tbp->b_flags &= ~B_DONE;
@ 			tbp->b_ioflags &= ~BIO_ERROR;
@ 			tbp->b_flags |= B_ASYNC;
@ 			tbp->b_iocmd = BIO_WRITE;
@ 			reassignbuf(tbp, tbp->b_vp);	/* put on clean list */
@ 			VI_LOCK(tbp->b_vp);
@ 			++tbp->b_vp->v_numoutput;
@ 			VI_UNLOCK(tbp->b_vp);
@ 			splx(s);
@ 			BUF_KERNPROC(tbp);
@ 			TAILQ_INSERT_TAIL(&bp->b_cluster.cluster_head,
@ 				tbp, b_cluster.cluster_entry);
@ 		}

The loss of i/o performance from this wasn't too bad on my disks.  The
bug mainly made it harder to see the size of other clustering bugs by
watching iostat output.

Bruce