missed clustering for small block sizes in cluster_wbuild()

Bruce Evans brde at optusnet.com.au
Thu Jun 6 20:01:30 UTC 2013


Would someone please fix this fairly small bug is cluster_wbuild().

When the fs block size is 512 bytes and everything is contiguous,
cluster_write() prepares a complete cluster but cluster_build()
splits it into two with a runt of size PAGE_SIZE - 512 bytes for
the second part.

>From vfs_cluster.c:

@ 		/*
@ 		 * From this location in the file, scan forward to see
@ 		 * if there are buffers with adjacent data that need to
@ 		 * be written as well.
@ 		 */
@ 		for (i = 0; i < len; ++i, ++start_lbn) {
@ 			if (i != 0) { /* If not the first buffer */
@ 				s = splbio();
@ 				/*
@ 				 * If the adjacent data is not even in core it
@ 				 * can't need to be written.
@ 				 */
@ 				VI_LOCK(vp);
@ 				if ((tbp = gbincore(vp, start_lbn)) == NULL ||
@ 				    (tbp->b_vflags & BV_BKGRDINPROG)) {
@ 					VI_UNLOCK(vp);
@ 					splx(s);
@ 					break;
@ 				}
@ 
@ 				/*
@ 				 * If it IS in core, but has different
@ 				 * characteristics, or is locked (which
@ 				 * means it could be undergoing a background
@ 				 * I/O or be in a weird state), then don't
@ 				 * cluster with it.
@ 				 */
@ 				if (BUF_LOCK(tbp,
@ 				    LK_EXCLUSIVE | LK_NOWAIT | LK_INTERLOCK,
@ 				    VI_MTX(vp))) {
@ 					splx(s);
@ 					break;
@ 				}
@ 
@ 				if ((tbp->b_flags & (B_VMIO | B_CLUSTEROK |
@ 				    B_INVAL | B_DELWRI | B_NEEDCOMMIT))
@ 				    != (B_DELWRI | B_CLUSTEROK |
@ 				    (bp->b_flags & (B_VMIO | B_NEEDCOMMIT))) ||
@ 				    tbp->b_wcred != bp->b_wcred) {
@ 					BUF_UNLOCK(tbp);
@ 					splx(s);
@ 					break;
@ 				}
@ 
@ 				/*
@ 				 * Check that the combined cluster
@ 				 * would make sense with regard to pages
@ 				 * and would not be too large
@ 				 */
@ 				if ((tbp->b_bcount != size) ||
@ 				  ((bp->b_blkno + (dbsize * i)) !=
@ 				    tbp->b_blkno) ||
@ 				  ((tbp->b_npages + bp->b_npages) >
@ 				    (vp->v_mount->mnt_iosize_max / PAGE_SIZE))) {

This page count check is wrong.  Usually mnt_iosize_max is 128K and PAGE_SIZE
is 4K.  This gives a limit of 32 pages, which is normally enough to hold
256 512-blocks (very sparsely and wastefully mapped, but that is another,
much larger bug).  The pages up to the 31st are built up right and hold
248 512-blocks.  Then bp->b_npages is 31 and tbp->npages = 1.  The sum is
32 so the 32nd page is started right.  The 249th 512-block is allocated
in it.  Then for the next 512-block, bp->b_npages is 32 so the sum is 33
so we break out of the loop prematurely.  The buffer containing 249
512-blocks is written out.  cluster_wbuild() continues and builds and
writes out a runt buffer containing the remaining 7 512-blocks.

For a quick fix, I just removed this check.  cluster_write() should
never prepare a cluster too large to write out in 1 step. It uses
mnt_stat.f_iosize instead mnt_iosize_max for the limit, and that
should not be larger.  But there may be a special case where the
buffer contents is not aligned.  Then an extra page might be needed.

@ 					BUF_UNLOCK(tbp);
@ 					splx(s);
@ 					break;
@ 				}
@ 				/*
@ 				 * Ok, it's passed all the tests,
@ 				 * so remove it from the free list
@ 				 * and mark it busy. We will use it.
@ 				 */
@ 				bremfree(tbp);
@ 				tbp->b_flags &= ~B_DONE;
@ 				splx(s);
@ 			} /* end of code for non-first buffers only */
@ 			/* check for latent dependencies to be handled */
@ 			if ((LIST_FIRST(&tbp->b_dep)) != NULL) {
@ 				tbp->b_iocmd = BIO_WRITE;
@ 				buf_start(tbp);
@ 			}
@ 			/*
@ 			 * If the IO is via the VM then we do some
@ 			 * special VM hackery (yuck).  Since the buffer's
@ 			 * block size may not be page-aligned it is possible
@ 			 * for a page to be shared between two buffers.  We
@ 			 * have to get rid of the duplication when building
@ 			 * the cluster.
@ 			 */
@ 			if (tbp->b_flags & B_VMIO) {
@ 				vm_page_t m;
@ 
@ 				if (i != 0) { /* if not first buffer */
@ 					for (j = 0; j < tbp->b_npages; j += 1) {
@ 						m = tbp->b_pages[j];
@ 						if (m->flags & PG_BUSY) {
@ 							bqrelse(tbp);
@ 							goto finishcluster;
@ 						}
@ 					}
@ 				}
@ 				if (tbp->b_object != NULL)
@ 					VM_OBJECT_LOCK(tbp->b_object);
@ 				vm_page_lock_queues();
@ 				for (j = 0; j < tbp->b_npages; j += 1) {
@ 					m = tbp->b_pages[j];
@ 					vm_page_io_start(m);
@ 					vm_object_pip_add(m->object, 1);
@ 					if ((bp->b_npages == 0) ||
@ 					  (bp->b_pages[bp->b_npages - 1] != m)) {
@ 						bp->b_pages[bp->b_npages] = m;
@ 						bp->b_npages++;

The number of new pages needed is the number of increments here, which I
think is 1 less than bp->b_pages in most cases where the runt buffer is
built.

I think this is best fixed be fixed by removing the check above and
checking here.  Then back out of the changes.  I don't know this code
well enough to write the backing out easily.

@ 					}
@ 				}
@ 				vm_page_unlock_queues();
@ 				if (tbp->b_object != NULL)
@ 					VM_OBJECT_UNLOCK(tbp->b_object);
@ 			}
@ 			bp->b_bcount += size;
@ 			bp->b_bufsize += size;
@ 
@ 			s = splbio();
@ 			bundirty(tbp);
@ 			tbp->b_flags &= ~B_DONE;
@ 			tbp->b_ioflags &= ~BIO_ERROR;
@ 			tbp->b_flags |= B_ASYNC;
@ 			tbp->b_iocmd = BIO_WRITE;
@ 			reassignbuf(tbp, tbp->b_vp);	/* put on clean list */
@ 			VI_LOCK(tbp->b_vp);
@ 			++tbp->b_vp->v_numoutput;
@ 			VI_UNLOCK(tbp->b_vp);
@ 			splx(s);
@ 			BUF_KERNPROC(tbp);
@ 			TAILQ_INSERT_TAIL(&bp->b_cluster.cluster_head,
@ 				tbp, b_cluster.cluster_entry);
@ 		}

The loss of i/o performance from this wasn't too bad on my disks.  The
bug mainly made it harder to see the size of other clustering bugs by
watching iostat output.

Bruce


More information about the freebsd-fs mailing list