(in)appropriate uses for MAXBSIZE

Andriy Gapon avg at freebsd.org
Wed Apr 14 08:10:25 UTC 2010


on 14/04/2010 09:38 Bruce Evans said the following:
> On Mon, 12 Apr 2010, Andriy Gapon wrote:
> 
>> on 11/04/2010 05:56 Bruce Evans said the following:
>>> On Fri, 9 Apr 2010, Andriy Gapon wrote:
>> [snip]
>>>> I have lightly tested this under qemu.
>>>> I used my avgfs:) modified to issue 4*MAXBSIZE bread-s.
>>>> I removed size > MAXBSIZE check in getblk (see a parallel thread
>>>> "panic: getblk:
>>>> size(%d) > MAXBSIZE(%d)").
>>>
>>> Did you change the other known things that depend on this?  There is the
>>> b_pages limit of MAXPHYS bytes which should be checked for in another
>>> way
>>
>> I changed the check the way I described in the parallel thread.
> 
> I didn't notice anything there about checking MAXPHYS instead of MAXBSIZE.
> Was an explicit check needed?  (An implicit check would probably have
> worked: most clients were limited by the MAXBSIZE check, and the pbuf
> client always uses MAXPHYS or DFLTPHYS.)

I added this:
--- a/sys/kern/vfs_bio.c
+++ b/sys/kern/vfs_bio.c
@@ -2541,8 +2541,8 @@ getblk(struct vnode * vp, daddr_t blkno, int size, int slpflag, int slptimeo,

 	CTR3(KTR_BUF, "getblk(%p, %ld, %d)", vp, (long)blkno, size);
 	ASSERT_VOP_LOCKED(vp, "getblk");
-	if (size > MAXBSIZE)
-		panic("getblk: size(%d) > MAXBSIZE(%d)\n", size, MAXBSIZE);
+	if (size > MAXPHYS)
+		panic("getblk: size(%d) > MAXPHYS(%d)\n", size, MAXPHYS);

 	bo = &vp->v_bufobj;
 loop:

It wasn't really needed 'by default' but, as I said, I use my own "filesystem"
for testing and in it I do all kinds of nasty things like huge bread()-s.
So I had to add the check to get a nice panic instead of a crash and/or
corruption a little bit later.
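For reference, the kind of oversized read my test filesystem issues looks
roughly like this (devvp and blkno are placeholders; the 4 * MAXBSIZE size is
what used to trip the old check):

	struct buf *bp;
	int error;

	/* Oversized test read: previously panicked because
	 * 4 * MAXBSIZE > MAXBSIZE, now allowed as long as it is <= MAXPHYS. */
	error = bread(devvp, blkno, 4 * MAXBSIZE, NOCRED, &bp);
	if (error) {
		brelse(bp);
		return (error);
	}
	/* ... consume bp->b_data ... */
	brelse(bp);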

>>> and the soft limits for hibufspace and lobufspace which only matter
>>> under load conditions.
>>
>> And what these should be?
>> hibufspace and lobufspace seem to be auto-calculated.  One thing that
>> I noticed
>> and that was a direct cause of the problem described below, is that
>> difference
>> between hibufspace and lobufspace should be at least the maximum block
>> size
>> allowed in getblk() (perhaps it should be strictly equal to that value?).
>> So in my case I had to make that difference MAXPHYS.
> 
> Hard to say.  They are mostly only heuristics which mostly only matter
> under
> heavy loads.  You can change the defaults using sysctl but it is even
> harder
> to know what changes might be good without knowing the details of the
> implementation.

Yes.
I ended up with this change:
--- a/sys/kern/vfs_bio.c
+++ b/sys/kern/vfs_bio.c
@@ -613,7 +613,7 @@ bufinit(void)
 	 */
 	maxbufspace = (long)nbuf * BKVASIZE;
 	hibufspace = lmax(3 * maxbufspace / 4, maxbufspace - MAXBSIZE * 10);
-	lobufspace = hibufspace - MAXBSIZE;
+	lobufspace = hibufspace - MAXPHYS;

 	lorunningspace = 512 * 1024;
 	hirunningspace = 1024 * 1024;

Otherwise, in a situation where we need a buffer of size 'size' and bufspace <
lobufspace but bufspace + size > hibufspace, the logic in getnewbuf() falls
through the cracks.  The MAXBSIZE => MAXPHYS change reflects that we now support
bread()-s as large as MAXPHYS.
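To state explicitly the invariant that the change relies on, something like the
following (a hypothetical assertion, not present in vfs_bio.c) would have caught
my configuration:

	/*
	 * Hypothetical sanity check for bufinit(): the hibufspace/lobufspace
	 * gap must be at least the largest buffer size getblk() accepts,
	 * otherwise getnewbuf() can see bufspace < lobufspace while
	 * bufspace + size > hibufspace and the request stalls on needsbuffer.
	 */
	KASSERT(hibufspace - lobufspace >= MAXPHYS,
	    ("bufinit: hibufspace - lobufspace (%ld) < MAXPHYS (%d)",
	    hibufspace - lobufspace, MAXPHYS));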

>>>> And I bumped MAXPHYS to 1MB.
>>>> ...
>>>> But I observed reading process (dd bs=1m on avgfs) spending a lot of
>>>> time sleeping
>>>> on needsbuffer in getnewbuf.  needsbuffer value was VFS_BIO_NEED_ANY.
>>>> Apparently there was some shortage of free buffers.
>>>> Perhaps some limits/counts were incorrectly auto-tuned.
>>>
>>> This is not surprising, since even 64K is 4 times too large to work
>>> well.  Buffer sizes of larger than BKVASIZE (16K) always cause
>>> fragmentation of buffer kva.  ...
>>
>> So, BKVASIZE is the best read size from the point of view of buffer
>> space usage?
> 
> It is the best buffer size, which is almost independent of the best read
> size.  First, userland reads will be re-blocked into file-system-block-size
> reads...

Umm, I meant 'bread() size', sorry for not being explicit.
It doesn't make much sense to talk about userland in this context.

>> E.g. a single MAXBSIZE=64K read results in a single 64K GEOM read
>> requests, but
>> leads to buffer space map fragmentation, because of size > BKVASIZE.
>> On the other hand, four sequential reads of BKVASIZE=16K bytes are
>> perfect from
>> buffer space point of view (no fragmentation potential) but they
>> result in 4 GEOM
> 
> Clustering occurs above geom, so geom only sees small requests for small
> files, random accesses, and buggy cases for sequential accesses to large
> files where the bugs give partial randomness.

Yes, provided that the cluster API is used.
Which is the case in most places in most filesystems, but not in all of them.
E.g. my own "filesystem", which I mention only as a joke.
And, e.g., msdosfs_readdir, which bread()-s a whole FAT cluster in one go, and,
as we learned, that could be a lot (if the sector size is large).
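To make the msdosfs_readdir case concrete, it does roughly the following
(identifiers quoted from memory, so treat them as approximate): a single
bread() whose size is a whole cluster, not something bounded by a small fs
block size.

	/* Rough shape of the msdosfs_readdir directory read (approximate): */
	blsize = pmp->pm_bpcluster;		/* one full cluster */
	error = bread(pmp->pm_devvp, bn, blsize, NOCRED, &bp);
	if (error) {
		brelse(bp);
		return (error);
	}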

> E.g., a single 64K read from userland normally gives 4 16K ffs blocks
> in the buffer cache.  Clustering turns these into 1 128K block in a
> pbuf (64K for the amount read now and 128K for read-ahead; there may
> be more read-ahead but it would go in another pbuf).

Indeed.  I missed the fact that cluster I/O uses a different kind of buffers
(pbufs), allocated from a different space.  Those are uniformly sized at MAXPHYS.

> geom then sees
> the 128K (MAXPHYS) block.  Most device drivers still only support i/o's
> of size <= DFLTPHYS, but geom confuses the clustering code into producing
> clusters larger than its intended maximum of what the device supports
> by advertising support for MAXPHYS (v_mount->mnt_iosize_max).

Oh, I missed this.  GEOM always setting si_iosize_max to MAXPHYS seems like a
bug.  The actual hardware/driver capabilities need to be honored.
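The chain I mean, paraphrased from memory (so the exact lines may differ):
geom's dev node advertises MAXPHYS no matter what the driver underneath can
actually do, and filesystems then copy that into mnt_iosize_max at mount time,
which is what the clustering code consults.

	/* Paraphrased from memory, geom_dev.c, g_dev_taste(): */
	dev->si_iosize_max = MAXPHYS;	/* regardless of the real driver limit */

	/* Paraphrased from memory, ffs_vfsops.c, ffs_mountfs(): */
	if (devvp->v_rdev->si_iosize_max != 0)
		mp->mnt_iosize_max = devvp->v_rdev->si_iosize_max;
	if (mp->mnt_iosize_max > MAXPHYS)
		mp->mnt_iosize_max = MAXPHYS;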

> So geom
> normally turns the 128K request into 2 64K requests.  Clustering
> finishes by converting the 128K request into 8 16K requests (4 for use
> now and 4 later for read-ahead).
> 
> OTOH, the first block of 4 sequential reads of 16K produces the same
> 128K block at the geom level, modulo bugs in the read-ahead.  This now
> consists of 1 and 7 blocks of normal read and read-ahead, respectively,
> instead of 4 and 4.  Then the next 3 blocks are found in the buffer
> cache as read-ahead instead of read from the disk (actually, this is
> insignificantly different from the first case after ffs splits up the
> 64K into 4 times 16K).
> 
> So the block size makes almost no difference at the syscall level
> (512-blocks take significantly more CPU but improve latency, while
> huge blocks take significantly less CPU but significantly unimprove
> latency).
> 
> The file system block size makes only secondary differences:
> - clustering only works to turn small logical i/o's into large physical
>   ones when sequential blocks are allocated sequentially, but always
>   allocating blocks sequentially is hard to do and using large file
>   system blocks reduces the loss when the allocation is not sequential
> - large file system blocks also reduce the amount of work that clustering
>   has to do to reblock.  This benefit is much smaller than the previous
>   one.
> - the buffer cache is only designed to handle medium-sized blocks well.
>   With 512-blocks, it can only hold 1/32 as much as with 16K-blocks,
>   so it will thrash 32 times as much with the former.  Now that the
>   thrashing is to VMIO instead of to the disk, this only wastes CPU.
>   With any block size larger than BKVASIZE, the buffer cache may become
>   fragmented, depending on the combination of block sizes.  Mixed
>   combinations are the worst, and the system doesn't do anything to
>   try to avoid them.  The worst case is a buffer cache full of 512-blocks,
>   with getblk() wanting to allocate a 64K-block.  Then it needs to
>   wait for 32 contiguous blocks to become free, or forcibly free some,
>   or move some...

Agree.

>> I/O requests.
>> The thing is that a single read requires a single contiguous virtual
>> address space
>> chunk.  Would it be possible to take the best of both worlds by
>> somehow allowing a
>> single large I/O request to work with several buffers (with b_kvasize
>> == BKVASIZE)
>> in a iovec-like style?
>> Have I just reinvented bicycle? :)
>> Probably not, because an answer to my question is probably 'not
>> (without lots of
>> work in lots of places)' as well.
> 
> Separate buffers already partly provided this, and combined with command
> queuing in the hardware they provided it completely in perhaps a better
> way than can be done in software.
> 
> vfs clustering attempts much less but is still complicated.  It mainly wants
> to convert buffers that have contiguous disk addresses into a super-buffer
> that has contiguous virtual memory and combine this with read-ahead, to
> reduce the number of i/o's.  All drives less than 10 years old benefit
> only marginally from this, since the same cases that vfs clustering can
> handle are also easy for drive clustering, caching and read-ahead/write-
> behind (especially the latter) to handle even better, so I occasionally
> try turning off vfs clustering to see if it makes a difference;
> unfortunately it still seems to help on all drives, including even
> reducing total CPU usage despite its own large CPU usage.

I think this mainly shows that our code doesn't handle non-clustered I/O
optimally.  For example, all calls of breadn() that I see specify only one
read-ahead block (see the sketch below).  And all bread*()-s are, of course,
of fs block size.
So, with e.g. typical 8KB blocks we get lots of GEOM-level and hardware-level
I/O going back and forth.
While, as you say, the disks may handle that well, it is not optimal for
communication between disks and controllers, and it's definitely bad for the
GEOM and driver layers.
Essentially, what you wrote in the next quoted paragraph :-)
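For reference, the read side of ffs is shaped roughly like this (condensed
from memory, so treat the exact arguments as approximate); the breadn() branch
is the single-read-ahead-block fallback I mean, and it is only taken when
clustering is disabled:

	if (lblktosize(fs, nextlbn) >= ip->i_size) {
		/* last block of the file: plain bread() */
		error = bread(vp, lbn, size, NOCRED, &bp);
	} else if ((vp->v_mount->mnt_flag & MNT_NOCLUSTERR) == 0) {
		/* normal case: clustering builds a large pbuf-backed I/O */
		error = cluster_read(vp, ip->i_size, lbn, size, NOCRED,
		    blkoffset + uio->uio_resid, seqcount, &bp);
	} else if (seqcount > 1) {
		/* clustering disabled: at most one read-ahead block */
		int nextsize = blksize(fs, ip, nextlbn);
		error = breadn(vp, lbn, size, &nextlbn, &nextsize, 1,
		    NOCRED, &bp);
	} else {
		error = bread(vp, lbn, size, NOCRED, &bp);
	}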

>> I see that breadn() certainly doesn't work that way.  As I understand,
>> it works
>> like bread() for one block plus starts something like 'asynchronous
>> breads()' for
>> a given count of other blocks.
> 
> Usually breadn() isn't called, but clustering reads to the end of the
> current
> cluster or maybe the next cluster.  breadn() was designed when reading
> ahead a single cluster was helpful.  Now, drives read-ahead a whole track
> or similar probably hundreds of sectors, so reading ahead a single sector
> is almost useless.  It doesn't even reduce the number of i/o's unless it is
> clustered with the previous i/o.

Yes.  (I should have read the whole email before starting my reply).
Perhaps it makes sense now to change the breadn() interface and turn it into a
simpler/cheaper version of cluster_read().  I don't really see much point in
passing an array of block numbers and block sizes.
E.g. perhaps something like:
breadn(vp, startblock, blocksize, count, &bp)
where the filesystem must ensure that all the requested blocks are contiguous.
Then breadn() could read and read ahead those blocks using an optimal block
size, not the fs block size.
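As a sketch of what I have in mind (a hypothetical prototype, invented here,
not an existing interface):

	/*
	 * Hypothetical replacement interface: read 'count' contiguous blocks
	 * of 'blocksize' bytes starting at 'startblock', return the first one
	 * in *bpp and issue read-ahead for the rest using whatever physical
	 * I/O size is optimal, not the fs block size.
	 */
	int breadn(struct vnode *vp, daddr_t startblock, int blocksize,
	    int count, struct buf **bpp);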

>> I am not sure about details of how cluster_read() works, though.
>> Could you please explain the essence of it?
> 
> See above.  It is essentially the old hack of reading ahead a whole
> track in software, done in a sophisticated way but with fewer attempts
> to satisfy disk geometry timing requirements.  Long ago, everything
> was so slow that sequential reads done from userland could not keep
> up with even a floppy disk, but sequential i/o's done from near the
> driver could, even with i/o's of only 1 sector.  I've only ever seen
> this working well for floppy disks.  For hard disks, the i/o's need
> to be multi-sector, and needed to be related to the disk geometry
> (handle full tracks and don't keep results from intermediate sectors
> that are not needed yet iff doing so wouldn't thrash the cache).  Now,
> it is unreasonable to try to know the disk geometry, and vfs clustering
> doesn't try.  Fortunately, this is not much needed, since newer drives
> have their own track caches which, although they don't form a complete
> replacement for vfs clustering (see above), they reduce the losses
> to extra non-physical reads.  Similarly for another problem with vfs:
> all buffers and their clustering are file (vnode) based, which almost
> forces missing intermediate sectors when reading a file, but a working
> device (track or similar) cache in the drive mostly compensates for not having
> one in the OS.

Thank you for the explanation.
I mostly wondered how clustering worked with the buffer cache; somehow I was
overlooking the whole pbuf thing and couldn't put all the pieces together.


-- 
Andriy Gapon

