cleaning files beyond EOF

Bruce Evans brde at optusnet.com.au
Sat Feb 23 23:54:47 UTC 2013


On Sun, 17 Feb 2013, Konstantin Belousov wrote:

> On Sun, Feb 17, 2013 at 06:01:50PM +1100, Bruce Evans wrote:
>> On Sun, 17 Feb 2013, Konstantin Belousov wrote:
>>
>>> On Sun, Feb 17, 2013 at 11:33:58AM +1100, Bruce Evans wrote:
>>>> I have a (possibly damaged) ffs data block with nonzero data beyond
>>>> EOF.  Is anything responsible for clearing this data when the file
>>>> is mmapped()?
>>>>
>>>> At least old versions of gcc mmap() the file and have a bug checking
>>>> for EOF.  They read the garbage beyond the end and get confused.
>>>
>>> Does the 'damaged' status of the data block mean that it contains the
>>> garbage after EOF on disk?
>>
>> Yes, it's at most software damage.  I used a broken version of
>> vfs_bio_clrbuf() for a long time and it probably left some unusual
>> blocks.  This matters surprisingly rarely.
> I recently had to modify the vfs_bio_clrbuf().  For me, a bug in the
> function did matter a lot, because the function is used, in particular,
> to clear the indirect blocks.  The bug caused quite random filesystem
> failures until I figured it out.  My version of vfs_bio_clrbuf() is
> at the end of the message, it avoids accessing b_data.

This will take me a long time to understand.  Indirect blocks seemed to
be broken to me too.  clrbuf() in 4.4BSD was just a simple bzero() of
the entire buffer followed by an (IMO bogus) setting of b_resid to 0.
It was used mainly to allocate blocks, including indirect blocks in ffs.
Now most file systems use vfs_bio_clrbuf() instead.  hpfs, ntfs and udf
still use clrbuf(), since they apparently didn't understand the FreeBSD
APIs when they were implemented, and they are too new to have been caught
by the global sweep that converted everything else to vfs_bio_clrbuf().
But vfs_bio_clrbuf()
has obscure semantics which seem to be different from those of clrbuf().
It reduces to clrbuf() in the non-VMIO case.  In the VMIO case it only
clears the previously "invalid" portions of the buffer.  I don't see
why "valid" implies zero, and testing showed that it didn't.  Cases where
the buffer ended up not all zero were rare and seemed to be only for
indirect blocks in ffs.  Also, the complexity of vfs_bio_clrbuf() is
bogus IMO.  The allocation will be followed by a physical write, so
avoiding re-zeroing parts of the buffer is an insignificant optimization
even if those parts are most of the buffer.
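
For reference, the clrbuf() semantics described above amount to roughly
this (a minimal sketch of the semantics, not the exact sys/buf.h text):

	/* Old clrbuf(): zero the whole buffer, then clear b_resid. */
	bzero(bp->b_data, (u_int)bp->b_bcount);
	bp->b_resid = 0;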

>> I forgot to mention that this is with an old version of FreeBSD,
>> where I changed vfs_bio.c a lot but barely touched vm.
>>
>>> UFS uses a small wrapper around vnode_generic_getpages() as the
>>> VOP_GETPAGES(), the wrapping code can be ignored for the current
>>> purpose.
>>>
>>> vnode_generic_getpages() iterates over the pages after the bstrategy()
>>> and marks the part of the page after EOF valid and zeroes it, using
>>> vm_page_set_valid_range().
>>
>> The old version has a large non-wrapper in ffs, and vnode_generic_getpages()
>> uses vm_page_set_validclean().  Maybe the bug is just in the old
>> ffs_getpages().  It seems to do only DEV_BSIZE'ed zeroing stuff.  It
>> begins with the same "We have to zero that data" code that forms most
>> of the wrapper in the current version.  It normally only returns
>> vnode_pager_generic_getpages() after that if bsize < PAGE_SIZE.
>> However, my version has a variable which I had forgotten about to
>> control this, and the forgotten setting of this variable results in
>> always using vnode_pager_generic_getpages(), as in -current.  I probably
>> copied some fixes in -current for this.  So the bug can't be just in
>> ffs_getpages().
>>
>> The "damaged" block is at the end of vfs_default.c.  The file size is
>> 25 * PAGE_SIZE + 16.  It is in 7 16K blocks, 2 full 2K frags, and 1 frag
>> with 16 bytes valid in it.
> But the ffs_getpages() might be indeed the culprit. It calls
> vm_page_zero_invalid(), which only has DEV_BSIZE granularity. I think
> that ffs_getpages() also should zero the after eof part of the last page
> of the file to fix your damage, since device read cannot read less than
> DEV_BSIZE.

Is that for the old ffs_getpages()?  I think it clears the DEV_BSIZE
sub-blocks in the page starting at the first "invalid" one.  For my
file's layout, that is starting at offset DEV_BSIZE, since the sub-block
at offset 0 has 16 bytes valid in it and of course the whole sub-block
is valid at the VMIO level.  I tried to verify this by checking that
the unzeroed bytes were from offset 16 to 511, but unfortunately the
problem went away soon after I wrote the first mail about this :-).
It was very reproducible until then.  I checked using fsdb that the
data block is still unzeroed.
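
To make that concrete, the arithmetic is roughly this (an illustrative
sketch only, assuming PAGE_SIZE = 4096 and DEV_BSIZE = 512, with
roundup() as in sys/param.h):

	off_t size = 25 * 4096 + 16;		/* i_size of the file */
	int lastoff = size & (4096 - 1);	/* 16 valid bytes in last page */
	int firstzero = roundup(lastoff, 512);	/* 512: first sub-block zeroed */
	/*
	 * The sub-block at [0, 511] is "valid" because of those 16 bytes,
	 * so only [512, 4095] gets zeroed and the on-disk garbage at
	 * [16, 511] shows through.
	 */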

> diff --git a/sys/ufs/ffs/ffs_vnops.c b/sys/ufs/ffs/ffs_vnops.c
> index ef6194c..4240b78 100644
> --- a/sys/ufs/ffs/ffs_vnops.c
> +++ b/sys/ufs/ffs/ffs_vnops.c
> @@ -844,9 +844,9 @@ static int
> ffs_getpages(ap)
> 	struct vop_getpages_args *ap;
> {
> -	int i;
> 	vm_page_t mreq;
> -	int pcount;
> +	uint64_t size;
> +	int i, pcount;
>
> 	pcount = round_page(ap->a_count) / PAGE_SIZE;
> 	mreq = ap->a_m[ap->a_reqpage];
> @@ -861,6 +861,9 @@ ffs_getpages(ap)
> 	if (mreq->valid) {
> 		if (mreq->valid != VM_PAGE_BITS_ALL)
> 			vm_page_zero_invalid(mreq, TRUE);
> +		size = VTOI(ap->a_vp)->i_size;
> +		if (mreq->pindex == OFF_TO_IDX(size))
> +			pmap_zero_page_area(mreq, size & PAGE_MASK, PAGE_SIZE);
> 		for (i = 0; i < pcount; i++) {
> 			if (i != ap->a_reqpage) {
> 				vm_page_lock(ap->a_m[i]);

I saw your later mail with this fix corrected (clearing PAGE_SIZE bytes is
too much).
I think I would write it without the pindex check:

 		off = VTOI(ap->a_vp)->i_size & PAGE_MASK;
 		if (off != 0)
 			pmap_zero_page_area(mreq, off, PAGE_SIZE - off);

>
> On the other hand, it is not clear whether we should indeed protect
> against such a case, or just declare the disk data broken.

Then file systems written by different, less strict implementations of
ffs would not work.  It is unclear if ffs is specified to clear bytes
beyond the end when writing files.  But I think it should clear up to
the next fragment boundary.  After that, it is vm's responsibility to
clear.
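
As a sketch of that policy (illustrative only; ip, fs and lastfragdata,
a pointer to the in-core copy of the fragment containing EOF, are
placeholders rather than the actual ffs write path):

	/* Zero from EOF to the end of the fragment that contains it. */
	fragoff = ip->i_size % fs->fs_fsize;
	if (fragoff != 0)
		bzero(lastfragdata + fragoff, fs->fs_fsize - fragoff);
	/* Anything past that fragment boundary is left for vm to clear. */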

Bruce

