svn commit: r220791 - in head: lib/libc/sys sys/compat/freebsd32 sys/kern sys/sys

Mon Apr 18 20:14:55 UTC 2011

On Mon, Apr 18, 2011 at 12:28 PM, Kostik Belousov <kostikbel at gmail.com> wrote:
> On Mon, Apr 18, 2011 at 04:32:22PM +0000, Matthew D Fleming wrote:
>> Author: mdf
>> Date: Mon Apr 18 16:32:22 2011
>> New Revision: 220791
>> URL: http://svn.freebsd.org/changeset/base/220791
>>
>> Log:
>>   Add the posix_fallocate(2) syscall.  The default implementation in
>>   vop_stdallocate() is filesystem agnostic and will run as slow as a
>>   read/write loop in userspace; however, it serves to correctly
>>   implement the functionality for filesystems that do not implement a
>>   VOP_ALLOCATE.
>>
>>   Note that __FreeBSD_version was already bumped today to 900036 for any
>>   ports which would like to use this function.
>>
>>   Also reserve space in the syscall table for posix_fadvise(2).

>> +#ifdef __notyet__
>> +     /*
>> +      * Check if the filesystem sets f_maxfilesize; if not use
>> +      * VOP_SETATTR to perform the check.
>> +      */
>> +     error = VFS_STATFS(vp->v_mount, &sfs, td);
>> +     if (error != 0)
>> +             goto out;
>> +     if (sfs.f_maxfilesize) {
>> +             if (offset > sfs.f_maxfilesize || len > sfs.f_maxfilesize ||
>> +                 offset + len > sfs.f_maxfilesize) {
>> +                     error = EFBIG;
>> +                     goto out;
>> +             }
>> +     } else
>> +#endif
>> +     if (offset + len > vap->va_size) {
>> +             VATTR_NULL(vap);
>> +             vap->va_size = offset + len;
>> +             error = VOP_SETATTR(vp, vap, td->td_ucred);
>> +             if (error != 0)
>> +                     goto out;
>> +     }
> I still do not see a reason to do VOP_SETATTR() there. VOP_WRITE() will
> do auto-extend as needed. Also, see below.

The need is, as commented, to return EFBIG when the new file size will
be larger than the FS supports.  Without this code, passing in
something like posix_fallocate(fd, 0, OFF_MAX) will run the filesystem
out of space.

>> +
>> +     while (len > 0) {
>> +             if (should_yield()) {
>> +                     VOP_UNLOCK(vp, 0);
>> +                     locked = 0;
>> +                     kern_yield(-1);
> Please note that, despite dropping the vnode lock, the snapshot creation
> is still blocked at this point, due to previous vn_start_write().
>
> If doing vn_finished_write() there, then bwillwrite() before
> next iteration is desired.
>> +                     error = vn_lock(vp, LK_EXCLUSIVE);
>> +                     if (error != 0)
>> +                             break;
>> +                     locked = 1;
>> +                     error = VOP_GETATTR(vp, vap, td->td_ucred);
>> +                     if (error != 0)
>> +                             break;
>> +             }
>> +
>> +             /*
>> +              * Read and write back anything below the nominal file
>> +              * size.  There's currently no way outside the filesystem
>> +              * to know whether this area is sparse or not.
>> +              */
>> +             cur = iosize;
>> +             if ((offset % iosize) != 0)
>> +                     cur -= (offset % iosize);
>> +             if (cur > len)
>> +                     cur = len;
>> +             if (offset < vap->va_size) {
>> +                     aiov.iov_base = buf;
>> +                     aiov.iov_len = cur;
>> +                     auio.uio_iov = &aiov;
>> +                     auio.uio_iovcnt = 1;
>> +                     auio.uio_offset = offset;
>> +                     auio.uio_resid = cur;
>> +                     auio.uio_segflg = UIO_SYSSPACE;
>> +                     auio.uio_rw = UIO_READ;
>> +                     auio.uio_td = td;
>> +                     error = VOP_READ(vp, &auio, 0, td->td_ucred);
>> +                     if (error != 0)
>> +                             break;
>> +                     if (auio.uio_resid > 0) {
>> +                             bzero(buf + cur - auio.uio_resid,
>> +                                 auio.uio_resid);
>> +                     }
>> +             } else {
>> +                     bzero(buf, cur);
>> +             }
> Wouldn't VOP_SETATTR() at the start of the function mostly prevent
> this bzero from executing ?

Yes.  If struct statfs had a member indicating the file system's max
file size, then the extend wouldn't be necessary.  We have that
feature locally, but it's only implemented for ufs and our custom file
system, and it requires an ABI change so it's a bit of work to
upstream.  And as with most of those things, it's hard to find the
time to upstream it outside of work hours.

> I estimated what it would take to do the optimized implementation for UFS,
> and I think that the following change would allow to lessen the code
> duplication much.
>
> What if the vnode lock drop and looping be handled by the syscall, instead
> of the vop implementation ? In other words, allow the VOP_ALLOCATE()
> to  allocate less then requested, and return the allocated amount to
> the caller. The loop would be centralized then, freeing fs from doing
> the dance. Also, if fs considers that suitable, it would do a whole
> allocation in one run.

I like this idea.  Perhaps input to the vop should be pointers to
offset and len, and the vop can change them as it iterates? Otherwise
the return code must be overloaded to distinguish between an error
code and the len that has been handled so far.  And, I think the VOP
interface must return int, so there would be no way to indicate that
more than 2GB had been allocated.

So something in the kernel like:
	while ((error = VOP_ALLOCATE(vp, &offset, &len)) == EAGAIN) {
		VOP_UNLOCK(vp, 0);
		/* XXX unlock other resources */
		maybe_yield();
		bwillwrite();
		/* XXX vn_start_write dance */
		error = VOP_LOCK(vp, LK_EXCLUSIVE);
	}

Cheers,
matthew