posix_fallocate(2)

Fri Apr 15 16:22:19 UTC 2011

2011/4/15 Kostik Belousov <kostikbel at gmail.com>:
> On Thu, Apr 14, 2011 at 03:41:30PM -0700, mdf at freebsd.org wrote:
>> On Thu, Apr 14, 2011 at 2:36 PM, Gleb Kurtsou <gleb.kurtsou at gmail.com> wrote:
>> > On (14/04/2011 12:35), mdf at FreeBSD.org wrote:
>> >> For work we need a functionality in our filesystem that is pretty much
>> >> like posix_fallocate(2), so we're using the name and I've added a
>> >> default VOP_ALLOCATE definition that does the right, but dumb, thing.
>> >>
>> >> The most recent mention of this function in FreeBSD was another thread
>> >> lamenting its failure to exist:
>> >> http://lists.freebsd.org/pipermail/freebsd-ports/2010-February/059268.html
>> >>
>> >> The attached files are the core of the kernel implementation of the
>> >> syscall and a default VOP for any filesystem not supporting
>> >> VOP_ALLOCATE, which allows the syscall to work as expected but in a
>> >> non-performant manner.  I didn't see this syscall in NetBSD or
>> >> OpenBSD, so I plan to add it to the end of our syscall table.
>> >>
>> >> What I wanted to check with -arch about was:
>> >>
>> >> 1) is there still a desire for this syscall?
>> > It looks not to play well architecturally with modern COW file systems
>> > like ZFS and HAMMER. So potentially it can be implemented only for UFS.
>>
>> The syscall, or the dumb implementation?  I don't see why the syscall
>> itself would be a problem; presumably ZFS can figure out whether an
>> fallocate() block is worth COWing or not...
>>
>> >> 2) is this naive implementation useful enough to serve as a default
>> >> for all filesystems until someone with more knowledge fills them in?
>> > Maillist ate the patch. Only man page attached.
>>
>> Whoops!
>>
>> http://people.freebsd.org/~mdf/bsd-fallocate.diff
>
> New syscall symbols for 9.0 should go in under FBSD_1.2 version, not FBSD_1.0.

Okay, fixed.

> You have inconsistent spacing in the kern_posix_fallocate().

Oops; copy/paste error; fixed.

> I do not quite understand the locking for vnode you did.
> You marked the vop as taking and returning unlocked vnode. But, you
> do call VOP_GETATTR in the vop std implementation before locking the vnode.
> Did you test with a DEBUG_VFS_LOCKS config?

I have mostly tested on the version of FreeBSD we run at work which
has some small KPI modifications.  I will test and fix up on CURRENT
once I figure out prove(1).  As for locking:

(1) For $WORK, FreeBSD's locking of a "file" is problematic, since we
have both an inode lock and a data lock, and much of the time we don't
need the inode locked exclusively, just the data, which we
handle inside the VOP.

(2) I don't want 1TB to be allocated in a single operation, under a
single lock, so the implementation is responsible for unlocking and
taking a breather as needed.

(3) I based the VOP_GETATTR on vn_stat which calls VOP_GETATTR without
any lock.  Except, hmm, it looks like vn_statfile(9) takes the lock.
I was trying to avoid a lock/unlock cycle when the file didn't need to
be extended, but I can put it back in.
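
To illustrate point (2), here is a rough userspace sketch of "lock, do one
chunk, unlock, repeat".  It is not code from the patch: struct fake_file,
chunked_allocate() and CHUNK are hypothetical names, and a pthread mutex
merely stands in for the vnode lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical chunk size; a real filesystem would choose its own. */
#define CHUNK ((size_t)1 << 20)

/* Hypothetical stand-in for a file; not a kernel structure. */
struct fake_file {
	pthread_mutex_t lock;	/* stands in for the vnode lock */
	size_t allocated;	/* bytes "allocated" so far */
	unsigned lock_cycles;	/* number of lock/unlock cycles taken */
};

/*
 * Allocate len bytes, holding the lock for at most one chunk at a time
 * so that other users of the file can get in between chunks.
 */
static int
chunked_allocate(struct fake_file *f, size_t len)
{
	assert(f != NULL);
	while (len > 0) {
		size_t n = len < CHUNK ? len : CHUNK;

		pthread_mutex_lock(&f->lock);
		f->allocated += n;	/* the real work would happen here */
		f->lock_cycles++;
		pthread_mutex_unlock(&f->lock);
		len -= n;
	}
	return (0);
}
```

The cost of this shape is that the file can change between chunks (a
truncate, for instance), so each iteration has to cope with state that
moved while the lock was dropped.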

> Usual (and proper) practice is to have such vop require locked vnode, in
> case of VOP_ALLOCATE, exclusive lock is appropriate. The Giant dance and
> vn_start_write() + vn_lock() go into kern_posix_fallocate() then.
> Also, you should call bwillwrite() before taking any vfs locks.
>
> Is the locking/unlocking of the vnode in a loop done to allow other callers
> to perform i/o on the vnode in between? In particular, to truncate it?
> I think this is not needed, and previous suggestion would take care of it.

See above; it is not acceptable in my mind to lock the vnode for the
entire length of the operation, so the locking is managed by the VOP.

> Why do you need stdallocate_extend() ? VOP_WRITE does the right thing
> with extending the vnode.

I was trying to simplify the implementation to an easy read/write loop,
since it isn't supposed to be performant, just to get the right data.
I could instead VOP_GETATTR on each loop to check file size and write
zeros past the current file size, but that was more logic than a
single VOP_SETATTR followed by read/write.
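
The "extend once, then read/write" shape can be sketched in userspace as
follows.  This is only an analogue, not the kernel code from the patch:
naive_fallocate() is a hypothetical name, ftruncate(2) plays the role of
the single VOP_SETATTR, and pread(2)/pwrite(2) play the role of the
read/write VOPs.  It also ignores concurrent writers, which could have
their data overwritten by the write-back.

```c
#include <assert.h>
#include <errno.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical naive allocator: grow the file once if needed, then read
 * every block in [offset, offset + len) and write it back unchanged so
 * the filesystem has to back the range with real storage.
 * Returns 0 or an errno value, like posix_fallocate(2).
 */
static int
naive_fallocate(int fd, off_t offset, off_t len)
{
	char buf[4096];
	struct stat sb;
	off_t end;

	if (offset < 0 || len <= 0)
		return (EINVAL);
	end = offset + len;
	if (fstat(fd, &sb) == -1)
		return (errno);
	/* Single size change up front; the new tail reads back as zeros. */
	if (sb.st_size < end && ftruncate(fd, end) == -1)
		return (errno);

	while (offset < end) {
		size_t want = sizeof(buf);
		ssize_t got, put;

		if ((off_t)want > end - offset)
			want = (size_t)(end - offset);
		got = pread(fd, buf, want, offset);
		if (got == -1)
			return (errno);
		if (got == 0)
			break;		/* file shrank underneath us */
		put = pwrite(fd, buf, (size_t)got, offset);
		if (put == -1)
			return (errno);
		if (put == 0)
			return (EIO);	/* no progress; avoid spinning */
		assert(put <= got);
		offset += put;
	}
	return (0);
}
```

The alternative mentioned above, checking the size on every iteration and
writing zeros past end of file, avoids the up-front size change but needs
a second code path inside the loop.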

> You might find vn_rdwr easier to use than the bare vops. In particular,
> it would not omit the mac calls for read/write.

I checked for write already in kern_posix_fallocate().  A single check
should be sufficient.

For other threads, please note I don't know anything about UFS
implementations and I can't provide a ufs_allocate() that does rapid
allocation of logically zero blocks.  My intent is to provide the
framework, a default implementation that meets the spec'd behaviour,
and a set of testcases suitable to run for any filesystem that wants
to verify their implementation.
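
As an illustration of what one such testcase might check, here is a
minimal userspace sketch.  It is not taken from the actual test suite;
check_fallocate() is a hypothetical helper.  It verifies two things POSIX
specifies for posix_fallocate(2): the file size becomes offset + len only
if that is past the old end of file, and bytes past the old end of file
read back as zeros.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical testcase helper: call posix_fallocate() and check the
 * visible postconditions.  Returns 0 if they hold, -1 otherwise.
 */
static int
check_fallocate(int fd, off_t offset, off_t len)
{
	struct stat before, after;
	off_t pos, want;
	char c;

	if (fstat(fd, &before) == -1)
		return (-1);
	if (posix_fallocate(fd, offset, len) != 0)
		return (-1);
	if (fstat(fd, &after) == -1)
		return (-1);

	/* Size grows to offset + len, or stays put if already larger. */
	want = offset + len;
	if (before.st_size > want)
		want = before.st_size;
	if (after.st_size != want)
		return (-1);

	/* Anything past the old end of file must read as zero.
	 * Byte-at-a-time is slow; fine for small test ranges. */
	for (pos = before.st_size; pos < after.st_size; pos++) {
		if (pread(fd, &c, 1, pos) != 1 || c != 0)
			return (-1);
	}
	return (0);
}
```

A fuller suite would also cover the error cases (bad descriptor, negative
offset, zero length) and confirm that data already in the file is left
untouched.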

Thanks,
matthew