mdf at FreeBSD.org
Fri Apr 15 16:28:03 UTC 2011
On Fri, Apr 15, 2011 at 3:54 AM, Gleb Kurtsou <gleb.kurtsou at gmail.com> wrote:
> On (14/04/2011 15:41), mdf at FreeBSD.org wrote:
>> On Thu, Apr 14, 2011 at 2:36 PM, Gleb Kurtsou <gleb.kurtsou at gmail.com> wrote:
>> > On (14/04/2011 12:35), mdf at FreeBSD.org wrote:
>> >> For work we need functionality in our filesystem that is pretty much
>> >> like posix_fallocate(2), so we're using the name and I've added a
>> >> default VOP_ALLOCATE definition that does the right, but dumb, thing.
>> >> The most recent mention of this function in FreeBSD was another thread
>> >> lamenting its failure to exist:
>> >> http://lists.freebsd.org/pipermail/freebsd-ports/2010-February/059268.html
>> >> The attached files are the core of the kernel implementation of the
>> >> syscall and a default VOP for any filesystem not supporting
>> >> VOP_ALLOCATE, which allows the syscall to work as expected but in a
>> >> non-performant manner. I didn't see this syscall in NetBSD or
>> >> OpenBSD, so I plan to add it to the end of our syscall table.
>> >> What I wanted to check with -arch about was:
>> >> 1) is there still a desire for this syscall?
>> > It does not look like it plays well architecturally with modern COW file
>> > systems like ZFS and HAMMER. So potentially it can be implemented only for UFS.
>> The syscall, or the dumb implementation? I don't see why the syscall
>> itself would be a problem; presumably ZFS can figure out whether an
>> fallocate() block is worth COWing or not...
> It is good to have if there is a chance of getting a real implementation for
> UFS. Having only a dumb implementation will fool user software into
> believing that we support it.
> As far as I understand, ZFS caches a large chunk of changes and then writes
> all of them at once. I doubt blocks can be preallocated. You preallocate
> a block, it's marked as used in the file system's metadata, and the metadata
> changes are written to disk -- this results in inconsistency because the
> preallocated block is marked as "used" in the metadata and thus can't
> be overwritten. I might be absolutely wrong; ZFS experts are
> better placed to answer this. Grepping reveals no fallocate support in ZFS.
>> >> 2) is this naive implementation useful enough to serve as a default
>> >> for all filesystems until someone with more knowledge fills them in?
>> > The mailing list ate the patch. Only the man page was attached.
> What was the performance impact on copying large files?
I don't know and I don't care. :-) Specifically, one problem is that
there is no file-system implementation of "copy"; copy is implemented
in userspace with read(2) then write(2).
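To make the read(2)/write(2) point concrete, here is a minimal userspace
sketch of such a copy loop. The function name copy_fd and the 64 KiB buffer
size are illustrative assumptions, not taken from cp(1) or from the patch;
nothing in this loop tells the filesystem that a whole file is being copied.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy everything readable from in_fd to out_fd
 * using only read(2) and write(2). Returns 0 on success, -1 on error
 * (errno is left as set by the failing call).
 */
int copy_fd(int in_fd, int out_fd)
{
    char buf[64 * 1024];    /* buffer size is an arbitrary choice */

    assert(in_fd >= 0 && out_fd >= 0);
    for (;;) {
        ssize_t n = read(in_fd, buf, sizeof(buf));
        if (n == 0)
            return 0;       /* end of input */
        if (n < 0) {
            if (errno == EINTR)
                continue;   /* interrupted, retry the read */
            return -1;
        }
        /* write(2) may be short, so loop until the chunk is out. */
        for (ssize_t off = 0; off < n; ) {
            ssize_t w = write(out_fd, buf + off, (size_t)(n - off));
            if (w < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            off += w;
        }
    }
}
```

The filesystem only ever sees a stream of independent writes here, which is
why any cost or benefit of preallocation shows up per write rather than per
copy.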
If the caller says posix_fallocate() then they want blocks. If
copying a large file is slower after that, well, they asked for it.
This implementation only meets the spec; it's not meant to be optimal.
An optimal VOP_WRITE() implementation may check that e.g. the next
block on write is all zero, and so will make a new logical-zero block
in the same manner as VOP_ALLOCATE. This is up to each filesystem.
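As a rough illustration of the "is the next block all zero" check, a
sketch could look like the following. This is plain userspace C with a
hypothetical name (block_is_zero); a real filesystem would do the
equivalent inside its own write path and decide for itself what to do
with the answer.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: return 1 if all len bytes at buf are zero,
 * 0 otherwise. An empty range counts as all zero.
 */
int block_is_zero(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        if (buf[i] != 0)
            return 0;   /* found real data, block must be written */
    }
    return 1;
}
```

A write path that gets 1 back could record a logical-zero block instead of
pushing a buffer of zeros through the caches; whether that is worth the
scan is a per-filesystem decision.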
> I had sparse file support in PEFS implemented in a similar way.
posix_fallocate() is specifically to *not* have a sparse file.
> Performance was terrible: the vm
> and buf caches were saturated first by writing huge chunks of zeros and
> then by mmap'ing and writing the actual data. sched_yield() and/or vnode
> lock/unlock didn't improve interactive performance either.
> Why wouldn't you just call VOP_SETATTR(newsize) in the dumb implementation?
> File systems expect such behavior for files; cp has been using mmap for a while.
VOP_SETATTR(newsize) could truncate, e.g. if the file is already large
and sparse and the posix_fallocate(2) call was only meant to guarantee
storage for the first 1MB.
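The difference can be put as simple arithmetic on the resulting file size.
The two function names below are hypothetical and only model the size
semantics: setting the size to offset+len versus the posix_fallocate()
rule that the size grows only when offset+len is beyond the current end
of file.

```c
#include <assert.h>

/*
 * Hypothetical model of "just set the size": the file ends up at
 * offset + len, even if it was larger before (i.e. it is truncated).
 */
long long size_after_setattr(long long cur_size, long long offset,
    long long len)
{
    assert(cur_size >= 0 && offset >= 0 && len > 0);
    (void)cur_size;     /* the old size is ignored entirely */
    return offset + len;
}

/*
 * Model of the posix_fallocate() size rule: the file is extended to
 * offset + len if that is past EOF, otherwise the size is unchanged.
 */
long long size_after_allocate(long long cur_size, long long offset,
    long long len)
{
    long long end;

    assert(cur_size >= 0 && offset >= 0 && len > 0);
    end = offset + len;
    return end > cur_size ? end : cur_size;
}
```

For a sparse file of 1GB and a request covering the first 1MB, the first
model yields 1MB, losing the rest of the file, while the second leaves the
size at 1GB.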