f_offset
Jeff Roberson
jroberson at jroberson.net
Sat Apr 12 23:49:56 UTC 2008
So I'm in the midst of working on other filesystem concurrency issues and
that has brought me back around to f_offset again. I'm working on a
method to allow non-overlapping writes and reads to proceed concurrently
to the same file. This means the exclusive vnode lock cannot be used to
protect f_offset even in the write case.
To maintain the existing semantics I'm simply going to add an exclusive
sx_xlock() around access to f_offset. Access is synchronized
inconsistently today, which is mostly fine because in most cases the
unsynchronized updates are user-space races. However, f_offset is 64 bits
wide and cannot be written atomically on 32-bit systems, so it requires
some extra synchronization there.
The sx lock will nearly double the size of struct file. Although struct
file has lost some weight in 8.0, that is quite unfortunate. However, the
method of using LOCKED & WAITING flags, msleep and a mutex has ruined
performance in too many cases to continue using it.
It's worth discussing what POSIX actually guarantees for f_offset, as well
as what other operating systems do. POSIX does not guarantee any behavior
with simultaneous access. Multiple readers may read the same position in
the file concurrently and update the position to different offsets.
Multiple writers may write to the same file location, although the I/O
itself should be serialized by some other means. POSIX allows, and
Solaris, Linux, and historic implementations use, the following handling
of f_offset:
    off = fp->f_offset;
    lock(vnode);
    vn_rdwr();
    unlock(vnode);
    fp->f_offset = uio->uio_offset;
What we implement is much stricter. It is essentially this:
    lock(offset);
    off = fp->f_offset;
    lock(vnode);
    vn_rdwr();
    unlock(vnode);
    fp->f_offset = uio->uio_offset;
    unlock(offset);
We provide the following extra guarantees:
1) Multiple readers will never see overlapping segments of the file
2) Multiple writers will never write to overlapping segments of the file
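As a rough userspace illustration of the stricter scheme (not the kernel
code), the sketch below has several threads append through a shared
offset. The names struct xfile, strict_write() and run_strict_writers()
are made up for this example, and pthread mutexes stand in for the offset
lock and the vnode lock. Because the offset lock is held from the read of
f_offset through its update, every byte should end up written exactly
once.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <string.h>

#define NTHREADS 4
#define NWRITES  1000
#define CHUNK    8

/* Hypothetical stand-in for struct file; only what the sketch needs. */
struct xfile {
	pthread_mutex_t	off_lock;	/* stands in for the offset lock */
	pthread_mutex_t	vn_lock;	/* stands in for the vnode lock */
	int64_t		f_offset;
	unsigned char	data[NTHREADS * NWRITES * CHUNK];
	unsigned char	hits[NTHREADS * NWRITES * CHUNK]; /* writes per byte */
};

struct xfile demo_file;

/* Strict path: the offset lock covers the read, the I/O and the update. */
void
strict_write(struct xfile *fp, const unsigned char *buf, size_t len)
{
	int64_t off;
	size_t i;

	pthread_mutex_lock(&fp->off_lock);
	off = fp->f_offset;
	assert((size_t)off + len <= sizeof(fp->data));
	pthread_mutex_lock(&fp->vn_lock);
	memcpy(fp->data + off, buf, len);	/* plays the role of vn_rdwr() */
	for (i = 0; i < len; i++)
		fp->hits[off + i]++;
	pthread_mutex_unlock(&fp->vn_lock);
	fp->f_offset = off + (int64_t)len;
	pthread_mutex_unlock(&fp->off_lock);
}

void *
writer(void *arg)
{
	unsigned char buf[CHUNK];
	int i;

	memset(buf, (int)(intptr_t)arg, sizeof(buf));
	for (i = 0; i < NWRITES; i++)
		strict_write(&demo_file, buf, sizeof(buf));
	return (NULL);
}

/* Returns how many bytes were not written exactly once (0 is the goal). */
int
run_strict_writers(void)
{
	pthread_t td[NTHREADS];
	int bad = 0;
	size_t i;

	memset(&demo_file, 0, sizeof(demo_file));
	pthread_mutex_init(&demo_file.off_lock, NULL);
	pthread_mutex_init(&demo_file.vn_lock, NULL);
	for (i = 0; i < NTHREADS; i++)
		pthread_create(&td[i], NULL, writer, (void *)(intptr_t)(i + 1));
	for (i = 0; i < NTHREADS; i++)
		pthread_join(td[i], NULL);
	for (i = 0; i < sizeof(demo_file.hits); i++)
		if (demo_file.hits[i] != 1)
			bad++;
	return (bad);
}
```

Dropping the off_lock calls from strict_write() gives the relaxed ordering
shown earlier, where two writers can read the same f_offset and land on
the same range.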
McKusick changed the behavior in 1986, I would guess for an rforked
process. There is some test code in this fairly interesting LKML thread
where they discuss the problem in Linux:
http://lkml.org/lkml/2006/4/12/227
Simply having multiple threads write to stdout redirected to a file on
Linux is enough to lose or corrupt output.
I believe it is worth retaining the write guarantee. However, I believe
the read guarantees are simply a side-effect of the original
implementation of the write fix.
I will probably commit a patch to add the sx lock with exclusive behavior
to start. We need to at least protect 64-bit access on 32-bit machines in
lseek(), which we don't do today. Beyond that I think we can relax the
read restriction and allow f_offset readers to operate locklessly and only
serialize writers. For this to work it would be nice if we had an MD way
to write 64 bits atomically that simply acquired a lock on 32-bit
platforms lacking something like cmpxchg8b, or on UP just did the write
with interrupts disabled.
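To make the idea concrete, here is a userspace sketch using C11 atomics
rather than any existing kernel interface. The struct off64 type and the
off64_*() functions are hypothetical names for this example. Where the
compiler reports 64-bit atomics as always lock-free, the store and load
are plain relaxed atomic operations; otherwise the sketch falls back to a
mutex. In that fallback readers take the lock too, which is an assumption
of this sketch and not the lockless read path described above.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical 64-bit offset cell; names are for illustration only. */
struct off64 {
#if ATOMIC_LLONG_LOCK_FREE == 2
	_Atomic long long	val;	/* native 64-bit atomic access */
#else
	pthread_mutex_t		lock;	/* fallback: serialize every access */
	long long		val;
#endif
};

void
off64_init(struct off64 *o, int64_t v)
{
#if ATOMIC_LLONG_LOCK_FREE == 2
	atomic_init(&o->val, (long long)v);
#else
	pthread_mutex_init(&o->lock, NULL);
	o->val = (long long)v;
#endif
}

/* Store all 64 bits so that no reader can observe a torn value. */
void
off64_store(struct off64 *o, int64_t v)
{
#if ATOMIC_LLONG_LOCK_FREE == 2
	atomic_store_explicit(&o->val, (long long)v, memory_order_relaxed);
#else
	pthread_mutex_lock(&o->lock);
	o->val = (long long)v;
	pthread_mutex_unlock(&o->lock);
#endif
}

/* Load all 64 bits; lock-free only where the platform supports it. */
int64_t
off64_load(struct off64 *o)
{
#if ATOMIC_LLONG_LOCK_FREE == 2
	return ((int64_t)atomic_load_explicit(&o->val, memory_order_relaxed));
#else
	long long v;

	pthread_mutex_lock(&o->lock);
	v = o->val;
	pthread_mutex_unlock(&o->lock);
	return ((int64_t)v);
#endif
}
```

Relaxed ordering is used because the only goal here is to avoid torn
64-bit values; ordering against the I/O itself would still come from
whatever serializes writers.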
Comments?
Thanks,
Jeff