f_offset

Sat Apr 12 23:49:56 UTC 2008

So I'm in the midst of working on other filesystem concurrency issues and 
that has brought me back around to f_offset again.  I'm working on a 
method to allow non-overlapping writes and reads to proceed concurrently 
to the same file.  This means the exclusive vnode lock can not be used to 
protect f_offset even in the write case.

To maintain the existing semantics I'm simply going to add an sx lock, 
taken exclusively with sx_xlock(), around access to f_offset.  Locking of 
f_offset is inconsistent today, which is mostly fine because in most 
cases the unprotected updates are only user-space races.  However, 
f_offset is 64 bits wide and cannot be written atomically on 32-bit 
systems, so it requires some extra synchronization there.

The sx lock will nearly double the size of struct file.  Although struct 
file has lost some weight in 8.0, that is quite unfortunate.  However, the 
method of using LOCKED & WAITING flags, msleep and a mutex has ruined 
performance in too many cases to continue using it.

It's worth discussing what POSIX actually guarantees for f_offset as well 
as what other operating systems do.  POSIX does not guarantee any 
behavior with simultaneous access.  Multiple readers may read the same 
position in the file concurrently and update the position to different 
offsets.  Multiple writers may write to the same file location, although 
the I/O should be serialized by some other means.  POSIX allows for, and 
Solaris, Linux, and historic implementations of f_offset work in, the 
following way:

off = fp->f_offset;
lock(vnode);
vn_rdwr();
unlock(vnode);
fp->f_offset = uio->uio_offset;

What we implement is much stricter.  It is essentially this:

lock(offset);
off = fp->f_offset;
lock(vnode);
vn_rdwr();
unlock(vnode);
fp->f_offset = uio->uio_offset;
unlock(offset);

We provide the following extra guarantees:
1)  Multiple readers will never see overlapping segments of the file
2)  Multiple writers will never write to overlapping segments of the file
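
As a rough user-space illustration of the stricter scheme (not the kernel 
code), the sketch below models the two locks with pthread mutexes.  The 
names struct model_file, model_init() and model_write() are made up for 
this example, and a memcpy() into a fixed buffer stands in for vn_rdwr().

```c
#include <pthread.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical user-space stand-in for struct file; not the kernel layout. */
struct model_file {
	pthread_mutex_t	off_lock;	/* plays the role of lock(offset) */
	pthread_mutex_t	vn_lock;	/* plays the role of lock(vnode) */
	int64_t		f_offset;
	char		data[4096];	/* stands in for the file contents */
};

void
model_init(struct model_file *fp)
{
	pthread_mutex_init(&fp->off_lock, NULL);
	pthread_mutex_init(&fp->vn_lock, NULL);
	fp->f_offset = 0;
	memset(fp->data, 0, sizeof(fp->data));
}

/*
 * Strict write: the offset is read, used and advanced as one unit, so two
 * writers can never be handed the same starting offset.
 * Returns the offset the write started at, or -1 if it would not fit.
 */
int64_t
model_write(struct model_file *fp, const char *buf, size_t len)
{
	int64_t off;

	pthread_mutex_lock(&fp->off_lock);
	off = fp->f_offset;
	if (off < 0 || (uint64_t)off + len > sizeof(fp->data)) {
		pthread_mutex_unlock(&fp->off_lock);
		return (-1);
	}
	pthread_mutex_lock(&fp->vn_lock);
	memcpy(fp->data + off, buf, len);	/* the vn_rdwr() step */
	pthread_mutex_unlock(&fp->vn_lock);
	fp->f_offset = off + (int64_t)len;
	pthread_mutex_unlock(&fp->off_lock);
	return (off);
}
```

Because off_lock is held across the whole operation, each caller gets a 
distinct, non-overlapping range.  Taking only vn_lock would give the 
looser historic behavior, where two callers can start from the same 
offset.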

McKusick changed the behavior in 1986, I would guess for an rforked 
process.  There is some test code in this fairly interesting LKML thread 
where they discuss the problem in Linux:

http://lkml.org/lkml/2006/4/12/227

Simply having multiple threads write to stdout redirected to a file on 
Linux is enough to lose or corrupt output.

I believe it is worth retaining the write guarantee.  However, I believe 
the read guarantees are simply a side-effect of the original 
implementation of the write fix.

I will probably commit a patch to add the sx with exclusive behavior to 
start.  We need to at least protect 64-bit access on 32-bit machines in 
lseek(), which we don't today.  Beyond that I think we can relax the read 
restriction and allow f_offset readers to operate locklessly and only 
serialize writers.  For this to work it would be nice if we had an MD way 
to write 64 bits atomically that simply acquired a lock on 32-bit 
platforms without something like cmpxchg8b, or on UP just did the write 
with interrupts disabled.
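
To make the idea concrete, here is a minimal user-space sketch using C11 
<stdatomic.h> rather than any kernel interface.  The wrapper names 
(struct off64, off64_store(), off64_load(), off64_is_lock_free()) are 
hypothetical.  The assumption is that where the hardware has no native 
64-bit atomic store the C implementation falls back to an internal lock, 
which is roughly the behavior wished for above.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical wrapper; it only illustrates the shape of such an API. */
struct off64 {
	_Atomic int64_t	val;
};

static inline void
off64_store(struct off64 *o, int64_t v)
{
	/* One indivisible store; the implementation may lock internally. */
	atomic_store_explicit(&o->val, v, memory_order_relaxed);
}

static inline int64_t
off64_load(struct off64 *o)
{
	/* Readers take no explicit lock and never see a half-written value. */
	return (atomic_load_explicit(&o->val, memory_order_relaxed));
}

static inline int
off64_is_lock_free(struct off64 *o)
{
	/* Nonzero when the platform can do this without a hidden lock. */
	return (atomic_is_lock_free(&o->val));
}
```

Relaxed ordering is used because only the tearing of the 64-bit value is 
at issue here.  Ordering against the I/O itself would still have to come 
from the vnode or offset locks.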

Comments?

Thanks,
Jeff