current fd allocation idiom

Fri Jul 18 16:00:05 UTC 2014

On Fri, Jul 18, 2014 at 04:40:12PM +0200, Mateusz Guzik wrote:
> On Fri, Jul 18, 2014 at 04:06:29PM +0300, Konstantin Belousov wrote:
> > On Fri, Jul 18, 2014 at 01:55:38AM +0200, Mateusz Guzik wrote:
> > > The kernel has to deal with concurrent attempts to open, close, dup2 and
> > > use the same file descriptor.
> > > 
> > > I start with stating a rather esoteric bug affecting this area, then I
> > > follow with a short overview of what is happening in general and a
> > > proposal how to change to get rid of the bug and get some additional
> > > enchancements. Interestingly enough turns out Linux is doing pretty much
> > > the same thing.
> > > 
> > > ============================
> > > THE BUG:
> > > /*
> > >  * Initialize the file pointer with the specified properties.
> > >  *
> > >  * The ops are set with release semantics to be certain that the flags, type,
> > >  * and data are visible when ops is.  This is to prevent ops methods from being
> > >  * called with bad data.
> > >  */
> > > void
> > > finit(struct file *fp, u_int flag, short type, void *data, struct fileops *ops)
> > > {
> > >         fp->f_data = data;
> > >         fp->f_flag = flag;
> > >         fp->f_type = type;
> > >         atomic_store_rel_ptr((volatile uintptr_t *)&fp->f_ops, (uintptr_t)ops);
> > > }
> > > 
> > > This itself is fine, but it assumes all code obtaining fp from fdtable places
> > > a read memory barrier after reading f_ops and before reading anything else.
> > > As you could guess no code does that and I don't believe placing rmb's
> > > in several new places is the way to go.
> > I think your analysis is correct for all cases except kern_openat().
> > 
> > For kern_openat(), we install a file into fdtable only after the struct
> > file is fully initialized, see kern_openat(). The file is allocated
> > with falloc_noinstall(), then file is initialized, then finstall() does
> > FILEDESC_LOCK/UNLOCK, which ensures the full barrier. Other file
> > accessors do fget_unlocked(), which does acquire (of fp->f_count, but
> > this does not matter).
> > 
> > The only critical constraint is that other accessors must not see
> > f_ops != badfileops while struct file is not fully visible.  IMO,
> > the full barrier in finstall() and acquire in fget*() guarantee
> > that we see badfileops until writes to other members are visible.
> > 
> 
> I agree, but I doubt everything can be converted to this model (see below).
> 
> > For falloc(), indeed the write to f_ops could become visible too early
> > (but not on x86).
> > 
> > > 
> > > ============================
> > > GENERAL OVERVIEW OF CURRENT STATE:
> > > 
> > > What readers need to do:
> > > - rmb() after reading fp_ops
> > > - check fp_ops for badfileops
> > How can readers see badfileops ?
> > 
> 
> Not sure what you mean. fp is installed with badfileops, anything
> accessing fdtable before finit finishes will see this.
I referenced falloc_noinstall().

> 
> > > 
> > > ============================
> > > PROPOSAL:
> > > 
> > > struct file *fp;
> > > int fd;
> > > 
> > > if (error = falloc(&fp, &fd))
> > Use falloc_noinstall() there.
> > 
> > > 	return (error);
> > > if (error = something(....)) {
> > > 	fdclose(fp, fd);
> > > 	return (error);
> > > }
> > > 
> > > finit(fp, ....);
> > > factivate(fd, fp);
> > This function is spelled finstall().
> > 
> > It seems that all what is needed is conversion of places using
> > falloc() to falloc_noinstall()/finstall().
> > 
> 
> This postpones fd allocation to after interested function did all work
> it wanted to do, which means we would need reliable ways of reverting
> all the work in case allocation failed. I'm not so confident we can do
> that for all current consumers and both current and my proposed approach
> don't impose such requirement.
Cleanup should be identical to the actions done on close(2).

> 
> Of course postponing fd allocation where possible is definitely worth
> doing.
Yes, and after that the rest of the cases should be evaluated.
But my gut feeling is that everything would be converted.

> 
> For cases where it is not possible/feasible, the only problem we have is
> making sure we update fd entry in proper table in factivate.
> 
> The easiest solution is to FILEDESC_XLOCK, but this may have measurable
> impact. We can get away with FILEDESC_SLOCK just fine, but this still
> writes and may ping-pong with other cpus.
> 
> This is another place where we could just plop sequence counters.
> 
> fdgrowtable would +/-:
> seq_write_begin(&fdp->fd_tbl_seq);
> memcpy(....);
> assign the new pointer
> seq_end_begin(&fdp->fd_tbl_seq);
> 
> Then factivate would be +/-:
> 
> do {
> 	fdtable = __READ_ONCE(..., fdp->fd_tbl);
> 	rmb();
> 	seq = seq_read(fdp->fd_tbl_seq);
> 	fdtable[fd] = fp;
> while (!seq_consistent(fdp->fd_tbl_seq, seq));
> 
> This in worst case updates old fdtable (which we never free so it is
> harmless) and retries. seq_read never returns seq in 'modify' state,
> and we check that seq is the same before and after the operation. If the
> condition is met there was no concurrent fdtable copy.
> 
> This also means that table readers can 'suddenly' get fps, but this
> should be fine. NULL fp is used to denote unused fd. If fp is non NULL,
> it is safe for use. If it is NULL, fd is ignored which should not matter.
> 
> Functions like fdcopy just have to make sure they read fp once.
> 
> -- 
> Mateusz Guzik <mjguzik gmail.com>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 819 bytes
Desc: not available
URL: <http://lists.freebsd.org/pipermail/freebsd-arch/attachments/20140718/c35f3b3c/attachment.sig>