final decision about *at syscalls

Pawel Jakub Dawidek pjd at FreeBSD.org
Mon Apr 14 08:19:38 UTC 2008


On Sat, Apr 12, 2008 at 04:16:04PM +0300, Kostik Belousov wrote:
> On Sat, Apr 12, 2008 at 01:20:19PM +0200, Pawel Jakub Dawidek wrote:
> > On Thu, Dec 20, 2007 at 11:38:55AM -0500, John Baldwin wrote:
> > > On Tuesday 18 December 2007 04:22:22 am Roman Divacky wrote:
> > > > Dear arch@
> > > > 
> > > > Over this summer I was working (among other things) on *at family of syscalls
> > > > kindly sponsored by Google (in their Summer of Code). The resulting patch is 
> > > > almost finished but I need to decide one design question. If you are not interested 
> > > > in *at/namei feel free to skip this mail.
> > > > 
> > > > The *at syscalls are a threads-oriented extension to basic file syscalls (think
> > > > of open(), fstat(), etc.) adding the possibility to specify from where the search
> > > > for relative path should start.
> > > > 
> > > > image that we have /tmp/foo/bar
> > > > 
> > > > and CWD is set to "/tmp/", and the process has opened "foo" as dirfd. with ordinary
> > > > open() syscall you have to either
> > > > 
> > > > 	chdir("/tmp/foo");open("./bar");
> > > > 
> > > > or
> > > > 
> > > > 	open("/tmp/foo/bar");
> > > > 
> > > > The first approach is problematic because it changes CWD for all threads in the process,
> > > > the second is prone to race-conditions as some of the components of the path can
> > > > change in parallel with the "open".
> > > > 
> > > > So POSIX introduced a new API, called "Extended API set part 2, ISBN: 1-931624-67-4" (at
> > > > least this was the latest when I looked last time), which solves that by introducing "*at"
> > > > syscalls that supply an fd of previously opened directory which is used instead of CWD
> > > > for searching relative path, ie. the previous example becomes
> > > > 
> > > >    dirfd = open("/tmp/foo"); openat("foo", dirfd);
> > > > 
> > > > I implemented the whole API as native FreeBSD syscalls + in linuxulator emulation layer.
> > > > Here's the problem:
> > > > 
> > > > There are two approaches to the name translation from "filedescriptor" to the "vnode".
> > > > 
> > > > 1) we can do it in the kern_fooat() syscall and pass namei() the resulting vnode
> > > > 2) we can pass namei() the filedescriptor and do the translation there
> > > > 
> > > > PROs of #1:
> > > > 
> > > > 	o	namei() does not need to know about the curthread, you can use this *at
> > > > 		ability for different purposes, it's cleaner (imho)
> > > > 
> > > > PROs of #2
> > > > 
> > > > 	o	raceless implementation
> > > > 	o	no code duplication
> > > > 
> > > > CONs of #1
> > > > 
> > > > 	o	some very small code duplication (the translation is done in every 
> > > > 		kern_fooat() function)
> > > > 	o	there is a race between the name translation and the actual use of the result
> > > > 		of the translation that needs to be handled, the "path_to_file" string is copied
> > > > 		to the kernel space twice hence a race
> > > > 
> > > > CONs of #2
> > > > 
> > > > 	o	namei is made thread dependant		
> > > > 
> > > > Please tell me what approach you like more. I personally favour #1 because I don't like namei()
> > > > being thread dependant, Kostik Belousov prefers #2.
> > > 
> > > Considering Robert's paper on security race problems in things like systrace
> > > stemming from when you copy parameters out of userland and into the kernel
> > > multiple times, I think #2 is definitely the better choice.  Also, namei() is
> > > already thread aware AFAICT since 'struct componentname' already contains a
> > > 'cnp_thread' member (was 'cnp_proc' in 4.x).
> > 
> > It looks like I'm a bit too late, but anyway...
> > 
> > From what you write John, #1 is a better choice than #2. If you want to
> > avoid races, you can pass already locked vnode. In case of file
> > descriptors, if p_fd is not locked another thread can close and open
> > different directory under the same descriptor number.
> This is the application imposed race, not the externally imposed one.
> Moreover, I would argue that this is application error.

Right, this will be an application bug and the race can occur even
before entering the kernel. What I'm saying is that this is more racy
than vnode approach (at least is not less racy).

> > I also need such functionality for recent ZFS and #2 makes it impossible
> > to use it. NDINIT_AT() is kernel (VFS) API so it should operate on
> > vnodes, not file descriptor numbers, IMHO.
> Following the same arguments, NDINIT() shall not operate on the pathes
> too.

We both know that we have to convert path to vnode somehow, but you
can't disagree that VFS API doesn't operate on file descriptors in
general, but on vnodes.

> > For completness can you Kostik and Robert provide your arguments against
> > #1?
> 
> The #2 was already committed.
> The #1 caused a code duplication that was quite error-prone.
> 
> What are your problems with the #2 ?

Take a look at perforce change @139873. You can find there how I was
using #1.

-- 
Pawel Jakub Dawidek                       http://www.wheel.pl
pjd at FreeBSD.org                           http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 187 bytes
Desc: not available
Url : http://lists.freebsd.org/pipermail/freebsd-arch/attachments/20080414/2197ee74/attachment.pgp


More information about the freebsd-arch mailing list