final decision about *at syscalls

Pawel Jakub Dawidek pjd at FreeBSD.org
Sat Apr 12 11:47:34 UTC 2008


On Thu, Dec 20, 2007 at 11:38:55AM -0500, John Baldwin wrote:
> On Tuesday 18 December 2007 04:22:22 am Roman Divacky wrote:
> > Dear arch@
> > 
> > Over this summer I was working (among other things) on *at family of syscalls
> > kindly sponsored by Google (in their Summer of Code). The resulting patch is 
> > almost finished but I need to decide one design question. If you are not interested 
> > in *at/namei feel free to skip this mail.
> > 
> > The *at syscalls are a threads-oriented extension to basic file syscalls (think
> > of open(), fstat(), etc.) adding the possibility to specify from where the search
> > for relative path should start.
> > 
> > image that we have /tmp/foo/bar
> > 
> > and CWD is set to "/tmp/", and the process has opened "foo" as dirfd. with ordinary
> > open() syscall you have to either
> > 
> > 	chdir("/tmp/foo");open("./bar");
> > 
> > or
> > 
> > 	open("/tmp/foo/bar");
> > 
> > The first approach is problematic because it changes CWD for all threads in the process,
> > the second is prone to race-conditions as some of the components of the path can
> > change in parallel with the "open".
> > 
> > So POSIX introduced a new API, called "Extended API set part 2, ISBN: 1-931624-67-4" (at
> > least this was the latest when I looked last time), which solves that by introducing "*at"
> > syscalls that supply an fd of previously opened directory which is used instead of CWD
> > for searching relative path, ie. the previous example becomes
> > 
> >    dirfd = open("/tmp/foo"); openat("foo", dirfd);
> > 
> > I implemented the whole API as native FreeBSD syscalls + in linuxulator emulation layer.
> > Here's the problem:
> > 
> > There are two approaches to the name translation from "filedescriptor" to the "vnode".
> > 
> > 1) we can do it in the kern_fooat() syscall and pass namei() the resulting vnode
> > 2) we can pass namei() the filedescriptor and do the translation there
> > 
> > PROs of #1:
> > 
> > 	o	namei() does not need to know about the curthread, you can use this *at
> > 		ability for different purposes, it's cleaner (imho)
> > 
> > PROs of #2
> > 
> > 	o	raceless implementation
> > 	o	no code duplication
> > 
> > CONs of #1
> > 
> > 	o	some very small code duplication (the translation is done in every 
> > 		kern_fooat() function)
> > 	o	there is a race between the name translation and the actual use of the result
> > 		of the translation that needs to be handled, the "path_to_file" string is copied
> > 		to the kernel space twice hence a race
> > 
> > CONs of #2
> > 
> > 	o	namei is made thread dependant		
> > 
> > Please tell me what approach you like more. I personally favour #1 because I don't like namei()
> > being thread dependant, Kostik Belousov prefers #2.
> 
> Considering Robert's paper on security race problems in things like systrace
> stemming from when you copy parameters out of userland and into the kernel
> multiple times, I think #2 is definitely the better choice.  Also, namei() is
> already thread aware AFAICT since 'struct componentname' already contains a
> 'cnp_thread' member (was 'cnp_proc' in 4.x).

It looks like I'm a bit too late, but anyway...

From what you write John, #1 is a better choice than #2. If you want to
avoid races, you can pass already locked vnode. In case of file
descriptors, if p_fd is not locked another thread can close and open
different directory under the same descriptor number.

I also need such functionality for recent ZFS and #2 makes it impossible
to use it. NDINIT_AT() is kernel (VFS) API so it should operate on
vnodes, not file descriptor numbers, IMHO.

For completness can you Kostik and Robert provide your arguments against
#1?

-- 
Pawel Jakub Dawidek                       http://www.wheel.pl
pjd at FreeBSD.org                           http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 187 bytes
Desc: not available
Url : http://lists.freebsd.org/pipermail/freebsd-arch/attachments/20080412/7494528c/attachment.pgp


More information about the freebsd-arch mailing list