final decision about *at syscalls

Kostik Belousov kostikbel at gmail.com
Mon Apr 14 11:52:39 UTC 2008


[I corrected Robert email, it seems that he still does not have an
account on the FreeBSD.garage.freebsd.pl]

On Mon, Apr 14, 2008 at 10:19:18AM +0200, Pawel Jakub Dawidek wrote:
> On Sat, Apr 12, 2008 at 04:16:04PM +0300, Kostik Belousov wrote:
> > On Sat, Apr 12, 2008 at 01:20:19PM +0200, Pawel Jakub Dawidek wrote:
> > > On Thu, Dec 20, 2007 at 11:38:55AM -0500, John Baldwin wrote:
> > > > On Tuesday 18 December 2007 04:22:22 am Roman Divacky wrote:
> > > > > Dear arch@
> > > > > 
> > > > > Over this summer I was working (among other things) on *at family of syscalls
> > > > > kindly sponsored by Google (in their Summer of Code). The resulting patch is 
> > > > > almost finished but I need to decide one design question. If you are not interested 
> > > > > in *at/namei feel free to skip this mail.
> > > > > 
> > > > > The *at syscalls are a threads-oriented extension to basic file syscalls (think
> > > > > of open(), fstat(), etc.) adding the possibility to specify from where the search
> > > > > for relative path should start.
> > > > > 
> > > > > image that we have /tmp/foo/bar
> > > > > 
> > > > > and CWD is set to "/tmp/", and the process has opened "foo" as dirfd. with ordinary
> > > > > open() syscall you have to either
> > > > > 
> > > > > 	chdir("/tmp/foo");open("./bar");
> > > > > 
> > > > > or
> > > > > 
> > > > > 	open("/tmp/foo/bar");
> > > > > 
> > > > > The first approach is problematic because it changes CWD for all threads in the process,
> > > > > the second is prone to race-conditions as some of the components of the path can
> > > > > change in parallel with the "open".
> > > > > 
> > > > > So POSIX introduced a new API, called "Extended API set part 2, ISBN: 1-931624-67-4" (at
> > > > > least this was the latest when I looked last time), which solves that by introducing "*at"
> > > > > syscalls that supply an fd of previously opened directory which is used instead of CWD
> > > > > for searching relative path, ie. the previous example becomes
> > > > > 
> > > > >    dirfd = open("/tmp/foo"); openat("foo", dirfd);
> > > > > 
> > > > > I implemented the whole API as native FreeBSD syscalls + in linuxulator emulation layer.
> > > > > Here's the problem:
> > > > > 
> > > > > There are two approaches to the name translation from "filedescriptor" to the "vnode".
> > > > > 
> > > > > 1) we can do it in the kern_fooat() syscall and pass namei() the resulting vnode
> > > > > 2) we can pass namei() the filedescriptor and do the translation there
> > > > > 
> > > > > PROs of #1:
> > > > > 
> > > > > 	o	namei() does not need to know about the curthread, you can use this *at
> > > > > 		ability for different purposes, it's cleaner (imho)
> > > > > 
> > > > > PROs of #2
> > > > > 
> > > > > 	o	raceless implementation
> > > > > 	o	no code duplication
> > > > > 
> > > > > CONs of #1
> > > > > 
> > > > > 	o	some very small code duplication (the translation is done in every 
> > > > > 		kern_fooat() function)
> > > > > 	o	there is a race between the name translation and the actual use of the result
> > > > > 		of the translation that needs to be handled, the "path_to_file" string is copied
> > > > > 		to the kernel space twice hence a race
> > > > > 
> > > > > CONs of #2
> > > > > 
> > > > > 	o	namei is made thread dependant		
> > > > > 
> > > > > Please tell me what approach you like more. I personally favour #1 because I don't like namei()
> > > > > being thread dependant, Kostik Belousov prefers #2.
> > > > 
> > > > Considering Robert's paper on security race problems in things like systrace
> > > > stemming from when you copy parameters out of userland and into the kernel
> > > > multiple times, I think #2 is definitely the better choice.  Also, namei() is
> > > > already thread aware AFAICT since 'struct componentname' already contains a
> > > > 'cnp_thread' member (was 'cnp_proc' in 4.x).
> > > 
> > > It looks like I'm a bit too late, but anyway...
> > > 
> > > From what you write John, #1 is a better choice than #2. If you want to
> > > avoid races, you can pass already locked vnode. In case of file
> > > descriptors, if p_fd is not locked another thread can close and open
> > > different directory under the same descriptor number.
> > This is the application imposed race, not the externally imposed one.
> > Moreover, I would argue that this is application error.
> 
> Right, this will be an application bug and the race can occur even
> before entering the kernel. What I'm saying is that this is more racy
> than vnode approach (at least is not less racy).
> 
> > > I also need such functionality for recent ZFS and #2 makes it impossible
> > > to use it. NDINIT_AT() is kernel (VFS) API so it should operate on
> > > vnodes, not file descriptor numbers, IMHO.
> > Following the same arguments, NDINIT() shall not operate on the pathes
> > too.
> 
> We both know that we have to convert path to vnode somehow, but you
> can't disagree that VFS API doesn't operate on file descriptors in
> general, but on vnodes.
> 
> > > For completness can you Kostik and Robert provide your arguments against
> > > #1?
> > 
> > The #2 was already committed.
> > The #1 caused a code duplication that was quite error-prone.
> > 
> > What are your problems with the #2 ?
> 
> Take a look at perforce change @139873. You can find there how I was
> using #1.

Lets distinguish between the desired KPI and the *at implementation.

Would the patch below do what you want ? NDINIT_ATVP takes the vnode
that is assumed to be referenced, and unconditionally consumes the
reference. The vnode is taken for the start of the path translation iff
the path is relative.

Untested.

diff --git a/sys/kern/vfs_lookup.c b/sys/kern/vfs_lookup.c
index 0185158..c8a9860 100644
--- a/sys/kern/vfs_lookup.c
+++ b/sys/kern/vfs_lookup.c
@@ -193,14 +193,21 @@ namei(struct nameidata *ndp)
 	ndp->ni_rootdir = fdp->fd_rdir;
 	ndp->ni_topdir = fdp->fd_jdir;
 
-	if (cnp->cn_pnbuf[0] != '/' && ndp->ni_dirfd != AT_FDCWD) {
-		error = fgetvp(td, ndp->ni_dirfd, &dp);
-		FILEDESC_SUNLOCK(fdp);
-		if (error == 0 && dp->v_type != VDIR) {
-			vfslocked = VFS_LOCK_GIANT(dp->v_mount);
-			vrele(dp);
-			VFS_UNLOCK_GIANT(vfslocked);
-			error = ENOTDIR;
+	dp = NULL;
+	if (cnp->cn_pnbuf[0] != '/') {
+		if (ndp->ni_startdir != NULL) {
+			dp = ndp->ni_startdir;
+			error = 0;
+		} else if (ndp->ni_dirfd != AT_FDCWD)
+			error = fgetvp(td, ndp->ni_dirfd, &dp);
+		if (error != 0 || dp != NULL) {
+			FILEDESC_SUNLOCK(fdp);
+			if (error == 0 && dp->v_type != VDIR) {
+				vfslocked = VFS_LOCK_GIANT(dp->v_mount);
+				vrele(dp);
+				VFS_UNLOCK_GIANT(vfslocked);
+				error = ENOTDIR;
+			}
 		}
 		if (error) {
 			uma_zfree(namei_zone, cnp->cn_pnbuf);
@@ -210,10 +217,16 @@ namei(struct nameidata *ndp)
 #endif
 			return (error);
 		}
-	} else {
+	}
+	if (dp == NULL) {
 		dp = fdp->fd_cdir;
 		VREF(dp);
 		FILEDESC_SUNLOCK(fdp);
+		if (ndp->ni_startdir != NULL) {
+			vfslocked = VFS_LOCK_GIANT(ndp->ni_startdir->v_mount);
+			vrele(ndp->ni_startdir);
+			VFS_UNLOCK_GIANT(vfslocked);
+		}
 	}
 	vfslocked = VFS_LOCK_GIANT(dp->v_mount);
 	for (;;) {
diff --git a/sys/sys/namei.h b/sys/sys/namei.h
index 121f014..5d1c1ea 100644
--- a/sys/sys/namei.h
+++ b/sys/sys/namei.h
@@ -150,14 +150,19 @@ struct nameidata {
  * Initialization of a nameidata structure.
  */
 #define	NDINIT(ndp, op, flags, segflg, namep, td)	\
-	NDINIT_AT(ndp, op, flags, segflg, namep, AT_FDCWD, td)
+	NDINIT_ALL(ndp, op, flags, segflg, namep, AT_FDCWD, NULL, td)
+#define	NDINIT_AT(ndp, op, flags, segflg, namep, dirfd, td)	\
+	NDINIT_ALL(ndp, op, flags, segflg, namep, dirfd, NULL, td)
+#define	NDINIT_ATVP(ndp, op, flags, segflg, namep, vp, td)	\
+	NDINIT_ALL(ndp, op, flags, segflg, namep, AT_FDCWD, vp, td)
 
 static __inline void
-NDINIT_AT(struct nameidata *ndp,
+NDINIT_ALL(struct nameidata *ndp,
 	u_long op, u_long flags,
 	enum uio_seg segflg,
 	const char *namep,
 	int dirfd,
+	struct vnode *startdir,
 	struct thread *td)
 {
 	ndp->ni_cnd.cn_nameiop = op;
@@ -165,6 +170,7 @@ NDINIT_AT(struct nameidata *ndp,
 	ndp->ni_segflg = segflg;
 	ndp->ni_dirp = namep;
 	ndp->ni_dirfd = dirfd;
+	ndp->ni_startdir = startdir;
 	ndp->ni_cnd.cn_thread = td;
 }
 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 195 bytes
Desc: not available
Url : http://lists.freebsd.org/pipermail/freebsd-arch/attachments/20080414/dc0e9302/attachment.pgp


More information about the freebsd-arch mailing list