Fix MNAMELEN or reimplement struct statfs

Mon Nov 18 19:01:45 UTC 2013

On Sat, Nov 16, 2013 at 08:31:29PM +0200, Konstantin Belousov wrote:
| On Thu, Nov 14, 2013 at 05:08:54PM -0800, Doug Ambrisko wrote:
| > On Thu, Nov 14, 2013 at 09:32:17PM +0000, Jase Thew wrote:
| > | On 10/06/2013 16:52, John Baldwin wrote:
| > | > On Saturday, June 08, 2013 9:36:27 pm mdf at freebsd.org wrote:
| > | >> On Sat, Jun 8, 2013 at 3:52 PM, Dirk Engling <erdgeist at erdgeist.org> wrote:
| > | >>
| > | >>> The arbitrary value
| > | >>>
| > | >>> #define MNAMELEN        88              /* size of on/from name bufs */
| > | >>>
| > | >>> struct statfs {
| > | >>> [...]
| > | >>>         char    f_mntfromname[MNAMELEN];/* mounted filesystem */
| > | >>>         char    f_mntonname[MNAMELEN];  /* directory on which mounted */
| > | >>> };
| > | >>>
| > | >>> currently bites us when trying to use poudriere with errors like
| > | >>>
| > | >>> 'mount: tmpfs: File name too long'
| > | >>>
| > | >>>
| > | >>> /poudriere/data/build/91_RELEASE_amd64-REALLY-REALLY-LONG-
| > | > JAILNAME/ref/wrkdirs
| > | >>>
| > | >>> The topic has been discussed several times since 2004 and has been
| > | >>> postponed each time, the last time when it hit zfs users:
| > | >>>
| > | >>> http://lists.freebsd.org/pipermail/freebsd-fs/2010-March/007974.html
| > | >>>
| > | >>> So I'd like to point to the calendar, it's 2013 already and there's
| > | >>> still a static arbitrary (and way too low) limit in one of the core
| > | >>> areas of the vfs code.
| > | >>>
| > | >>> So I'd like to bump the issue and propose either making f_mntfromname a
| > | >>> dynamic allocation or just increase MNAMELEN, using 10.0 as water shed.
| > | >>>
| > | >>
| > | >> Gleb Kurtsou did this along with the ino64 GSoC project.  Unfortunately,
| > | >>  both he and I hit ENOTIME due to the job that pays the bills and it's
| > | >> never made it back to the main repository.
| > | >>
| > | >> IIRC, though, the only reason for doing it with 64-bit ino_t is that he'd
| > | >> already finished changing the stat/dirent ABI so what was one more.  I
| > | >> think he went with 1024 bytes, which also necessitated not allocating
| > | >> statfs on the stack for the kernel.
| > | > 
| > | > He also fixed a few other things since changing this ABI is so invasive
| > | > IIRC.  This really is the right fix for this.  Is it in an svn branch 
| > | > that can be updated and a new patch generated?
| > | > 
| > | 
| > | Hi folks,
| > | 
| > | Has there been any progress on addressing this issue? With the advent of
| > | pkgng / poudriere, this limitation is really becoming a frustrating problem.
| > 
| > I looked at NetBSD and what they did with statvfs.  The mount paths
| > lengths are bigger in NetBSD defines so that helps.  However, when
| > testing it out via a script that keep on doing a nullfs mount in 
| > every increasing directory depth I found that NetBSD would allow the
| > mount to exceed the value in statvfs.  When NetBSD populates the path
| > in statvfs they truncate it to what fits in statvfs.  So I looked at
| > what that might be like in FreeBSD.  So I came up with this simple patch:
| > 
| > --- /sys/kern/vfs_mount.c	2013-10-01 14:27:35.000000000 -0700
| > +++ vfs_mount.c	2013-10-21 14:20:19.000000000 -0700
| > @@ -656,7 +656,7 @@ vfs_donmount(struct thread *td, uint64_t
| >  	 * variables will fit in our mp buffers, including the
| >  	 * terminating NUL.
| >  	 */
| > -	if (fstypelen >= MFSNAMELEN - 1 || fspathlen >= MNAMELEN - 1) {
| > +	if (fstypelen >= MFSNAMELEN - 1 || fspathlen >= MAXPATHLEN - 1) {
| >  		error = ENAMETOOLONG;
| >  		goto bail;
| >  	}
| > @@ -748,8 +748,8 @@ sys_mount(td, uap)
| >  		return (EOPNOTSUPP);
| >  	}
| >  
| > -	ma = mount_argsu(ma, "fstype", uap->type, MNAMELEN);
| > -	ma = mount_argsu(ma, "fspath", uap->path, MNAMELEN);
| > +	ma = mount_argsu(ma, "fstype", uap->type, MFSNAMELEN);
| > +	ma = mount_argsu(ma, "fspath", uap->path, MAXPATHLEN);
| >  	ma = mount_argb(ma, flags & MNT_RDONLY, "noro");
| >  	ma = mount_argb(ma, !(flags & MNT_NOSUID), "nosuid");
| >  	ma = mount_argb(ma, !(flags & MNT_NOEXEC), "noexec");
| > @@ -1039,7 +1039,7 @@ vfs_domount(
| >  	 * variables will fit in our mp buffers, including the
| >  	 * terminating NUL.
| >  	 */
| > -	if (strlen(fstype) >= MFSNAMELEN || strlen(fspath) >= MNAMELEN)
| > +	if (strlen(fstype) >= MFSNAMELEN || strlen(fspath) >= MAXPATHLEN)
| >  		return (ENAMETOOLONG);
| >  
| >  	if (jailed(td->td_ucred) || usermount == 0) {
| > @@ -1095,9 +1095,9 @@ vfs_domount(
| >  	NDFREE(&nd, NDF_ONLY_PNBUF);
| >  	vp = nd.ni_vp;
| >  	if ((fsflags & MNT_UPDATE) == 0) {
| > -		pathbuf = malloc(MNAMELEN, M_TEMP, M_WAITOK);
| > +		pathbuf = malloc(MAXPATHLEN, M_TEMP, M_WAITOK);
| >  		strcpy(pathbuf, fspath);
| > -		error = vn_path_to_global_path(td, vp, pathbuf, MNAMELEN);
| > +		error = vn_path_to_global_path(td, vp, pathbuf, MAXPATHLEN);
| >  		/* debug.disablefullpath == 1 results in ENODEV */
| >  		if (error == 0 || error == ENODEV) {
| >  			error = vfs_domount_first(td, vfsp, pathbuf, vp,
| > @@ -1147,8 +1147,8 @@ sys_unmount(td, uap)
| >  			return (error);
| >  	}
| >  
| > -	pathbuf = malloc(MNAMELEN, M_TEMP, M_WAITOK);
| > -	error = copyinstr(uap->path, pathbuf, MNAMELEN, NULL);
| > +	pathbuf = malloc(MAXPATHLEN, M_TEMP, M_WAITOK);
| > +	error = copyinstr(uap->path, pathbuf, MAXPATHLEN, NULL);
| >  	if (error) {
| >  		free(pathbuf, M_TEMP);
| >  		return (error);
| > @@ -1181,7 +1181,7 @@ sys_unmount(td, uap)
| >  			vfslocked = NDHASGIANT(&nd);
| >  			NDFREE(&nd, NDF_ONLY_PNBUF);
| >  			error = vn_path_to_global_path(td, nd.ni_vp, pathbuf,
| > -			    MNAMELEN);
| > +			    MAXPATHLEN);
| >  			if (error == 0 || error == ENODEV)
| >  				vput(nd.ni_vp);
| >  			VFS_UNLOCK_GIANT(vfslocked);
| > 
| > I seemed to have found a typo bug in an instance in which MFSNAMELEN
| > wasn't used in the fstype when I did this change.
| > 
| > With this patch things in general seem to work.  You can do a
| > mount and umount of a long path.  The umount of the long path works
| > by failing on the exact match but then passing when via the FSID.
| > df/mount looks a little strange since it shows a truncated path 
| > but has valid contents (FS type, space etc.).  umount via the truncated
| > path works if there is only one truncated path that matches.  If there
| > are multiple then it fails.
| > 
| > This doesn't change and kernel ABI's so then it is safe to apply to the
| > kernel without rebuilding user-land.
| > 
| > Future work could be to implement statvfs to return a longer path but
| > only do it for df/umount etc.  The rest of the system could continue
| > with the existing statfs.  mount works because it passed a string into
| > the kernel so it can be long.
| > 
| > I'd propose this as a current solution to this problem.  It appears to
| > work well and shouldn't drastically break things.  Doing df via the
| > full path, stat etc. work since the associated path access the vnode.
| > So things that do a mount, df of the mount point etc. should continue
| > to work.  Scripts that try to figure out the mount points vi df and mount
| > displaying all mount points would fail.  That is probably good enough for
| > now.
| > 
| > Comments welcomed.
| 
| Generally, I agree with the approach, but what is done seems to be too
| simple to be usable.

I like the simplicity and I'd like to see examples of not being usable.

| One obvious and important thing which is broken with the patch is
| the unmounts from jails. In other words, now it is possible to mount
| something from jail with appropriate privileges set up, and after that
| corresponding mount cannot be unmounted, since vfs_mount_alloc() copies
| trimmed path into f_mntonname, and sys_unmount() matches full path with
| pathbuf.  Hmm, this should be broken in the same way for non-jailed
| mounts with pathes which do not fit into f_mntonname.

They can be umounted since it will fall back to fsid as in the non-jail
case.  I just tried it sorry for the bad line wrap:

+ mount 192.168.38.1:/data/home/ambrisko/netboot /data/jail/test
+ jail -i -c name=test path=/data/jail/test host.hostname=test.ambrisko.com persist enforce_statfs=0 allow.mount=1 allow.mount.devfs=1 allow.mount.nullfs=1 allow.mount.tmpfs=1 allow.mount.procfs=1
14
+ jexec test mkdir -p /1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890/proc
+ jexec test mkdir -p /1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890/dev
+ jexec test df
+ egrep '^devfs|^procfs'
devfs                                             2         2          0   100%    /dev
procfs                                            8         8          0   100%    /proc
+ jexec test mount -t procfs null //1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890/proc
+ jexec test mount -t devfs null //1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890/dev
+ jexec test df
+ egrep '^devfs|^procfs'
devfs                                             2         2          0   100%    /dev
procfs                                            8         8          0   100%    /proc
procfs                                            8         8          0   100%    /data/jail/test/12345678901234567890123456789012345678901234567890123456789012345678901
devfs                                             2         2          0   100%    /data/jail/test/12345678901234567890123456789012345678901234567890123456789012345678901
+ jexec test mount -v
+ egrep '^devfs|^procfs'
devfs on /dev (devfs, local, multilabel, fsid 00ff007171000000)
procfs on /proc (procfs, local, fsid 02ff000202000000)
procfs on /data/jail/test/12345678901234567890123456789012345678901234567890123456789012345678901 (procfs, local, fsid 26ff000202000000)
devfs on /data/jail/test/12345678901234567890123456789012345678901234567890123456789012345678901 (devfs, local, multilabel, fsid 27ff007171000000)
+ jexec test umount /1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890/proc
+ jexec test df
+ egrep '^devfs|^procfs'
devfs                                             2         2          0   100%    /dev
procfs                                            8         8          0   100%    /proc
devfs                                             2         2          0   100%    /data/jail/test/12345678901234567890123456789012345678901234567890123456789012345678901
+ jexec test umount /1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890/dev
+ jexec test df
+ egrep '^devfs|^procfs'
devfs                                             2         2          0   100%    /dev
procfs                                            8         8          0   100%    /proc

| I think that struct mount should have a const char * field where the
| non-trimmed path is stored and used for match at unmount. f_mntonname
| truncation would be only unfortunate user interface glitch.

Note that we are not storing the path in mount structure so no structures
have changed which is nice since then we haven't introduced any real
ABI breakage.  So we could MFC this.  The match isn't critical since
umount will fall back to fsid and work.  One thing that might be good to
do is change umount to try to umount via fsid first and then do the
match if the fsid failed versus the other way round that it does now.

The problem I see is if someone tries to do things based on the parsed
output of mount/df then that will fail since the output is truncated.

Thanks for looking at this,

Doug A.