mount ZFS snapshot on Linux system

Rick Macklem rmacklem at uoguelph.ca
Thu Dec 12 22:35:21 UTC 2013


Jason Keltz wrote:
> On 12/11/2013 06:21 PM, Rick Macklem wrote:
> > Jason Keltz wrote:
> >> On 10/12/2013 7:21 PM, Rick Macklem wrote:
> >>> Jason Keltz wrote:
> >>>> I'm running FreeBSD 9.2 with various ZFS datasets.
> >>>> I export a dataset to a Linux system (RHEL 6.4), and mount it.  It
> >>>> works fine...
> >>>> When I try to access the ZFS snapshot directory on the Linux NFS
> >>>> client, things go weird.
> >>>>
> >>>> With NFSv4:
> >>>>
> >>>> [jas at archive /]# cd /mnt/.zfs/snapshot
> >>>> [jas at archive snapshot]# ls
> >>>> 20131203  20131205  20131206  20131207  20131208  20131209  20131210
> >>>> [jas at archive snapshot]# cd 20131210
> >>>> 20131210: Not a directory.
> >>>>
> >>>> huh?
> >>>>
> >>>> [jas at archive snapshot]# ls -al
> >>>> total 77
> >>>> dr-xr-xr-x   9 root root   9 Dec 10 11:20 .
> >>>> dr-xr-xr-x   4 root root   4 Nov 28 15:42 ..
> >>>> drwxr-xr-x 380 root root 380 Dec  2 15:56 20131203
> >>>> drwxr-xr-x 381 root root 381 Dec  3 11:24 20131205
> >>>> drwxr-xr-x 381 root root 381 Dec  3 11:24 20131206
> >>>> drwxr-xr-x 381 root root 381 Dec  3 11:24 20131207
> >>>> drwxr-xr-x 381 root root 381 Dec  3 11:24 20131208
> >>>> drwxr-xr-x 381 root root 381 Dec  3 11:24 20131209
> >>>> drwxr-xr-x 381 root root 381 Dec  3 11:24 20131210
> >>>> [jas at archive snapshot]# stat *
> >>>> [jas at archive snapshot]# ls -al
> >>>> total 292
> >>>> dr-xr-xr-x 9 root      root         9 Dec 10 11:20 .
> >>>> dr-xr-xr-x 4 root      root         4 Nov 28 15:42 ..
> >>>> -rw-r--r-- 1 uax    guest   137647 Mar 17  2010 20131203
> >>>> -rw-r--r-- 1 uax    guest         865 Jul 31  2009 20131205
> >>>> -rw-r--r-- 1 uax    guest   137647 Mar 17  2010 20131206
> >>>> -rw-r--r-- 1 uax    guest         771 Jul 31  2009 20131207
> >>>> -rw-r--r-- 1 uax    guest         778 Jul 31  2009 20131208
> >>>> -rw-r--r-- 1 uax     guest       5281 Jul 31  2009 20131209
> >>>> -rw------- 1 btx      faculty      893 Jul 13 20:21 20131210
> >>>>
> >>>> But it gets even more fun..
> >>>>
> >>>> # ls -ali
> >>>> total 205
> >>>>   2 dr-xr-xr-x   9 root root       9 Dec 10 11:20 .
> >>>>   1 dr-xr-xr-x   4 root root       4 Nov 28 15:42 ..
> >>>> 863 -rw-r--r--   1 uax  guest 137647 Mar 17  2010 20131203
> >>>>   4 drwxr-xr-x 381 root root     381 Dec  3 11:24 20131205
> >>>>   4 drwxr-xr-x 381 root root     381 Dec  3 11:24 20131206
> >>>>   4 drwxr-xr-x 381 root root     381 Dec  3 11:24 20131207
> >>>>   4 drwxr-xr-x 381 root root     381 Dec  3 11:24 20131208
> >>>>   4 drwxr-xr-x 381 root root     381 Dec  3 11:24 20131209
> >>>>   4 drwxr-xr-x 381 root root     381 Dec  3 11:24 20131210
> >>>>
> >>>> This is not a user id mapping issue because all the files in
> >>>> /mnt have the proper owner/groups, and I can access them there
> >>>> fine.
> >>>>
> >>>> I also tried explicitly exporting .zfs/snapshot.  The result
> >>>> isn't any different.
> >>>>
> >>>> If I use NFSv3 it "works", but I'm seeing a whole lot of errors
> >>>> like these in syslog:
> >>>>
> >>>> Dec 10 12:32:28 jungle mountd[49579]: can't delete exports for
> >>>> /local/backup/home9/.zfs/snapshot/20131203: Invalid argument
> >>>> Dec 10 12:32:28 jungle mountd[49579]: can't delete exports for
> >>>> /local/backup/home9/.zfs/snapshot/20131209: Invalid argument
> >>>> Dec 10 12:32:28 jungle mountd[49579]: can't delete exports for
> >>>> /local/backup/home9/.zfs/snapshot/20131210: Invalid argument
> >>>> Dec 10 12:32:28 jungle mountd[49579]: can't delete exports for
> >>>> /local/backup/home9/.zfs/snapshot/20131207: Invalid argument
> >>>>
> >>>> It's not clear to me why this doesn't just "work".
> >>>>
> >>>> Can anyone provide any advice on debugging this?
> >>>>
> >>> As I think you already know, I know nothing about ZFS and never
> >>> use it.
> >> Yup! :)
> >>> Having said that, I suspect that there are filenos (i-node #s)
> >>> that are the same in the snapshot as in the parent file system
> >>> tree.
> >>>
> >>> The basic assumptions are:
> >>> - within a file system, all i-node#s are unique (each represents
> >>>     one file object only) and all file objects have the same fsid
> >>> - when the fsid changes, that indicates a file system boundary,
> >>>     and filenos (i-node#s) can be reused in the subtree with a
> >>>     different fsid
> >>>
> >>> For NFSv3, the server should export single volumes only (all
> >>> objects have the same fsid and the filenos are unique). This is
> >>> indicated to the VFS by the use of the NOCROSSMOUNT flag on
> >>> VOP_LOOKUP() and friends.
> >>>
> >>> For NFSv4, the server does export multiple volumes and the
> >>> boundary is indicated by a change in fsid value.
> >>>
> >>> I suspect ZFS snapshots don't obey the above in some way, but that
> >>> is just a hunch.
> >>>
> >>> Now, how to narrow this down...
> >>> - Do the above tests (both NFSv4 and NFSv3) and capture the
> >>>     packets, then look at them in wireshark. In particular, look
> >>>     at the fileid numbers and fsid values for the various
> >>>     directories under .zfs.
> >> I gave this a shot, but I haven't used wireshark to capture NFS
> >> traffic before, so if I need to provide additional details, let me
> >> know..
> >>
> >> NFSv4:
> >>
> >> For /mnt/.zfs/snapshot/20131203:
> >> fileid=4
> >> fsid4.major=1446349656
> >> fsid4.minor=222
> >>
> >> For /mnt/.zfs/snapshot/20131205:
> >> fileid=4
> >> fsid4.major=1845998066
> >> fsid4.minor=222
> >>
> >> For /mnt/jas:
> >> fileid=144
> >> fsid4.major=597946950
> >> fsid4.minor=222
> >>
> >> For /mnt/jas1:
> >> fileid=338
> >> fsid4.major=597946950
> >> fsid4.minor=222
> >>
> >> So fsid is the same for all the different "data" directories,
> >> which is what I would expect given what you said.  I guess each
> >> snapshot is seen as a unique filesystem...  but then a repeating
> >> inode in different filesystems shouldn't be a problem...
> >>
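The two assumptions listed earlier (filenos unique within an fsid, reuse allowed across fsids) can be checked mechanically against the NFSv4 numbers above. What follows is only a rough Python sketch: the records are the fileid and fsid4.major values reported in this thread, and the function name find_fileid_collisions is made up for illustration.

```python
from collections import defaultdict

# (path, fsid4.major, fileid) as reported from the NFSv4 capture in this thread.
OBSERVED = [
    ("/mnt/.zfs/snapshot/20131203", 1446349656, 4),
    ("/mnt/.zfs/snapshot/20131205", 1845998066, 4),
    ("/mnt/jas", 597946950, 144),
    ("/mnt/jas1", 597946950, 338),
]


def find_fileid_collisions(records):
    """Return {(fsid, fileid): [paths]} for every pair claimed by more than one path."""
    seen = defaultdict(list)
    for path, fsid, fileid in records:
        seen[(fsid, fileid)].append(path)
    return {key: paths for key, paths in seen.items() if len(paths) > 1}


if __name__ == "__main__":
    # Both snapshots use fileid 4, but under different fsids, so no collision.
    print(find_fileid_collisions(OBSERVED))
```

With these four records nothing collides, which agrees with the observation that a repeated fileid in different file systems is not a problem in itself. Hard links would also show up as collisions, so a check like this is only meaningful for directories.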
> > Yes, it appears that each snapshot is represented as a different
> > file system. As such, NFSv4 should work for these, but there is an
> > additional property of the "root" of each of these (20131203, ...).
> > When the directory .zfs/snapshot is read, the fileno for 20131203
> > should be different than the fileno returned by
> > VOP_GETATTR()/stat() for "20131203".
> > (The old "mounted-on" vs "root-of-mounted-fs" vnodes which you get
> >   for a "mount point".)
> > For NFSv4, the server returns the fileno in the VOP_READDIR()
> > dirent as a separate attribute called mounted_on_fileid vs the
> > value returned by VOP_GETATTR() as the fileid attribute.
> > If the value of these 2 attributes is the same, it is not a "mount
> > point".
> >
> > So, maybe you could take another look at the packet capture in
> > wireshark and see what the fileid and mounted_on_fileid attributes
> > are?
> 
> Unfortunately, I didn't save the log, but it was easy enough to
> regenerate.
> 
> But before we go there, I've spent a lot of time experimenting with
> this, so I can say...
> 
> If I NFSv4 mount nfs-server:/local/backup/home9 to /mnt, then I:
> cd /mnt/.zfs/snapshot/20131203
> ... it works great!  I can change into any user directory, list
> files, etc.
> If I then:
> cd /mnt/.zfs/snapshot/20131205
> .. it also works great!
> But... if I cd into /mnt/.zfs/snapshot, the free ride is over...
> all the snapshot directories appear as files and the problem is
> there.
> 
> ... unless I unmount and remount, in which case I can repeat.
> 
> I also found that a change of kernel from 2.6.32-358.14.1.el6  (the
> kernel I was running with RHEL6.4) to 2.6.32-431.el6 (the kernel that
> comes with RHEL6.5) does actually change something important....
> 
> If I mount nfs-server:/local/backup/home9 and try to change into
> "/mnt/.zfs/snapshot" with the new kernel, I still have the problem.
> Likewise, if I try to mount nfs-server:/local/backup/home9/.zfs, and
> change into "/mnt/snapshot",  I also have the problem.
> If I mount nfs-server:/local/backup/home9/.zfs/snapshot and change
> into "/mnt" with the RHEL 6.4 kernel in place, I still have the old
> problem.
> However, if I do the same mount with the newer kernel, it now works.
> I can "ls" and see the snapshot directories.  I can change into any
> of them, then "cd .." and change into another one.
> I tested this on two systems - one where I just installed the entire
> 6.5 upgrade, and the other where I just installed the kernel from 6.5
> on the 6.4 system, so it seems related to the kernel.
> It's still not clear why I can't just mount
> nfs-server:/local/backup/home9 on RHEL6.5, and the NFSv4 server
> figures it out.  I did try from another FreeBSD client, and I can
> mount the tree at any point, and the NFS server is happy.  This makes
> me believe it's probably a RHEL NFSv4 bug.
> 
> Here's the numbers..
> 
> NFSv4:
> 
> So, if I try to access the snapshot path directly, on the way ...
> 
> .zfs:
> V4 LOOKUP
> fsid.major: 597946950
> fileid: 1
> fattr owner/group are root - correct
> 
> snapshot:
> V4 LOOKUP
> fsid.major: 597946950
> fileid: 2
> fattr owner/group are root - correct
> 
> If I access /.zfs/snapshot/20131203 directly...:
> 
> 20131203:
> V4 LOOKUP
> fsid.major: 1446349656
> fileid: 4
> fattr owner/group are root - correct
> 
> V4 READDIR snapshot, 20131203 entry:
> fsid.major: 597946950 <-- ????
> fattr4_fileid: 863
> fattr4_owner/group refers to a group on our system (the one displayed
> in ls sometimes)..
> FATTR4_MOUNTED_ON_FILEID: 0x000000000000035f
> 
> But if I ls /mnt/.zfs/snapshot:
> 
> V4 LOOKUP:
> 20131203:
> fsid.major: 597946950
> fileid: 4
> 
> V4 READDIR:
> fsid4.major: 597946950
> fattr4_fileid: 863
> fattr4_mounted_on_fileid: 0x000000000000035f
> 
I'll admit I'm not sure what you are looking at, but
the above does seem incorrect. Could you email me the
raw packet capture, by any chance? (In particular, I
need the packet capture for a readdir of .zfs/snapshot,
so I can look at the attributes of all the entries.)

Assuming the snapshots are represented as separate
file systems, when you do a readdir of .zfs/snapshot,
the fileid attribute and mounted_on_fileid attributes
should be different. (0x35f == 863) Also, the fsid
shouldn't be the same as .zfs/snapshot, which is
597946950 it seems.
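
To make the two checks concrete, here is a rough Python sketch applied to the values quoted above for the 20131203 READDIR entry. The function name mount_point_problems and the dict layout are my own invention for illustration; nothing here talks to a real NFS server.

```python
def mount_point_problems(entry, parent_fsid):
    """Apply the two checks described above to one READDIR entry.

    entry is a dict with 'fsid', 'fileid' and 'mounted_on_fileid' keys.
    Returns a list of problems; an empty list means the entry is reported
    the way the root of a separate file system should be.
    """
    problems = []
    if entry["fileid"] == entry["mounted_on_fileid"]:
        problems.append("fileid equals mounted_on_fileid")
    if entry["fsid"] == parent_fsid:
        problems.append("fsid is the same as the parent directory")
    return problems


# Values reported for the 20131203 entry in the READDIR of .zfs/snapshot.
readdir_entry = {
    "fsid": 597946950,
    "fileid": 863,
    "mounted_on_fileid": 0x35F,
}

if __name__ == "__main__":
    # 597946950 is the fsid reported for .zfs/snapshot itself.
    for problem in mount_point_problems(readdir_entry, parent_fsid=597946950):
        print(problem)
```

For the reported values both checks fail, which is why the READDIR reply looks wrong to me.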

> >> NFSv3:
> >>
> >> For /mnt/.zfs/snapshot/20131203:
> >> fileid=4
> >> fsid=0x0000000056358b58
> >>
> >> For /mnt/.zfs/snapshot/20131205:
> >> fileid=4
> >> fsid=0x000000006e07b1f2
> >>
> >> For /mnt/jas
> >> fileid=144
> >> fsid=0x0000000023a3f246
> >>
> >> For /mnt/jas1:
> >> fileid=338
> >> fsid=0x0000000023a3f246
> >>
> >> Here, it seems it's the same, even though it's NFSv3... hmm.
> >>
> >>
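Wireshark shows the NFSv3 fsid in hex and the NFSv4 fsid4.major in decimal, which makes them awkward to compare by eye. A small Python sketch, using only the values quoted in this thread (the function name compare_fsids is hypothetical), converts one to the other:

```python
# fsid values as quoted in this thread: hex for NFSv3, decimal major for NFSv4.
NFSV3_FSID = {
    "/mnt/.zfs/snapshot/20131203": "0x0000000056358b58",
    "/mnt/.zfs/snapshot/20131205": "0x000000006e07b1f2",
    "/mnt/jas": "0x0000000023a3f246",
    "/mnt/jas1": "0x0000000023a3f246",
}
NFSV4_FSID_MAJOR = {
    "/mnt/.zfs/snapshot/20131203": 1446349656,
    "/mnt/.zfs/snapshot/20131205": 1845998066,
    "/mnt/jas": 597946950,
    "/mnt/jas1": 597946950,
}


def compare_fsids(v3, v4):
    """Return {path: bool}, True when the NFSv3 hex fsid equals the NFSv4 major."""
    return {path: int(v3[path], 16) == v4[path] for path in v3}


if __name__ == "__main__":
    for path, same in compare_fsids(NFSV3_FSID, NFSV4_FSID_MAJOR).items():
        print(path, "same" if same else "different")
```

All four paths compare equal, so the server is reporting the same fsid values to both protocol versions, and each snapshot has its own fsid under NFSv3 as well.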
> >>> - Try mounting the individual snapshot directory, like
> >>>      .zfs/snapshot/20131209 and see if that works (for both NFSv3
> >>>      and NFSv4).
> >> Hmm .. I tried this:
> >>
> >> /local/backup/home9/.zfs/snapshot/20131203  -ro
> >> archive-mrpriv.cs.yorku.ca
> >> V4: /
> >>
> >> ... but syslog reports:
> >>
> >> Dec 10 22:28:22 jungle mountd[85405]: can't export
> >> /local/backup/home9/.zfs/snapshot/20131203
> >>
> > mountd will do a VFS_CHECKEXP(), which seems to fail for
> > these (which also explains the error messages). To be honest,
> > with these failing, remote access should fail.
> >
> > Also, since NFSv3 exported volumes should not cross
> > "mount points" (anywhere the fsid changes), all a mount
> > above .zfs/snapshot/20131203 should get are a bunch of
> > empty directories called 20131203,...
> I tried again just in case I missed something...
> nfs-server:/local/backup/home9 on /mnt type nfs
> (ro,vers=3,addr=172.16.2.26)
> I can change into /mnt/.zfs/snapshot/20131203/jas and list the
> directory, or less a file.
> 
> > For example, in the UFS world, with separate file systems
> > /sub1 and /sub1/sub2, both exported:
> > - an NFSv3 mount of /sub1 on /mnt would see an empty
> >    directory "sub2" when looking in /mnt. (Actually it
> >    isn't necessarily empty. It might have whatever is in
> >    the directory when /sub1/sub2 is not mounted.)
> >
> > This seems pretty obviously broken for ZFS, but I think
> > it needs to be fixed in ZFS and I have no idea how to do
> > that, since I don't know if snapshots are real mount points, etc.
> >
> >> ... and of course I can't mount from either v3/v4.
> >>
> >> On the other hand, I kept it as:
> >>
> >> /local/backup/home9 -ro archive-mrpriv.cs.yorku.ca
> >> V4:/
> >>
> >> ... and was able to NFSv4 mount
> >> /local/backup/home9/.zfs/snapshot/20131203, and this does indeed
> >> work.
> >>
> > Yes, although technically it should not work unless 20131203 is
> > exported.
> Hmm..  I thought that this line in the exports man page meant that it
> was okay:
> 
> "Because NFSv4 does not use the mount protocol, the ``administrative
> controls'' are not applied.  Thus, all the above export line(s)
> should be considered to have the -alldirs flag, even if the line is
> specified without it."
> 
This means that all directories within a file system are exported.
Since .zfs/snapshot/20131203 is a separate file system, it should need
a separate export entry. ZFS likes to do its own thing w.r.t. exports,
so I am not sure what it is actually doing w.r.t. snapshots.

> > However, it is probably the easiest work around until this is fixed
> > someday.
> > So, just to make sure I am clear on this...
> > A NFSv4 mount of the snapshot works ok, even for a Linux client
> > mount.
> Yes.
> Although with the new kernel,  I can mount
> nfs-server:/local/backup/home9/.zfs/snapshot now as well... which is
> neat because it solves the problem I was trying to solve..
> I wanted users to be able to view their own snapshots, but not the
> snapshots of other users...
> Now, on the archive server, I can mount the snapshot dir via NFSv4,
> then, through autofs I am able to run a shell script that bind mounts
> the user's own individual snapshot directories from the NFSv4 mount
> into one directory.  I then provide chrooted sftp access to that
> directory for users to get at their files.  A user now sees
> "20131203 20131204..." when they sftp in..
> 
I can't be sure, but since you mentioned above that it is "fixed" by
a dismount/remount, that would suggest it depends on what is cached
in the client and might break at different times, depending what is
cached?

If you can do a mount like:
# mount -t nfs4 nfs-server:/local/backup/home9/.zfs/snapshot/20131203 /mnt
that might work reliably, although it may not be what you want.
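
Since the failure seems to depend on client caching, it may help to have a quick way to spot it from the client side. A rough Python sketch, assuming only that the symptom is the one shown earlier (snapshot entries showing up as regular files); the function name entries_not_directories is made up:

```python
import os
import stat


def entries_not_directories(snapshot_dir):
    """Return names under snapshot_dir that lstat() does not report as directories."""
    bad = []
    for name in sorted(os.listdir(snapshot_dir)):
        mode = os.lstat(os.path.join(snapshot_dir, name)).st_mode
        if not stat.S_ISDIR(mode):
            bad.append(name)
    return bad
```

Calling entries_not_directories("/mnt/.zfs/snapshot") on the Linux client should return an empty list while things are working, and the affected snapshot names once the problem has appeared. The path is just the mount point used in this thread.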

> >>> - Try doing the mounts with a FreeBSD client and see if you get
> >>>     the same behaviour?
> >> I found this:
> >> http://forums.freenas.org/threads/mounting-snapshot-directory-using-nfs-from-linux-broken.6060/
> >> .. implies it will work from FreeBSD/Nexenta, just not Linux.
> > I suspect this might be the mounted_on_fileid vs fileid issue.
> > (ie, The Linux client needs this to be done correctly, but the
> >   other clients figure it out.)
> >
> > One case that might break for FreeBSD would be to cd into a
> > snapshot and then do a pwd with the debug.disablecwd sysctl set
> > to 1.
> >
> > Hopefully the ZFS wizards are reading this, rick
> Me too!
> 
> Jason.
> 
> 


More information about the freebsd-fs mailing list