[rfc] 64-bit inode numbers

Benjamin Kaduk kaduk at MIT.EDU
Sat Jun 25 03:53:42 UTC 2011

Hmm, several messages regarding AFS that I will try to address at once.

On Fri, 24 Jun 2011, Kostik Belousov wrote:
> On Thu, Jun 23, 2011 at 06:05:56PM -0400, Garance A Drosehn wrote:
>> Consider the thread "Increasing the size of dev_t and ino_t" from
>> freebsd-arch in 2002:
>> http://docs.freebsd.org/mail/archive/2002/freebsd-arch/20020317.freebsd-arch.html
>> In particular, this message by Robert Watson:
>> http://docs.freebsd.org/cgi/getmsg.cgi?fetch=139853+0+archive/2002/freebsd-arch/20020317.freebsd-arch
>> I just participated in an online conference for OpenAFS, and while it
>> isn't exactly taking the world by storm, I keep thinking it would be
>> useful if FreeBSD could map individual AFS volumes to unique dev_t
>> identifiers.  And given the way AFS is implemented (as a global filesystem
>> with many cells all reachable at the same time), and given the way most
>> sites deploy AFS (with thousands or tens-of-thousands of individual AFS
>> volumes *per site*), that adds up to a lot of values for dev_t.
>> The upcoming release of OpenAFS should include a working and pretty
>> stable AFS client for FreeBSD, so having a larger dev_t would have a
>> more immediate application than it did back in 2002.
> Am I right that the issue is the uniqueness of the dev_t for each
> AFS volume, as reported by stat(2) ?
> Shouldn't the AFS client synthesize the dev_t for each new volume
> mounted ? It seems that the current 32bit dev_t would be enough,
> since I do not expect to see hundreds of thousands of mounts
> on a single system.

The current OpenAFS implementation only presents a single mountpoint, 
/afs, and does not really distinguish between different mounted volumes. 
This is not ideal, and we would like to be able to make each volume appear 
as a separate device if there's a good way to do so.  The technical 
challenge of doing this while still only having a single mount method for
AFS is not something that I've looked at, there being more pressing issues 
on my plate.

> Please note that we do not guarantee dev_t stability across reboots even
> for real devices.

Hmm, this is somewhat annoying, as the AFS global namespace does provide a 
stable unique identifier for files/directories using a unique cell ID, 
volume ID, per-file ID, and uniquifier.  Being able to directly use the 
cell/volume information for a dev_t would be quite convenient.

On Fri, 24 Jun 2011, Bruce Evans wrote:
> mnt_stat.f_fsid is generated from the dev_t, and tries to give stability
> across reboots.  Otherwise, IIRC, nfs mounts break if the server is
> rebooted.  Not only the dev_t part, but other things in f_fsid, depend
> on the order of initialization, but the ids usually end up the same if
> you don't reconfigure much on the server.
> f_fsid also has a problem with uniqueness, but that is mainly because it
> wants to be unique when truncated to a 16-bit dev_t.  dev_t is only 16
> bits in some versions of Linux, including in the FreeBSD i386 Linux
> emulator (I can see traces of 32-bit dev_t in Linux-2.6.10 but not in
> the FreeBSD emulator).
> I hope AFS ids could be implemented like fsids and not need to literally
> match foreign ids, but if they are synthesized then they might be harder
> than fsids to keep invariant across reboots.

I'm not sure how one would have a chance of keeping things invariant 
across reboots other than to use the cell/volume IDs in some fashion.
That said, the AFS client maintains its own copy of these unique IDs in 
the fs-specific vnode area, and should be able to talk to the server just 
fine if the fsids end up faked.  Keeping the fake fsids consistent if a 
file goes in and out of the local cache may be a different issue, though.

On Fri, 24 Jun 2011, Rick Macklem wrote:

> Garance A Drosehn wrote:
>> The AFS cell at RPI has approximately 40,000 AFS volumes, and each
>> volume should have its own dev_t (IMO). That's just counting the
>> collection of AFS volumes which are on RPI file servers, and any
>> user sitting on one computer could access AFS volumes which are
>> made available by other sites (aka "AFS cells"). Most RPI users
>> would only have access to maybe 1/4 of those volumes which exist
>> at RPI, but we do know that individual users have run 'find' over
>> the entire RPI cell looking for whatever they're looking for. I
>> once did a run of 'md5deep' on the entire RPI cell, thanks to a
>> symlink which I didn't realize was in my home directory!

We have almost 50,000 volumes in the athena cell, here.

> Note that it is the value in mnt_stat.f_fsid that needs to be unique w.r.t
> other mount points in the machine. If AFS appears to be one mount
> point in the FreeBSD client, then the only issue I know of is how
> the client is expected to handle changes in dev_t within the mount

Er, how is the client expected to communicate these changes?  As mentioned 
above, I believe we currently present only a single device and mountpoint, 
which is suboptimal.  (Actually, it looks like we don't even initialize 
mnt_stat.f_fsid at all if I'm reading the current code correctly.  Oops.)
I would love to be able to present volume mountpoints as actually being 

> point. fts(3) and friends will assume that it is a mount point
> crossing when st_dev changes. It will then expect that the funny
> rule that the d_ino in dirent will not be the same as st_ino.
> What I do for NFSv4 is synthesize the mnt_stat.f_fsid value and
> return that as st_dev for the mounted volume until I see the fsid
> returned by the server change. Below that point, I return the fsid
> from the server as st_dev so long as it isn't the same as the

I think I'm confused.  You're ... walking a directory hierarchy, and
return a fake st_dev value but hold onto the fsid value from the server, 
then when the fsid from the server changes (due to a ... different NFS 
mount?), start reporting that new fsid and throw away the fake st_dev 
value?  Can you point me at the code that is doing this?

> synthesized one. That way, fts(3) and friends figure out the mount
> point crossings within the server.
> "ls -lR" will usually find problems if this is broken.
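
To make the st_dev comparison concrete, here is a minimal userland sketch
of the kind of check a tree walker can make to decide that it has crossed
onto a different device.  It is not taken from fts(3) itself; the function
name crosses_device() is made up for illustration, and it uses stat(2), so
it follows symlinks.

```c
#include <sys/types.h>
#include <sys/stat.h>

/*
 * Hypothetical helper, not part of any library: compare the st_dev
 * reported by stat(2) for a directory and for one of its entries.
 * Returns 1 if the two are on different devices, 0 if they are on
 * the same device, and -1 if either stat(2) call fails.
 */
int
crosses_device(const char *parent, const char *child)
{
	struct stat ps, cs;

	if (stat(parent, &ps) != 0 || stat(child, &cs) != 0)
		return (-1);
	return (ps.st_dev != cs.st_dev ? 1 : 0);
}
```

If every AFS volume reported the same st_dev, a check like this could
never see a volume boundary under /afs, which is the situation described
above for the current client.
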
>> So one person can easily trigger the access of 10,000 AFS volumes
>> on one computer using one command. That might sound terrifying if
>> you imagine it as being 10,000 NFS mounts, but accessing AFS volumes
>> isn't the same amount of work as auto-mounting NFS filesystems.
>> So ignore whatever problems you might expect to see with 10,000
>> filesystems mounted on one computer. Just realize that it is very
>> easy for a single user to access tens of thousands of AFS volumes
>> from one computer, and it would be "most correct" (programming wise)
>> if all of those AFS volumes were to get a unique value for dev_t.
>> And of course it's even easier for a remote-access system to access
>> tens-of-thousands of AFS volumes, since it would have a few dozen
>> users logged in at the same time.

I guess, at the end of the day, it's not clear to me what OpenAFS should 
do when we finally get around to exposing AFS volume mountpoints as device 
mountpoints to userland.  Reusing existing globally-unique AFS ID 
information would be nice, but how to cleanly transform that to a smaller 
unique ID for the particular machine in question?
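
One possible shape for such a mapping, purely as a sketch and not something
OpenAFS does today: keep a per-machine table keyed by (cell, volume) and
hand out small local IDs the first time a volume is seen.  All of the names
below (afs_volkey, volmap, volmap_get) and the table size are assumptions
for illustration.  The IDs are stable only for as long as the table lives,
so this does nothing for stability across reboots.

```c
#include <stdint.h>

/* Hypothetical key: the per-volume part of the global AFS identifier. */
struct afs_volkey {
	uint32_t cell;		/* locally assigned cell index (assumption) */
	uint32_t volume;	/* AFS volume ID */
};

#define VOLMAP_SLOTS 1024	/* power of two; illustrative size only */

/* Open-addressed table; must be zero-initialized before first use. */
struct volmap {
	struct afs_volkey keys[VOLMAP_SLOTS];
	uint32_t ids[VOLMAP_SLOTS];	/* 0 means the slot is empty */
	uint32_t next_id;		/* last local ID handed out */
};

/* Mix the 64-bit (cell, volume) pair down to a 32-bit slot hash. */
static uint32_t
volkey_hash(struct afs_volkey k)
{
	uint64_t h = ((uint64_t)k.cell << 32) | k.volume;

	h ^= h >> 33;
	h *= 0xff51afd7ed558ccdULL;
	h ^= h >> 33;
	return ((uint32_t)h);
}

/*
 * Return the local ID for a volume, assigning the next unused ID the
 * first time the volume is seen.  Returns 0 if the table is full.
 */
uint32_t
volmap_get(struct volmap *m, struct afs_volkey k)
{
	uint32_t start = volkey_hash(k) & (VOLMAP_SLOTS - 1);
	uint32_t n, i;

	for (n = 0; n < VOLMAP_SLOTS; n++) {
		i = (start + n) & (VOLMAP_SLOTS - 1);	/* linear probing */
		if (m->ids[i] == 0) {
			m->keys[i] = k;
			m->ids[i] = ++m->next_id;
			return (m->ids[i]);
		}
		if (m->keys[i].cell == k.cell &&
		    m->keys[i].volume == k.volume)
			return (m->ids[i]);
	}
	return (0);
}
```

Because the table is keyed by volume rather than by cached file, the ID
for a volume would not change when individual files go in and out of the
local cache, as long as entries are never evicted.
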

-Ben Kaduk
