Extending di_nlink and its ilk

Tue Jan 4 00:08:50 GMT 2005

Kenneth Vestergaard Schmidt wrote:
> Hello.
> 
> I've run into a wee problem trying to create a nice backup-machine. We made
> it using rsync, hardlinks, and a modified link-by-hash patch for rsync.
> 
> link-by-hash creates an md4 checksum of the file's contents. It then stores
> the file in /dana/hashes/abcdef/1234567890 and hardlinks it to the correct
> place. This way, identical files only get stored once.
> 
> At this point, we ran into the problem with di_nlink and related fields
> only being 16-bit, since we were creating more than 32765 sub-directories.
> 
> I fixed this by only creating 256 directories, each containing a lot of
> files. However, we soon ran into yet another problem, that of more than
> 32767 links to one file - when we link by contents, this limit comes up
> real quick.
> 
> My initial idea was to patch the file-system to use one of the spare
> values at the end of various inode-structs to provide a 32-bit or 64-bit
> value to the link count. Of course, some backward-compatible scheme must
> be employed where the original di_nlink is read first, but I wanted to
> hear if this is a totally hare-brained scheme before I start doing it,
> or if it would actually be useful to others?
> 
> The only other choice I have is a couple of extremely ugly hacks to rsync,
> which I'd rather not do.
> 
> 

The downside to having really large directories is that the lookup and
readdir operations are linear in UFS.  The DIRHASH code helps quite a
bit, but it's still far from optimal.  That performance scalability
problem is likely why there has been little pressure to increase the
size of di_nlink.

Assuming that performance is not an important consideration, the next
problem is how to specify an alternate link counter (let's call it 
di_nlink2 here) in a way that is as backwards compatible as possible.
How will tools like fsck and dump, let alone the kernel FS code, know
to use the new field as opposed to the old one?  Do we assign special
meaning to a specific value of di_nlink?  i.e. if di_nlink is set to
0xffff, then everything should use di_nlink2 instead?  What protects
us from a version of fsck that doesn't understand the magic value
completely trashing the FS?  There are no magic values for di_nlink
right now, so anything that you choose has the possibility of colliding
with a valid value.
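
To make the magic-value idea concrete, here is a rough sketch in C.  It
is only an illustration under stated assumptions: struct sketch_dinode
is a simplified stand-in rather than the real on-disk inode, di_nlink is
treated as unsigned, di_nlink2 is the hypothetical wide field discussed
above, and SKETCH_NLINK_EXT (0xffff) is not reserved by any existing UFS.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sentinel; no current UFS reserves this value. */
#define SKETCH_NLINK_EXT 0xffffu

/* Simplified stand-in for an inode, NOT the real dinode layout. */
struct sketch_dinode {
	uint16_t di_nlink;   /* legacy 16-bit count (unsigned in this sketch) */
	uint32_t di_nlink2;  /* hypothetical wide count kept in a spare field */
};

/* Read the link count; use di_nlink2 only when the sentinel is present. */
static uint32_t
sketch_get_nlink(const struct sketch_dinode *dp)
{
	assert(dp != NULL);
	if (dp->di_nlink == SKETCH_NLINK_EXT)
		return (dp->di_nlink2);
	return (dp->di_nlink);
}

/* Store a link count, spilling into di_nlink2 once it no longer fits. */
static void
sketch_set_nlink(struct sketch_dinode *dp, uint32_t n)
{
	assert(dp != NULL);
	if (n < SKETCH_NLINK_EXT) {
		dp->di_nlink = (uint16_t)n;
		dp->di_nlink2 = 0;
	} else {
		/* Counts >= 0xffff live only in the wide field. */
		dp->di_nlink = SKETCH_NLINK_EXT;
		dp->di_nlink2 = n;
	}
}
```

Note that this does nothing for the compatibility problem: a tool that
does not know about the sentinel still reads 0xffff as an ordinary link
count and may "correct" it.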

It's quite common to share a disk between different versions of BSD, as 
well as different BSDs, so you really cannot have too many seatbelts 
here. We could bump the magic in the superblock and use that to instruct 
the tools on how to treat di_nlink, but that's a pretty dramatic change 
and will make it much harder to share disks.  It would basically amount
to creating 'UFS3', and at that point it would be more prudent to review
better directory layout policies at the same time.
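
For comparison, gating the link-count width on the superblock magic
might look roughly like the following.  This is a sketch, not existing
code: the UFS1 and UFS2 values match the magic numbers in FreeBSD's
fs.h, while SKETCH_UFS3_MAGIC and the sketch_* names are made up purely
for illustration.

```c
#include <stdint.h>

#define SKETCH_UFS1_MAGIC 0x011954u    /* UFS1 superblock magic */
#define SKETCH_UFS2_MAGIC 0x19540119u  /* UFS2 superblock magic */
#define SKETCH_UFS3_MAGIC 0x20050104u  /* hypothetical, invented here */

enum sketch_nlink_width {
	SKETCH_NLINK_UNKNOWN = 0,  /* unrecognised magic: refuse to touch it */
	SKETCH_NLINK_16 = 16,      /* existing 16-bit di_nlink */
	SKETCH_NLINK_32 = 32       /* wide link count in the new format */
};

/* Decide how to interpret the link count from the superblock magic. */
static enum sketch_nlink_width
sketch_nlink_width(uint32_t fs_magic)
{
	switch (fs_magic) {
	case SKETCH_UFS1_MAGIC:
	case SKETCH_UFS2_MAGIC:
		return (SKETCH_NLINK_16);
	case SKETCH_UFS3_MAGIC:
		return (SKETCH_NLINK_32);
	default:
		return (SKETCH_NLINK_UNKNOWN);
	}
}
```

The seatbelt here is the default case: a tool that predates the new
magic would not recognise the filesystem at all, rather than silently
misreading di_nlink.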

Scott