changing semantics of the va_filerev (code review)

Tue Apr 14 08:52:35 PDT 2009

On Tue, 14 Apr 2009, Bruce Evans wrote:

[stuff snipped]
>
> Oops, I missed that since nfs's use of i_gen is indirect.  What does
> nfs do for file systems that don't detect removed files, e.g., msdosfs.
> vptofh and fhtovp routines seem to have too many differences.  E.g.,
An nfs client can always think that a file exists for a short period of
time (until client side caches time out) after it has been removed locally
or by another client, on the server. The more serious failure occurs when
the i-node/directory entry gets reallocated. At that point, the client 
might access the attributes/data of the new file, thinking it was the old
file. (In the worst case, this could persist until the client does a
umount() of the file system.)

However, typically, unless it has the file open when the file is removed
locally on the server or by another client, nothing nasty will happen.
(And I think if the client has name caching disabled, nothing nasty can
happen.)

At least, that's my best guess at an answer.

> file systems based on ffs return ESTALE for removed files, but
> zfs_fhtovp() returns EINVAL.
>
I don't know why zfs would choose a different errno, but I don't think
that a different errno will have much effect. It's a terminal error
in either case. (I can't think of anything clever that a client can
do for ESTALE. I wouldn't be surprised if some clients end up translating
ESTALE to EINVAL, since POSIX apps don't expect ESTALE.)

I suppose someone could argue it violates the RFC, but only if they know
that the server should generate NFS3ERR_STALE for that case.

> I just noticed than the increment of i_gen was slightly broken for ffs
> by a type mismatch in ffs2 (affects ffs1 too).  Originally, i_gen had
> the same type as di_gen (int32_t).  Now i_gen has type int64_t but in
> ffs1, di_gen of course still has type int32_t, and in ffs2, di_gen
> still has type int32_t (apparently there was insufficient space to
> expand it).  This makes the overflow check in ffs_alloc.c (++ip->i_gen
> == 0) more broken than before.  Previously it only gave undefined
> behaviour followed by a bogus check when overflow occurs for incrementing
> from INT32_T_MAX.  Now it has no effect, since it takes 293 years of
> incrementing at a rate of 1GHz to reach overflow at INT64_T_MAX.
> Overflow now occurs on assignment to di_gen.
>
> The result of this bug is almost the the same as removing the silly
> part of the security code -- the re-randomization on overflow.  i_gen
> may grow larger than UINT32_T_MAX, but usually refresh from the dinode
> will keep it smaller.  When it starts near UINT32_T_MAX and grows
> larger, the overflow on assignment and a subsequent refresh will make
> it nearly 0.  Except, in 1 in every 2**32 cases, when the overflow makes
> di_gen exactly 0, the subsequent refresh will randomize i_gen.
>
Sounds like you have a better understanding of this than I. Since all
nfs really cares about is that the value of i_gen has changed after
the i-node is re-allocated, I doubt this causes grief in practice.
Personally, I'd just leave it as a 32bit number and initialize it
to some pseudo-random value in a range that is a small fraction of
UINT32_T_MAX (maybe 1<->1000000) if it is 0, otherwise just increment
it by a small value. (I've already noted that I'm not a big fan of
security by obscurity anyhow:-)

>>> va_ctime should give what you want for all file systems, since it
>>> should be increased whenever anything changes.  However, most file
>> There are some places where IN_UPDATE gets set, but IN_CHANGE doesn't.
>
> Are there?  This would be a bug.  I checked that ffs doesn't have this
> bug.
>
Oops, my mistake. I grep'd again and see it is IN_CHANGE that gets set
without IN_UPDATE and not the other way around, which makes sense, since
I can't think of how you can modify the data without modifying some
attribute.

So, the Change attribute only needs to change for IN_CHANGE (with all 
those uses of "change", it must be good:-). Thanks for pointing this out.

>
> They need to be fixed or faked well enough for make(1) too.
>
> When the dinode has no space to spare, something can be done by keeping
> state in the inode or vnode.  This won't work across reboots of course
> (except by hashing a reboot counter into the generation counts or
> timestamps) but might be enough for all short-term uses.  I'm not sure
> how much is safe here.
>
Yes, definitely. I think doing something like having an in-memory field
for va_filerev/i_modrev where the high order bits are initialized by
ctime (using whatever bits are valid, given tod clock resolution) when 
read in and then incrementing by 1 for each change, would be a good 
compromise.

rick