ufs snapshot is sometimes corrupt on gjourneled partition

Sun Jul 16 16:31:10 UTC 2017

I am copying Kostik on this email even though he is on freebsd-fs
as I would like him to review my answer below.

> Date: Sun, 16 Jul 2017 17:52:39 +0200
> From: Andreas Longwitz <longwitz at incore.de>
> To: Kirk McKusick <mckusick at mckusick.com>
> CC: freebsd-fs at freebsd.org
> Subject: Re: ufs snapshot is sometimes corrupt on gjourneled partition
> 
> Hello,
> 
> I like to discuss the following code snippet from function
> indiracct_ufs2() in source file ffs_snapshot.c:
> 
>     /*
>      * We have to expand bread here since it will deadlock looking
>      * up the block number for any blocks that are not in the cache.
>      */
>     bp = getblk(cancelvp, lbn, fs->fs_bsize, 0, 0, 0);
>     bp->b_blkno = fsbtodb(fs, blkno);
>     if ((bp->b_flags & (B_DONE | B_DELWRI)) == 0 &&
>         (error = readblock(cancelvp, bp, fragstoblks(fs, blkno)))) {
>             brelse(bp);
>             return (error);
>     }
> 
> In all my tests (all mount types, good or bad result) the flags B_DONE
> and B_DELWRI in bp->b_flags were always zero, so the data buffer given by
> getblk() at bp->b_data is always overwritten with the data from the
> explicit I/O performed by readblock(). By the way the flag B_CACHE was
> always set.

Key observation here that B_CACHE is set! That means that the data
in the block is valid and does not need to be read again. So the fix
is this:

Index: sys/ufs/ffs/ffs_snapshot.c
===================================================================

--- sys/ufs/ffs/ffs_snapshot.c	(revision 321045)
+++ sys/ufs/ffs/ffs_snapshot.c	(working copy)
@@ -1394,7 +1394,7 @@
 	 */
 	bp = getblk(cancelvp, lbn, fs->fs_bsize, 0, 0, 0);
 	bp->b_blkno = fsbtodb(fs, blkno);
-	if ((bp->b_flags & (B_DONE | B_DELWRI)) == 0 &&
+	if ((bp->b_flags & (B_DONE | B_DELWRI | B_CACHE)) == 0 &&
 	    (error = readblock(cancelvp, bp, fragstoblks(fs, blkno)))) {
 		brelse(bp);
 		return (error);

Absent gjournal the extra read was unnecessary but harmless. The
buffer exists with B_DELWRI so we will not read it, or it is locked
while being written to disk, so we will wait for the write to complete,
after which it is safe to read it. With gjournal there is a window of
vulnerability between when the buffer is marked B_DELWRI and when it
is actually written to disk. This is because gjournal will say the
write has completed when it has not yet actually done so. Thus when
we read it, we get incorrect results. But until gjournal actually
completes the write, it will keep the buffer around (marked B_CACHE),
so as long as we check for B_CACHE to avoid the read (as we should
anyway to avoid an unnecessary read) we should always get consistent
results.

> I have saved the buffer given by getblk() and compared this buffer
> with the data read by readblock(). Without gjournal there is never
> a difference. Using a gjournaled partition  things are more
> sophisticated. Taking a snapshot in my example environment needs
> 678 calls of getblk(). In the "good case" all I/O's performed by
> readblock() gave the same data as getblk(), in the "bad case" there
> are some differences between the data in buffer cache seen by
> getblk() and the data on disk seen by readblock(). Mostly two or
> three of the 678 blocks are different. In this case the block read
> by readblock() is always zero, the block read by getblk() has a lot
> of BLK_NOCOPY values, sometimes 4096 (a full block).
> 
> There is one other flag in bp->b_flags given by getblk that is of
> interest: B_CLUSTEROK. This flag is always in sync with the flag
> BV_SCANNED in bp->b_vflags. In the above code snippet I can check this
> flag too before going to disk:
> 
>    if (bp->b_flags & (B_DONE | B_DELWRI | B_CLUSTEROK)) == 0 && ...
> 
> With this patch the problem has gone in all my tests. The resulting
> snapblklist written to the end of the snapfile is always the same.

It is merely coincidence that B_CLUSTEROK is set. But the fact that
skipping the read fixes your problem leads me to believe that checking
for B_CACHE will fix your problem.

> I am not familiar with buffer and memory management of FreeBSD, so I can
> not explain what is going on. But I see in cgaccount() a lot of
> BLK_NOCOPY values are set and in expunge_ufs2() these values should be
> seen again. It seems they are in the buffer cache but sometimes not on
> disk. The patch given above in my test example reads data from buffer
> cache instead of from disk 141 times.

The number of buffer flags have grown over time and their meaning has
subtlely changed. Even I who have been around since the beginning cannot
claim to fully comprehend exactly what they mean. So, the fact that you
find them incomprehensible is unfortunately all too understandable.

> All sync operations in ffs_snapshot() use the parameter MNT_WAIT, the
> gjournal switcher runs all the time but syncs the same partition with
> MNT_NOWAIT (g_journal.c: vfs_msync followed by VFS_SYNC). I do not know
> if this can lead to a race condition between gjournal and snapshot.

I went down this line of thinking with your first email and did not come
up with anything that explained the problem. So I am glad that you have
found something that I do believe explains what is going on.

	Kirk McKusick