ufs snapshot is sometimes corrupt on gjourneled partition

Mon Jul 17 08:15:04 UTC 2017

> On Sun, Jul 16, 2017 at 09:33:55AM -0700, Kirk McKusick wrote:
>> I am copying Kostik on this email even though he is on freebsd-fs
>> as I would like him to review my answer below.
>>
>>> Date: Sun, 16 Jul 2017 17:52:39 +0200
>>> From: Andreas Longwitz <longwitz at incore.de>
>>> To: Kirk McKusick <mckusick at mckusick.com>
>>> CC: freebsd-fs at freebsd.org
>>> Subject: Re: ufs snapshot is sometimes corrupt on gjourneled partition
>>>
>>> Hello,
>>>
>>> I like to discuss the following code snippet from function
>>> indiracct_ufs2() in source file ffs_snapshot.c:
>>>
>>>     /*
>>>      * We have to expand bread here since it will deadlock looking
>>>      * up the block number for any blocks that are not in the cache.
>>>      */
>>>     bp = getblk(cancelvp, lbn, fs->fs_bsize, 0, 0, 0);
>>>     bp->b_blkno = fsbtodb(fs, blkno);
>>>     if ((bp->b_flags & (B_DONE | B_DELWRI)) == 0 &&
>>>         (error = readblock(cancelvp, bp, fragstoblks(fs, blkno)))) {
>>>             brelse(bp);
>>>             return (error);
>>>     }
>>>
>>> In all my tests (all mount types, good or bad result) the flags B_DONE
>>> and B_DELWRI in bp->b_flags were always zero, so the data buffer given by
>>> getblk() at bp->b_data is always overwritten with the data from the
>>> explicit I/O performed by readblock(). By the way the flag B_CACHE was
>>> always set.
>> Key observation here that B_CACHE is set! That means that the data
>> in the block is valid and does not need to be read again. So the fix
>> is this:
>>
>> Index: sys/ufs/ffs/ffs_snapshot.c
>> ===================================================================
>> --- sys/ufs/ffs/ffs_snapshot.c	(revision 321045)
>> +++ sys/ufs/ffs/ffs_snapshot.c	(working copy)
>> @@ -1394,7 +1394,7 @@
>>  	 */
>>  	bp = getblk(cancelvp, lbn, fs->fs_bsize, 0, 0, 0);
>>  	bp->b_blkno = fsbtodb(fs, blkno);
>> -	if ((bp->b_flags & (B_DONE | B_DELWRI)) == 0 &&
>> +	if ((bp->b_flags & (B_DONE | B_DELWRI | B_CACHE)) == 0 &&
>>  	    (error = readblock(cancelvp, bp, fragstoblks(fs, blkno)))) {
>>  		brelse(bp);
>>  		return (error);
>>

This patch was my first try, because the comment in the code snippet
points to B_CACHE. Unfortunaly it did not work, because the getblk()
buffer does not always have the correct data, the following "rm
snapfile" crashed always immediately with

  panic: ffs_blkfree_cg: freeing free block.

Therfore I came up with the CLUSTEROK hack.

>> Absent gjournal the extra read was unnecessary but harmless. The
>> buffer exists with B_DELWRI so we will not read it, or it is locked
>> while being written to disk, so we will wait for the write to complete,
>> after which it is safe to read it. With gjournal there is a window of
>> vulnerability between when the buffer is marked B_DELWRI and when it
>> is actually written to disk. This is because gjournal will say the
>> write has completed when it has not yet actually done so. Thus when
>> we read it, we get incorrect results. But until gjournal actually
>> completes the write, it will keep the buffer around (marked B_CACHE),
>> so as long as we check for B_CACHE to avoid the read (as we should
>> anyway to avoid an unnecessary read) we should always get consistent
>> results.
> I tend to disagree with the fix. What if the cached block being
> discarded before we enter this code fragment ?
> 
> I believe the bug is indeed clear after your analysis, but it is in
> gjournal.  Problem is that the disk (gjournal substrate layer, but for UFS
> it is all disk) returns data different from what was written to the
> volume.  After the write bio signalled completion, unless other write
> was done after it, reads _must_ return the content of the write.
> 
> The right fix is for gjournal is to perform 'store forwarding', i.e.
> to return data from the journal until journal is flushd to disk.

That makes sense to me and explains why the problem never occurs after a
clean mount, it needs the foregoing "rm snapfile" to trigger the
gjournal bug.

To be complete:
1. I run gjournal above gmirror.
2. I use the patch given in kern/198500.
3. In my loader.conf I have
   # use max. 8GB (1/4 of vm.kmem_size) for gjournal cache
   kern.geom.journal.cache.divisor="4"
It is a little bit annoying, but setting kern.geom.journal.cache.limit
in loader.conf does not work because its overwritten in gjournal_init().

-- 
Dr. Andreas Longwitz