ufs snapshot is sometimes corrupt on gjourneled partition

Tue Jul 18 00:40:44 UTC 2017

> Date: Mon, 17 Jul 2017 10:14:57 +0200
> From: Andreas Longwitz <longwitz at incore.de>
> To: Konstantin Belousov <kostikbel at gmail.com>
> CC: Kirk McKusick <mckusick at mckusick.com>, freebsd-fs at freebsd.org
> Subject: Re: ufs snapshot is sometimes corrupt on gjourneled partition
> 
>> On Sun, Jul 16, 2017 at 09:33:55AM -0700, Kirk McKusick wrote:
>>> I am copying Kostik on this email even though he is on freebsd-fs
>>> as I would like him to review my answer below.
>>>
>>>> Date: Sun, 16 Jul 2017 17:52:39 +0200
>>>> From: Andreas Longwitz <longwitz at incore.de>
>>>> To: Kirk McKusick <mckusick at mckusick.com>
>>>> CC: freebsd-fs at freebsd.org
>>>> Subject: Re: ufs snapshot is sometimes corrupt on gjourneled partition
>>>>
>>>> Hello,
>>>>
>>>> I like to discuss the following code snippet from function
>>>> indiracct_ufs2() in source file ffs_snapshot.c:
>>>>
>>>>     /*
>>>>      * We have to expand bread here since it will deadlock looking
>>>>      * up the block number for any blocks that are not in the cache.
>>>>      */
>>>>     bp = getblk(cancelvp, lbn, fs->fs_bsize, 0, 0, 0);
>>>>     bp->b_blkno = fsbtodb(fs, blkno);
>>>>     if ((bp->b_flags & (B_DONE | B_DELWRI)) == 0 &&
>>>>         (error = readblock(cancelvp, bp, fragstoblks(fs, blkno)))) {
>>>>             brelse(bp);
>>>>             return (error);
>>>>     }
>>>>
>>>> In all my tests (all mount types, good or bad result) the flags B_DONE
>>>> and B_DELWRI in bp->b_flags were always zero, so the data buffer given by
>>>> getblk() at bp->b_data is always overwritten with the data from the
>>>> explicit I/O performed by readblock(). By the way the flag B_CACHE was
>>>> always set.
>>>
>>> Key observation here that B_CACHE is set! That means that the data
>>> in the block is valid and does not need to be read again. So the fix
>>> is this:
>>>
>>> Index: sys/ufs/ffs/ffs_snapshot.c
>>> ===================================================================
>>> --- sys/ufs/ffs/ffs_snapshot.c	(revision 321045)
>>> +++ sys/ufs/ffs/ffs_snapshot.c	(working copy)
>>> @@ -1394,7 +1394,7 @@
>>>  	 */
>>>  	bp = getblk(cancelvp, lbn, fs->fs_bsize, 0, 0, 0);
>>>  	bp->b_blkno = fsbtodb(fs, blkno);
>>> -	if ((bp->b_flags & (B_DONE | B_DELWRI)) == 0 &&
>>> +	if ((bp->b_flags & (B_DONE | B_DELWRI | B_CACHE)) == 0 &&
>>>  	    (error = readblock(cancelvp, bp, fragstoblks(fs, blkno)))) {
>>>  		brelse(bp);
>>>  		return (error);
>>>
> 
> This patch was my first try, because the comment in the code snippet
> points to B_CACHE. Unfortunaly it did not work, because the getblk()
> buffer does not always have the correct data, the following "rm
> snapfile" crashed always immediately with
> 
>   panic: ffs_blkfree_cg: freeing free block.
> 
> Therfore I came up with the CLUSTEROK hack.

Kostik's comment below that this is not a complete solution is
correct. Once gjournal has put the block in its log it releases
the buffer and it may fall out of the cache even though the log
has not been written to backing store. So, it may still be necessary
to get the contents filled. But, if B_CACHE is set, then the contents
should be valid as that is exactly what bread uses to decide if the
buffer needs to be read (see breadn_flags() in /sys/kern/vfs_bio.c).
So, I really do not understand how you can end up with a buffer with
invalid contents if B_CACHE is set because we are open coding bread()
here and that is the criterion it uses to read. If however, you do
the read when B_CACHE is not set, then as noted you will get the
wrong results and crash. But see also my comments below when B_CACHE
is not set.

>>> Absent gjournal the extra read was unnecessary but harmless. The
>>> buffer exists with B_DELWRI so we will not read it, or it is locked
>>> while being written to disk, so we will wait for the write to complete,
>>> after which it is safe to read it. With gjournal there is a window of
>>> vulnerability between when the buffer is marked B_DELWRI and when it
>>> is actually written to disk. This is because gjournal will say the
>>> write has completed when it has not yet actually done so. Thus when
>>> we read it, we get incorrect results. But until gjournal actually
>>> completes the write, it will keep the buffer around (marked B_CACHE),
>>> so as long as we check for B_CACHE to avoid the read (as we should
>>> anyway to avoid an unnecessary read) we should always get consistent
>>> results.
>> I tend to disagree with the fix. What if the cached block being
>> discarded before we enter this code fragment ?
>> 
>> I believe the bug is indeed clear after your analysis, but it is in
>> gjournal.  Problem is that the disk (gjournal substrate layer, but for UFS
>> it is all disk) returns data different from what was written to the
>> volume.  After the write bio signalled completion, unless other write
>> was done after it, reads _must_ return the content of the write.
>> 
>> The right fix is for gjournal is to perform 'store forwarding', i.e.
>> to return data from the journal until journal is flushd to disk.
> 
> That makes sense to me and explains why the problem never occurs after a
> clean mount, it needs the foregoing "rm snapfile" to trigger the
> gjournal bug.
> 
> To be complete:
> 1. I run gjournal above gmirror.
> 2. I use the patch given in kern/198500.
> 3. In my loader.conf I have
>    # use max. 8GB (1/4 of vm.kmem_size) for gjournal cache
>    kern.geom.journal.cache.divisor="4"
> It is a little bit annoying, but setting kern.geom.journal.cache.limit
> in loader.conf does not work because its overwritten in gjournal_init().
> 
> -- 
> Dr. Andreas Longwitz

The sequence of calls when using bread is:

Function	Line	File
--------	----	----
bread		491	sys/buf.h
breadn_flags	1814	kern/vfs_bio.c
bstrategy	397	sys/buf.h
BO_STRATEGY	86	sys/bufobj.h
bufstrategy	4535	kern/vfs_bio.c
ufs_strategy	2290	ufs/ufs/ufs_vnops.c
BO_STRATEGY on filesystem device -> ffs_geom_strategy
ffs_geom_strategy 2141	ufs/ffs/ffs_vfsops.c
g_vfs_strategy	161	geom/geom_vfs.c
g_io_request	470	geom/geom_io.c

Whereas readblock skips all these steps and calls g_io_request
directly. In my looking at the gjournal code, I believe that we
will still enter it with the g_io_request() call as I believe that
it does not hook itself into any of the VOP_ call structure. but I
have not done a backtrace to confirm this fact. Assuming that we
are still getting into g_journal_start(), then it should be possible
to catch reads that are only in the log and pull out the data as
needed.

Another alternative is to send gjournal a request to flush its log
before starting the removal of a snapshot.

	Kirk McKusick