ufs snapshot is sometimes corrupt on gjourneled partition
Andreas Longwitz
longwitz at incore.de
Mon Jul 17 08:15:04 UTC 2017
> On Sun, Jul 16, 2017 at 09:33:55AM -0700, Kirk McKusick wrote:
>> I am copying Kostik on this email even though he is on freebsd-fs
>> as I would like him to review my answer below.
>>
>>> Date: Sun, 16 Jul 2017 17:52:39 +0200
>>> From: Andreas Longwitz <longwitz at incore.de>
>>> To: Kirk McKusick <mckusick at mckusick.com>
>>> CC: freebsd-fs at freebsd.org
>>> Subject: Re: ufs snapshot is sometimes corrupt on gjourneled partition
>>>
>>> Hello,
>>>
>>> I like to discuss the following code snippet from function
>>> indiracct_ufs2() in source file ffs_snapshot.c:
>>>
>>> /*
>>> * We have to expand bread here since it will deadlock looking
>>> * up the block number for any blocks that are not in the cache.
>>> */
>>> bp = getblk(cancelvp, lbn, fs->fs_bsize, 0, 0, 0);
>>> bp->b_blkno = fsbtodb(fs, blkno);
>>> if ((bp->b_flags & (B_DONE | B_DELWRI)) == 0 &&
>>> (error = readblock(cancelvp, bp, fragstoblks(fs, blkno)))) {
>>> brelse(bp);
>>> return (error);
>>> }
>>>
>>> In all my tests (all mount types, good or bad result) the flags B_DONE
>>> and B_DELWRI in bp->b_flags were always zero, so the data buffer given by
>>> getblk() at bp->b_data is always overwritten with the data from the
>>> explicit I/O performed by readblock(). By the way the flag B_CACHE was
>>> always set.
>> Key observation here that B_CACHE is set! That means that the data
>> in the block is valid and does not need to be read again. So the fix
>> is this:
>>
>> Index: sys/ufs/ffs/ffs_snapshot.c
>> ===================================================================
>> --- sys/ufs/ffs/ffs_snapshot.c (revision 321045)
>> +++ sys/ufs/ffs/ffs_snapshot.c (working copy)
>> @@ -1394,7 +1394,7 @@
>> */
>> bp = getblk(cancelvp, lbn, fs->fs_bsize, 0, 0, 0);
>> bp->b_blkno = fsbtodb(fs, blkno);
>> - if ((bp->b_flags & (B_DONE | B_DELWRI)) == 0 &&
>> + if ((bp->b_flags & (B_DONE | B_DELWRI | B_CACHE)) == 0 &&
>> (error = readblock(cancelvp, bp, fragstoblks(fs, blkno)))) {
>> brelse(bp);
>> return (error);
>>
This patch was my first try, because the comment in the code snippet
points to B_CACHE. Unfortunaly it did not work, because the getblk()
buffer does not always have the correct data, the following "rm
snapfile" crashed always immediately with
panic: ffs_blkfree_cg: freeing free block.
Therfore I came up with the CLUSTEROK hack.
>> Absent gjournal the extra read was unnecessary but harmless. The
>> buffer exists with B_DELWRI so we will not read it, or it is locked
>> while being written to disk, so we will wait for the write to complete,
>> after which it is safe to read it. With gjournal there is a window of
>> vulnerability between when the buffer is marked B_DELWRI and when it
>> is actually written to disk. This is because gjournal will say the
>> write has completed when it has not yet actually done so. Thus when
>> we read it, we get incorrect results. But until gjournal actually
>> completes the write, it will keep the buffer around (marked B_CACHE),
>> so as long as we check for B_CACHE to avoid the read (as we should
>> anyway to avoid an unnecessary read) we should always get consistent
>> results.
> I tend to disagree with the fix. What if the cached block being
> discarded before we enter this code fragment ?
>
> I believe the bug is indeed clear after your analysis, but it is in
> gjournal. Problem is that the disk (gjournal substrate layer, but for UFS
> it is all disk) returns data different from what was written to the
> volume. After the write bio signalled completion, unless other write
> was done after it, reads _must_ return the content of the write.
>
> The right fix is for gjournal is to perform 'store forwarding', i.e.
> to return data from the journal until journal is flushd to disk.
That makes sense to me and explains why the problem never occurs after a
clean mount, it needs the foregoing "rm snapfile" to trigger the
gjournal bug.
To be complete:
1. I run gjournal above gmirror.
2. I use the patch given in kern/198500.
3. In my loader.conf I have
# use max. 8GB (1/4 of vm.kmem_size) for gjournal cache
kern.geom.journal.cache.divisor="4"
It is a little bit annoying, but setting kern.geom.journal.cache.limit
in loader.conf does not work because its overwritten in gjournal_init().
--
Dr. Andreas Longwitz
More information about the freebsd-fs
mailing list