ufs snapshot is sometimes corrupt on gjournaled partition
Konstantin Belousov
kostikbel at gmail.com
Sat Jul 29 15:24:51 UTC 2017
On Sat, Jul 29, 2017 at 12:38:13PM +0200, Andreas Longwitz wrote:
> hello Kirk and Kostik,
>
> you are both right.
>
> > So, I really do not understand how you can end up with a buffer with
> > invalid contents if B_CACHE is set because we are open coding bread()
> > here and that is the criterion it uses to read. If however, you do
> > the read when B_CACHE is not set, then as noted you will get the
> > wrong results and crash. But see also my comments below when B_CACHE
> > is not set.
>
> The patch
>
> --- ffs_snapshot.c.1st 2016-06-08 17:25:21.000000000 +0200
> +++ ffs_snapshot.c 2017-07-29 10:51:34.682067000 +0200
> @@ -1403,7 +1403,7 @@
> */
> bp = getblk(cancelvp, lbn, fs->fs_bsize, 0, 0, 0);
> bp->b_blkno = fsbtodb(fs, blkno);
> - if ((bp->b_flags & (B_DONE | B_DELWRI)) == 0 &&
> + if ((bp->b_flags & (B_DONE | B_DELWRI | B_CACHE)) == 0 &&
> (error = readblock(cancelvp, bp, fragstoblks(fs, blkno)))) {
> brelse(bp);
> return (error);
>
> is correct: the buffer always contains valid data when B_CACHE is set
> after calling getblk(). So the calls to readblock() are unnecessary in
> this case. In my first test, the crash of the following rm was probably
> induced by other readblock() calls, but I did not know that at the time.
Ok. In fact it is not clear to me why B_DONE or B_DELWRI is tested at all
there. We only need to know that the buffer content is valid.
>
> > So the way to fix the bug is to read gjournal code and understand why
> > does it sometime returns wrong data.
>
> Yes, gjournal is broken in handling its flush_queue. If we have 10 bios
> in the flush_queue:
> 1 2 3 4 5 6 7 8 9 10
> and another 10 bios go to the flush queue before all of the first 10
> bios have been removed from it, we should end up with something like
> 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20,
> but because of the bug we end up with
> 6 11 12 13 14 15 16 17 18 19 20 7 8 9 10.
> So the sequence of the bios is damaged in the flush queue (and
> therefore in the journal on disk!). This error can be triggered by
> ffs_snapshot() when a block is read with readblock() and gjournal finds
> this block in the broken flush queue before it moves on to the correct
> active queue.
>
> The following patch solved this problem for me:
>
> --- g_journal.c.orig 2017-05-08 14:16:42.000000000 +0200
> +++ g_journal.c 2017-07-28 21:25:58.891881000 +0200
> @@ -1261,7 +1261,7 @@
> strlcpy(hdr.jrh_magic, GJ_RECORD_HEADER_MAGIC,
> sizeof(hdr.jrh_magic));
>
> bioq = &sc->sc_active.jj_queue;
> - pbp = sc->sc_flush_queue;
> + GJQ_LAST(sc->sc_flush_queue, pbp);
>
> fbp = g_alloc_bio();
> fbp->bio_parent = NULL;
>
> with macro GJQ_LAST defined in g_journal.h:
>
> #define GJQ_LAST(head, bp) do { \
> struct bio *_bp; \
> \
> if ((head) == NULL) { \
> (bp) = (head); \
> break; \
> } \
> for (_bp = (head); _bp->bio_next != NULL; _bp = _bp->bio_next);\
> (bp) = (_bp); \
> } while (0)
>
> Some more annotations about g_journal.c:
>
> - The patch given in bug 198500 is necessary to run gjournal with a
> correct and adequate cache size.
> - The cache size for gjournal can be set by two tunables: cache.limit
> and cache.divisor. I do not know the best practice when two variables
> set the same resource. At the moment cache.limit is ignored at boot
> time, because it is overwritten in g_journal_init().
> - In g_journal_flush() there is a write-only variable "size".
> - If one process writes a block synchronously and another process reads
> the same block just after the write starts, it is possible that the
> reader gets the new block before the writer has finished its write call.
> This happens when the write sits on the delayed_queue for some time,
> from where the read gets the data. But the write must wait until the bio
> moves to the current queue, where g_io_deliver() is called. If this is
> not intended, then the delayed queue should not be used for reads.
I recommend that you contact pjd at freebsd.org directly about both the
gjournal patch and your further observations.