ufs snapshot is sometimes corrupt on gjournaled partition
Konstantin Belousov
kostikbel at gmail.com
Sat Jul 29 15:24:51 UTC 2017
On Sat, Jul 29, 2017 at 12:38:13PM +0200, Andreas Longwitz wrote:
> hello Kirk and Kostik,
>
> you are both right.
>
> > So, I really do not understand how you can end up with a buffer with
> > invalid contents if B_CACHE is set because we are open coding bread()
> > here and that is the criterion it uses to read. If however, you do
> > the read when B_CACHE is not set, then as noted you will get the
> > wrong results and crash. But see also my comments below when B_CACHE
> > is not set.
>
> The patch
>
> --- ffs_snapshot.c.1st 2016-06-08 17:25:21.000000000 +0200
> +++ ffs_snapshot.c 2017-07-29 10:51:34.682067000 +0200
> @@ -1403,7 +1403,7 @@
> */
> bp = getblk(cancelvp, lbn, fs->fs_bsize, 0, 0, 0);
> bp->b_blkno = fsbtodb(fs, blkno);
> - if ((bp->b_flags & (B_DONE | B_DELWRI)) == 0 &&
> + if ((bp->b_flags & (B_DONE | B_DELWRI | B_CACHE)) == 0 &&
> (error = readblock(cancelvp, bp, fragstoblks(fs, blkno)))) {
> brelse(bp);
> return (error);
>
> is correct: the buffer always contains valid data when B_CACHE is set
> after calling getblk(). So the calls to readblock() are unnecessary in
> this case. In my first test, the crash of the following rm was probably
> induced by other readblock() calls, but I did not know that at the time.
Ok. In fact it is not clear to me why B_DONE or B_DELWRI is tested at all
there. We only need to know that the buffer content is valid.
>
> > So the way to fix the bug is to read gjournal code and understand why
> > does it sometime returns wrong data.
>
> Yes, gjournal is broken in handling its flush_queue. If we have 10 bios
> in the flush_queue:
> 1 2 3 4 5 6 7 8 9 10
> and another 10 bios go to the flush queue before all of the first 10
> bios have been removed from it, we should end up with something like
> 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20,
> but because of the bug we end up with
> 6 11 12 13 14 15 16 17 18 19 20 7 8 9 10.
> So the sequence of the bios is damaged in the flush queue (and
> therefore in the journal on disk!). This error can be triggered by
> ffs_snapshot() when a block is read with readblock() and gjournal finds
> this block in the broken flush queue before it moves on to the correct
> active queue.
>
> The following patch solved this problem for me:
>
> --- g_journal.c.orig 2017-05-08 14:16:42.000000000 +0200
> +++ g_journal.c 2017-07-28 21:25:58.891881000 +0200
> @@ -1261,7 +1261,7 @@
> strlcpy(hdr.jrh_magic, GJ_RECORD_HEADER_MAGIC,
> sizeof(hdr.jrh_magic));
>
> bioq = &sc->sc_active.jj_queue;
> - pbp = sc->sc_flush_queue;
> + GJQ_LAST(sc->sc_flush_queue, pbp);
>
> fbp = g_alloc_bio();
> fbp->bio_parent = NULL;
>
> with macro GJQ_LAST defined in g_journal.h:
>
> #define GJQ_LAST(head, bp) do { \
> struct bio *_bp; \
> \
> if ((head) == NULL) { \
> (bp) = (head); \
> break; \
> } \
> for (_bp = (head); _bp->bio_next != NULL; _bp = _bp->bio_next);\
> (bp) = (_bp); \
> } while (0)
>
> Some more annotations about g_journal.c:
>
> - The patch given in bug 198500 is necessary to run gjournal with a
> correct and adequate cache size.
> - The cache size for gjournal can be set by two tunables: cache.limit
> and cache.divisor. I do not know the best practice when two variables
> set the same resource. At the moment cache.limit is ignored at boot
> time, because it is overwritten in g_journal_init().
> - In g_journal_flush() there is a write-only variable "size".
> - If one process writes a block synchronously and another process reads
> the same block just after the write starts, it is possible that the
> reader gets the new block before the writer has finished its write call.
> This happens when the write sits on the delayed_queue for some time,
> from where the read gets the data. But the write must wait until the bio
> moves to the current queue, where g_io_deliver() is called. If this is
> not intended, then the delayed queue should not be used for reads.
I recommend that you contact pjd at freebsd.org directly about both the
gjournal patch and your further observations.