ufs snapshot is sometimes corrupt on gjourneled partition

Kirk McKusick mckusick at mckusick.com
Sat Jun 24 22:00:43 UTC 2017


> Date: Sat, 24 Jun 2017 18:30:17 +0300
> From: Konstantin Belousov <kostikbel at gmail.com>
> To: Andreas Longwitz <longwitz at incore.de>
> Subject: Re: ufs snapshot is sometimes corrupt on gjourneled partition
> 
> On Fri, Jun 23, 2017 at 11:14:43AM +0200, Andreas Longwitz wrote:
>> I am trying to understand the cause of the "free inode" problem described in
>>   https://lists.freebsd.org/pipermail/freebsd-fs/2013-November/018610.html
>> 
>> I have setup a test server (FreeBSD 10.3-STABLE #4 r317936) with a
>> gjournaled partition for /home:
>> mount -t ufs | grep /home -->
>> /dev/mirror/gmsvt7p10.journal on /home (ufs, asynchronous, local,
>> noatime, gjournal)
>
> As the first thing to try: if you perform your tests on the raw
> partition without gjournal, does the problem stay around?

I concur that running your test without gjournal is the next thing to
try. I think your suspicion that there is a race condition with
gjournal is likely correct. If that is true, then the problem will go
away when gjournal is taken out of the stack.

>> My test creates one snapshot of /home (it always gets inode 4) and
>> removes this snapshot:
>> 
>>   for i in 1 2 3 4 5 6 7 8; do
>>      echo "starting snaptest $i" >/dev/console
>>      mount -u -o snapshot -o noatime -o async /home/.snap/fscktest /home
>>      echo $(ls -ils /home/.snap/fscktest) >/dev/console
>>      rm -f /home/.snap/fscktest
>>   done
>> 
>> I never have more than this one snapshot active, and during the test
>> no other user processes are working on /home. A typical output looks
>> like this:
>> 
>> Jun 21 15:59:52 root: starting snaptest 1
>> Jun 21 15:59:52 root: 4 26592 -r--------  1 root  operator  90762970240
>> 21 Jun 15:59 /home/.snap/fscktest
>> Jun 21 15:59:53 root: starting snaptest 2
>> Jun 21 15:59:53 root: 4 26592 -r--------  1 root  operator  90762970152
>> 21 Jun 15:59 /home/.snap/fscktest
>> Jun 21 15:59:54 kernel: freeing inode /home/4 with 704 blocks
>> Jun 21 15:59:54 root: starting snaptest 3
>> Jun 21 15:59:54 kernel: free inode /home/4 had 704 blocks
>> Jun 21 15:59:54 root: 4 26592 -r--------  1 root  operator  90762969976
>> 21 Jun 15:59 /home/.snap/fscktest
>> Jun 21 15:59:56 kernel: freeing inode /home/4 with 2112 blocks
>> Jun 21 15:59:56 root: starting snaptest 4
>> Jun 21 15:59:56 kernel: free inode /home/4 had 2112 blocks
>> Jun 21 15:59:56 root: 4 26592 -r--------  1 root  operator  90762970240
>> 21 Jun 15:59 /home/.snap/fscktest
>> Jun 21 15:59:57 root: starting snaptest 5
>> Jun 21 15:59:57 root: 4 26592 -r--------  1 root  operator  90762970240
>> 21 Jun 15:59 /home/.snap/fscktest
>> Jun 21 15:59:58 root: starting snaptest 6
>> Jun 21 15:59:58 root: 4 26592 -r--------  1 root  operator  90762970216
>> 21 Jun 15:59 /home/.snap/fscktest
>> Jun 21 15:59:59 kernel: freeing inode /home/4 with 192 blocks
>> Jun 21 15:59:59 root: starting snaptest 7
>> Jun 21 15:59:59 kernel: free inode /home/4 had 192 blocks
>> Jun 21 15:59:59 root: 4 26592 -r--------  1 root  operator  90762970240
>> 21 Jun 16:00 /home/.snap/fscktest
>> Jun 21 16:00:00 root: starting snaptest 8
>> Jun 21 16:00:00 root: 4 26592 -r--------  1 root  operator  90762970240
>> 21 Jun 16:00 /home/.snap/fscktest
>> 
>> The "free inode /home/4 had NNN blocks" message during run of the mount
>> command is output of ffs_valloc(), because ffs_load_inode() has load the
>> disk inode 4 with a non zero i_blocks field. The corresponding "freeing
>> inode /home/4 with NNN blocks" message during the previous rm command
>> is output of my following diagnostic patch in function ffs_truncate():
>> 
>> --- ffs_inode.c.1st     2016-06-08 17:25:21.000000000 +0200
>> +++ ffs_inode.c 2017-06-19 10:02:07.145360000 +0200
>> @@ -551,6 +551,9 @@
>>                 DIP_SET(ip, i_blocks, DIP(ip, i_blocks) - blocksreleased);
>>         else    /* sanity */
>>                 DIP_SET(ip, i_blocks, 0);
>> +       if (bootverbose == 2 && DIP(ip, i_blocks) > 0)
>> +               printf("freeing inode %s/%lu with %ld blocks\n",
>> +                       fs->fs_fsmnt, (u_long)ip->i_number,
>> +                       (long)DIP(ip, i_blocks));
>>         ip->i_flag |= IN_CHANGE;
>>  #ifdef QUOTA
>>         (void) chkdq(ip, -blocksreleased, NOCRED, 0);
>> 
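
For reference, the "free inode %s/%lu had %ld blocks" message quoted
above comes from a long-standing sanity check in ffs_valloc(); from
memory it looks approximately like this (details may vary between
releases):

    if (DIP(ip, i_blocks) != 0) {                          /* XXX */
            printf("free inode %s/%lu had %ld blocks\n",
                fs->fs_fsmnt, (u_long)ino, (long)DIP(ip, i_blocks));
            DIP_SET(ip, i_blocks, 0);
    }
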
>> The rm command can only free all the blocks of the snapshot file
>> (meaning i_blocks for inode 4 ends at zero) if this file has the
>> "correct" size:
>> 
>> ls -ils /home/.snap/fscktest -->
>> 4 53184 -r--------  1 root  operator  90762970240 Jun 17 06:15
>> /home/.snap/fscktest
>> 
>> The size of the /home partition is given by
>> diskinfo /dev/mirror/gmsvt7p10.journal -->
>> /dev/mirror/gmsvt7p10.journal   512     90762954240     177271395
>> 
>> So we have 2769865 full 32kB blocks with total size 90762936320.
>> While creating a snapshot, a "last block" (32kB) is written at this
>> offset, ending at 90762969088. Finally the snapblklist is written with
>> VOP_WRITE: "Write out the list of allocated blocks to the end of the
>> snapshot". In all my correct tests the snapblklist table is 1152 bytes
>> in size, giving the correct size of the snapshot file: 90762970240. In
>> this case the snapblklist table has 144 entries of 8 bytes each: one
>> length entry and 143 logical block numbers recorded in mapacct_ufs2():
>> 
>>           if (acctit && expungetype == BLK_SNAP && blkno != BLK_SNAP)
>>                    *ip->i_snapblklist++ = lblkno;
>> 
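
As a cross-check of the size arithmetic above, here is a trivial
stand-alone sketch (a hypothetical userland helper, not kernel code)
that reproduces the expected snapshot file size from the figures given:

    #include <stdio.h>

    int
    main(void)
    {
            long long bsize = 32768;          /* filesystem block size */
            long long nfull = 2769865;        /* full 32kB blocks in the partition */
            long long last = nfull * bsize;   /* offset of the "last block": 90762936320 */
            long long list = last + bsize;    /* offset of the snapblklist: 90762969088 */
            long long size = list + 144 * 8;  /* plus 144 8-byte entries: 90762970240 */

            printf("last=%lld list=%lld size=%lld\n", last, list, size);
            return (0);
    }
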
>> The console output above shows three error situations, with block
>> counters 704, 2112 and 192. Dividing these values by 8 gives exactly
>> the number of bytes missing from the snapblklist at the end of the
>> snapshot file, so in these cases the snapshot file is corrupt.
>
> I am not sure what you mean by 'match' there.  Could you explicitly
> state what relation between the snapshot size and the leaked blocks of
> the freed snapshot inode you noted?

I too want to be sure I am following here. Are you saying, for example,
that 192 / 8 == 24 missing bytes, i.e. three 8-byte entries? That would
match the table below, where that run's snapblklist count is short by
exactly 3 (140 versus 143).

>> I use a test kernel with some extra counters ct_* in mapacct_ufs2():
>> 
>>            ++ct_blkno_all;
>>            if (blkno == 0)
>>                    ++ct_blkno_0;
>>            if (blkno == BLK_NOCOPY)
>>                    ++ct_blkno_nocopy;
>>            if (blkno == BLK_SNAP)
>>                    ++ct_blkno_snap;
>>            if (blkno == 0 || blkno == BLK_NOCOPY)
>>                    continue;
>>            if (acctit && expungetype == BLK_SNAP && blkno != BLK_SNAP) {
>>                    *ip->i_snapblklist++ = lblkno;
>>                    ++ct_snapblklist;
>>            }
>>            if (blkno == BLK_SNAP)
>>                    blkno = blkstofrags(fs, lblkno);
>>            ++ct_blkfree;
>>            ffs_blkfree(ip->i_ump, fs, vp, blkno, fs->fs_bsize, inum,
>>                vp->v_type, NULL);
>> 
>> and for the 8 test runs shown above I can see these results using DTrace
>> at probe expunge_ufs2:return (blkno_snap is always 0):
>> 
>>  test   blkno_all blkno_0 blkno_nocopy snapblklist blkfree cg_nocopy
>>  -------------------------------------------------------------------
>>  1 ok   2770545   353320  2416404      143         821     2416404
>>  2 bad  2770545   587860  2181875      132         810     2416404
>>  3 bad  2770545   956582  1813175      110         788     2416393
>>  4 ok   2770545   353364  2416360      143         821     2416360
>>  5 ok   2770545   353364  2416360      143         821     2416360
>>  6 bad  2770545   418376  2351351      140         818     2416360
>>  7 ok   2770545   353367  2416357      143         821     2416357
>>  8 ok   2770545   353367  2416357      143         821     2416357
>> 
>> For correct tests the sum of blkno_0 and blkno_nocopy is always the
>> same (2769724); for bad tests, the blkno_nocopy counter in particular
>> is significantly lower. In the test table I give one more column,
>> cg_nocopy, for a counter I added in cgaccount() to see how many
>> entries are set to BLK_NOCOPY during the copy of the cylinder group
>> maps:
>> 
>>                 if (ffs_isblock(fs, cg_blksfree(cgp), loc)) {
>>                         ++ct_cg_nocopy;
>>                         DIP_SET(ip, i_db[loc], BLK_NOCOPY);
>>                 }
>>                 ...
>>                 if (ffs_isblock(fs, cg_blksfree(cgp), loc)) {
>>                         ++ct_cg_nocopy;
>>                         ((ufs2_daddr_t *)(ibp->b_data))[indiroff] =
>>                             BLK_NOCOPY;
>>                 }
>> 
>> For correct tests, all the BLK_NOCOPYs (marking blocks that were free
>> at snapshot time and therefore never need to be copied) set in
>> cgaccount() can later be seen in mapacct_ufs2(); for bad tests, many
>> of the BLK_NOCOPYs have changed to 0.
>> 
>> It looks like the rm command removing the previous snapshot somehow
>> runs "in the background" simultaneously with expunge_ufs2() and
>> changes some of the BLK_NOCOPYs to zero. So this may be a buffer
>> management problem that only exists on gjournaled partitions; maybe
>> the getblk/readblock path used in indiracct_ufs2() is not compatible
>> with gjournal in the special case of creating or removing a snapshot.
>> A hint in this direction is the fact that the first test after
>> cleaning the partition with
>>         umount /home; fsck -y /home; mount /home
>> always succeeds. The following modified test procedure never fails:
>> 
>>   for i in 1 2 3 4 5 6 7 8; do
>>      echo "starting snaptest $i" >/dev/console
>>      mount -u -o snapshot -o noatime -o async /home/.snap/fscktest /home
>>      echo $(ls -ils /home/.snap/fscktest) >/dev/console
>>      rm -f /home/.snap/fscktest
>>      umount /home
>>      mount /home
>>   done
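
For context on the getblk/readblock suspicion above: the indirect-block
scan in indiracct_ufs2() prepares each indirect block roughly as
follows (paraphrased from ffs_snapshot.c; treat the details as
approximate):

    /*
     * getblk() returns the buffer without going to disk; readblock()
     * is only called when the buffer was not already cached.
     */
    bp = getblk(cancelvp, lbn, fs->fs_bsize, 0, 0, 0);
    bp->b_blkno = fsbtodb(fs, blkno);
    if ((bp->b_flags & B_CACHE) == 0 &&
        (error = readblock(cancelvp, bp, fragstoblks(fs, blkno)))) {
            brelse(bp);
            return (error);
    }

If a stale cached buffer can survive the removal of the previous
snapshot, this path would consume it without going to disk, which is
consistent with the hypothesis above.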
>
> After the allocation of the required blocks for the snapshot inode is
> finished, the filesystem is suspended. You can see the call to
> vfs_write_suspend() in ffs_snapshot(), where the suspension is
> enforced. As part of the suspension, all soft-update workitems are
> flushed; this is done by vfs_write_suspend() calling
> VFS_SYNC(MNT_SUSPEND).
> 
> UFS classifies writers into primary and secondary. Primary writes are
> mostly those initiated by top-level syscall entries, like direct calls
> to write(2) or metadata-changing operations such as mkdir(), create()
> and so on. Secondary writes are performed when the system initiates
> metadata updates during inactivation, quota updates, softdep
> background processing and similar. Primary modifications are blocked
> outright on suspension, while secondary ones are waited on to finish
> in the mentioned VFS_SYNC(MNT_SUSPEND) call.
> 
> If you can provide proof that some SU-related activity happens after
> the suspension is established, that would be interesting to see. It
> might be something different and much simpler, but I do not see an
> obvious mistake in the code after reading your observations.

The mount information at the top shows:

/dev/mirror/gmsvt7p10.journal on /home (ufs, asynchronous, local, noatime, gjournal)

Thus no soft updates are in use; we are running on just a basic UFS
filesystem, so soft-update processing has no relevance here.

>> Further proof that the snapshot file is corrupt when the snapblklist
>> is shortened is the fact that the rm command sporadically panics in a
>> kernel routine that is known to be correct:
>> 
>> Unread portion of the kernel message buffer:
>> dev = mirror/gmsvt7p10.journal, block = 19727560, fs = /home
>> panic: ffs_blkfree_cg: freeing free block
>> cpuid = 1
>> KDB: stack backtrace:
>> db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame
>> 0xfffffe0857e3b1c0
>> kdb_backtrace() at kdb_backtrace+0x39/frame 0xfffffe0857e3b270
>> vpanic() at vpanic+0x126/frame 0xfffffe0857e3b2b0
>> panic() at panic+0x43/frame 0xfffffe0857e3b310
>> ffs_blkfree_cg() at ffs_blkfree_cg+0x5c6/frame 0xfffffe0857e3b3d0
>> ffs_blkfree() at ffs_blkfree+0x99/frame 0xfffffe0857e3b430
>> ffs_indirtrunc() at ffs_indirtrunc+0x474/frame 0xfffffe0857e3b510
>> ffs_indirtrunc() at ffs_indirtrunc+0x423/frame 0xfffffe0857e3b5f0
>> ffs_truncate() at ffs_truncate+0x10b4/frame 0xfffffe0857e3b7d0
>> ufs_inactive() at ufs_inactive+0x16b/frame 0xfffffe0857e3b810
>> VOP_INACTIVE_APV() at VOP_INACTIVE_APV+0xf7/frame 0xfffffe0857e3b840
>> vinactive() at vinactive+0xc6/frame 0xfffffe0857e3b890
>> vputx() at vputx+0x27a/frame 0xfffffe0857e3b8f0
>> kern_unlinkat() at kern_unlinkat+0x243/frame 0xfffffe0857e3bae0
>> amd64_syscall() at amd64_syscall+0x2c6/frame 0xfffffe0857e3bbf0
>> Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe0857e3bbf0
>> --- syscall (10, FreeBSD ELF64, sys_unlink), rip = 0x80095425a, rsp =
>> 0x7fffffffe988, rbp = 0x7fffffffea20 ---
>> 
>> Any hints for solving this problem are welcome.
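
For reference, that panic comes from the duplicate-free sanity check in
ffs_blkfree_cg() (ffs_alloc.c); from memory it looks approximately like
this:

    fragno = fragstoblks(fs, cgbno);
    if (!ffs_isfreeblock(fs, blksfree, fragno)) {
            /* Some fragment of this block is already marked free. */
            printf("dev = %s, block = %jd, fs = %s\n",
                devtoname(dev), (intmax_t)bno, fs->fs_fsmnt);
            panic("ffs_blkfree_cg: freeing free block");
    }

That matches the "dev = ..., block = ..., fs = ..." line captured in
the message buffer just before the panic.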

Per the suggestion at the top, I recommend trying your test without
gjournal present to see whether the bug goes away. If it does, then we
can focus our attention on a possible race in the gjournal code during
rm.

	Kirk McKusick

