ufs snapshot is sometimes corrupt on gjourneled partition
Kirk McKusick
mckusick at mckusick.com
Sat Jun 24 22:00:43 UTC 2017
> Date: Sat, 24 Jun 2017 18:30:17 +0300
> From: Konstantin Belousov <kostikbel at gmail.com>
> To: Andreas Longwitz <longwitz at incore.de>
> Subject: Re: ufs snapshot is sometimes corrupt on gjourneled partition
>
> On Fri, Jun 23, 2017 at 11:14:43AM +0200, Andreas Longwitz wrote:
>> I try to understand the cause for the "free inode" problem described in
>> https://lists.freebsd.org/pipermail/freebsd-fs/2013-November/018610.html
>>
>> I have setup a test server (FreeBSD 10.3-STABLE #4 r317936) with a
>> gjournaled partition for /home:
>> mount -t ufs | grep /home -->
>> /dev/mirror/gmsvt7p10.journal on /home (ufs, asynchronous, local, noatime, gjournal)
>
> As the first thing to try: if you perform your tests on the raw
> partition without gjournal, does the problem persist?
I concur that running your test without gjournal is the next test to try.
I think that your suspicion that there is a race condition with gjournal
is likely correct. And if that is true, then the problem will go away
when gjournal is taken out of the stack.
>> My test creates one snapshot of /home (which always gets inode 4) and
>> removes this snapshot:
>>
>> for i in 1 2 3 4 5 6 7 8; do
>> echo "starting snaptest $i" >/dev/console
>> mount -u -o snapshot -o noatime -o async /home/.snap/fscktest /home
>> echo $(ls -ils /home/.snap/fscktest) >/dev/console
>> rm -f /home/.snap/fscktest
>> done
>>
>> I never have more than this one snapshot in use, and during the test no
>> other user processes are working on /home. A typical output looks like
>> this:
>>
>> Jun 21 15:59:52 root: starting snaptest 1
>> Jun 21 15:59:52 root: 4 26592 -r-------- 1 root operator 90762970240 21 Jun 15:59 /home/.snap/fscktest
>> Jun 21 15:59:53 root: starting snaptest 2
>> Jun 21 15:59:53 root: 4 26592 -r-------- 1 root operator 90762970152 21 Jun 15:59 /home/.snap/fscktest
>> Jun 21 15:59:54 kernel: freeing inode /home/4 with 704 blocks
>> Jun 21 15:59:54 root: starting snaptest 3
>> Jun 21 15:59:54 kernel: free inode /home/4 had 704 blocks
>> Jun 21 15:59:54 root: 4 26592 -r-------- 1 root operator 90762969976 21 Jun 15:59 /home/.snap/fscktest
>> Jun 21 15:59:56 kernel: freeing inode /home/4 with 2112 blocks
>> Jun 21 15:59:56 root: starting snaptest 4
>> Jun 21 15:59:56 kernel: free inode /home/4 had 2112 blocks
>> Jun 21 15:59:56 root: 4 26592 -r-------- 1 root operator 90762970240 21 Jun 15:59 /home/.snap/fscktest
>> Jun 21 15:59:57 root: starting snaptest 5
>> Jun 21 15:59:57 root: 4 26592 -r-------- 1 root operator 90762970240 21 Jun 15:59 /home/.snap/fscktest
>> Jun 21 15:59:58 root: starting snaptest 6
>> Jun 21 15:59:58 root: 4 26592 -r-------- 1 root operator 90762970216 21 Jun 15:59 /home/.snap/fscktest
>> Jun 21 15:59:59 kernel: freeing inode /home/4 with 192 blocks
>> Jun 21 15:59:59 root: starting snaptest 7
>> Jun 21 15:59:59 kernel: free inode /home/4 had 192 blocks
>> Jun 21 15:59:59 root: 4 26592 -r-------- 1 root operator 90762970240 21 Jun 16:00 /home/.snap/fscktest
>> Jun 21 16:00:00 root: starting snaptest 8
>> Jun 21 16:00:00 root: 4 26592 -r-------- 1 root operator 90762970240 21 Jun 16:00 /home/.snap/fscktest
>>
>> The "free inode /home/4 had NNN blocks" message during the run of the
>> mount command is output from ffs_valloc(), because ffs_load_inode() has
>> loaded the disk inode 4 with a non-zero i_blocks field. The
>> corresponding "freeing inode /home/4 with NNN blocks" message during
>> the previous rm command is output from the following diagnostic patch I
>> added to ffs_truncate():
>>
>> --- ffs_inode.c.1st 2016-06-08 17:25:21.000000000 +0200
>> +++ ffs_inode.c 2017-06-19 10:02:07.145360000 +0200
>> @@ -551,6 +551,9 @@
>>  		DIP_SET(ip, i_blocks, DIP(ip, i_blocks) - blocksreleased);
>>  	else	/* sanity */
>>  		DIP_SET(ip, i_blocks, 0);
>> +	if (bootverbose == 2 && DIP(ip, i_blocks) > 0)
>> +		printf("freeing inode %s/%lu with %ld blocks\n",
>> +		    fs->fs_fsmnt, (u_long)ip->i_number, (long)DIP(ip, i_blocks));
>>  	ip->i_flag |= IN_CHANGE;
>>  #ifdef QUOTA
>>  	(void) chkdq(ip, -blocksreleased, NOCRED, 0);
>>
>> The rm command can only free all the blocks of the snapshot file (i.e.,
>> i_blocks for inode 4 ends up zero) if this file has the "correct" size:
>>
>> ls -ils /home/.snap/fscktest -->
>> 4 53184 -r-------- 1 root operator 90762970240 Jun 17 06:15 /home/.snap/fscktest
>>
>> The size of the /home partition is given by
>> diskinfo /dev/mirror/gmsvt7p10.journal -->
>> /dev/mirror/gmsvt7p10.journal 512 90762954240 177271395
>>
>> So we have 2769865 full 32kB blocks with total size 90762936320. During
>> creation of a snapshot a "last block" (32kB) is written at this offset,
>> ending at 90762969088. Finally the snapblklist is written with
>> VOP_WRITE: "Write out the list of allocated blocks to the end of the
>> snapshot". In all my correct tests the snapblklist table is 1152 bytes
>> in size, giving the correct size of the snapshot file: 90762970240. In
>> this case the snapblklist table has 144 entries of length 8: one length
>> entry and 143 logical block numbers recorded in mapacct_ufs2():
>>
>> if (acctit && expungetype == BLK_SNAP && blkno != BLK_SNAP)
>> *ip->i_snapblklist++ = lblkno;
>>
>> The console output above shows three error situations, with block
>> counters 704, 2112 and 192. Dividing these values by 8 gives exactly
>> the number of bytes by which the snapblklist at the end of the snapshot
>> file is shortened, so in these cases the snapshot file is corrupt.
>
> I am not sure what you mean by 'match' there. Could you explicitly
> state what relation you noted between the snapshot size and the leaked
> blocks of the freed snapshot inode?
I too am confused here. Are you saying, for example, that 192 / 8 == 24
and that the snapblklist is short by 24 blocks? Because from the table
below, it appears to be short by only 3 blocks.
>> I use a test kernel with some extra counters ct_* in mapacct_ufs2():
>>
>> 	++ct_blkno_all;
>> 	if (blkno == 0)
>> 		++ct_blkno_0;
>> 	if (blkno == BLK_NOCOPY)
>> 		++ct_blkno_nocopy;
>> 	if (blkno == BLK_SNAP)
>> 		++ct_blkno_snap;
>> 	if (blkno == 0 || blkno == BLK_NOCOPY)
>> 		continue;
>> 	if (acctit && expungetype == BLK_SNAP && blkno != BLK_SNAP) {
>> 		*ip->i_snapblklist++ = lblkno;
>> 		++ct_snapblklist;
>> 	}
>> 	if (blkno == BLK_SNAP)
>> 		blkno = blkstofrags(fs, lblkno);
>> 	++ct_blkfree;
>> 	ffs_blkfree(ip->i_ump, fs, vp, blkno, fs->fs_bsize, inum,
>> 	    vp->v_type, NULL);
>>
>> and for the 8 test runs shown above I can see these results using DTrace
>> at probe expunge_ufs2:return (blkno_snap is always 0):
>>
>> test blkno_all blkno_0 blkno_nocopy snapblklist blkfree cg_nocopy
>> -------------------------------------------------------------------
>> 1 ok 2770545 353320 2416404 143 821 2416404
>> 2 bad 2770545 587860 2181875 132 810 2416404
>> 3 bad 2770545 956582 1813175 110 788 2416393
>> 4 ok 2770545 353364 2416360 143 821 2416360
>> 5 ok 2770545 353364 2416360 143 821 2416360
>> 6 bad 2770545 418376 2351351 140 818 2416360
>> 7 ok 2770545 353367 2416357 143 821 2416357
>> 8 ok 2770545 353367 2416357 143 821 2416357
>>
>> For correct tests the sum of blkno_0 and blkno_nocopy is always the
>> same (2769724); for bad tests the counter for blkno_nocopy in
>> particular is significantly lower. In the test table I give one more
>> column, cg_nocopy, for a counter I added in cgaccount() to see how many
>> entries are set to BLK_NOCOPY during the copy of the cylinder group
>> maps:
>>
>> 	if (ffs_isblock(fs, cg_blksfree(cgp), loc)) {
>> 		++ct_cg_nocopy;
>> 		DIP_SET(ip, i_db[loc], BLK_NOCOPY);
>> 	}
>> 	...
>> 	if (ffs_isblock(fs, cg_blksfree(cgp), loc)) {
>> 		++ct_cg_nocopy;
>> 		((ufs2_daddr_t *)(ibp->b_data))[indiroff] =
>> 		    BLK_NOCOPY;
>> 	}
>>
>> For correct tests all the BLK_NOCOPY's set in cgaccount() can later be
>> seen in mapacct_ufs2(); for bad tests many of the BLK_NOCOPY's have
>> changed to 0.
>>
>> It looks like the rm command removing the previous snapshot in some way
>> runs "in the background" simultaneously with expunge_ufs2() and changes
>> some of the BLK_NOCOPY's to zero. So this may be a buffer management
>> problem which only exists on gjournaled partitions; maybe
>> getblk/readblock as used in indiracct_ufs2() is not compatible with
>> gjournal in the special case of creating or removing a snapshot. A hint
>> in this direction is the fact that the first test after cleaning the
>> partition with
>> umount /home; fsck -y /home; mount /home
>> always succeeds. The following modified test procedure never fails:
>>
>> for i in 1 2 3 4 5 6 7 8; do
>> echo "starting snaptest $i" >/dev/console
>> mount -u -o snapshot -o noatime -o async /home/.snap/fscktest /home
>> echo $(ls -ils /home/.snap/fscktest) >/dev/console
>> rm -f /home/.snap/fscktest /home
>> umount /home
>> mount /home
>> done
>
> After the allocations of required blocks for the snapshot inode
> are finished, the filesystem is suspended. You can see the
> call to vfs_write_suspend() in the ffs_snapshot() where the
> suspension is enforced. As part of the suspension, all soft-update
> work items are flushed; this is done by vfs_write_suspend() calling
> VFS_SYNC(MNT_SUSPEND).
>
> UFS classifies writers into primary and secondary. Primary are mostly
> the writes initiated by the top-level syscall entries, like direct
> calls to write(2) or metadata-changing ops mkdir(), create() and so on.
> Secondary are writes performed when the system initiates metadata
> updates during inactivation, quota updates, softdep background
> processing and similar. Primary modifications are blocked outright on
> suspension, while secondary writers are waited for in the mentioned
> VFS_SYNC(MNT_SUSPEND) call.
>
> If you can provide proof that some SU-related activity happens after
> the suspension is established, that would be interesting to see.
> It might be something different and much simpler, but I do not see an
> obvious mistake in the code after reading your observations.
The mount information at the top shows:
/dev/mirror/gmsvt7p10.journal on /home (ufs, asynchronous, local, noatime, gjournal)
Thus no soft updates are being used. We are running on just a basic UFS
filesystem. So, soft update processing has no relevance here.
>> Another proof that the snapshot file is corrupt when the snapblklist
>> is shortened is the fact that the rm command sporadically panics in a
>> kernel routine that is known to be correct:
>>
>> Unread portion of the kernel message buffer:
>> dev = mirror/gmsvt7p10.journal, block = 19727560, fs = /home
>> panic: ffs_blkfree_cg: freeing free block
>> cpuid = 1
>> KDB: stack backtrace:
>> db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0857e3b1c0
>> kdb_backtrace() at kdb_backtrace+0x39/frame 0xfffffe0857e3b270
>> vpanic() at vpanic+0x126/frame 0xfffffe0857e3b2b0
>> panic() at panic+0x43/frame 0xfffffe0857e3b310
>> ffs_blkfree_cg() at ffs_blkfree_cg+0x5c6/frame 0xfffffe0857e3b3d0
>> ffs_blkfree() at ffs_blkfree+0x99/frame 0xfffffe0857e3b430
>> ffs_indirtrunc() at ffs_indirtrunc+0x474/frame 0xfffffe0857e3b510
>> ffs_indirtrunc() at ffs_indirtrunc+0x423/frame 0xfffffe0857e3b5f0
>> ffs_truncate() at ffs_truncate+0x10b4/frame 0xfffffe0857e3b7d0
>> ufs_inactive() at ufs_inactive+0x16b/frame 0xfffffe0857e3b810
>> VOP_INACTIVE_APV() at VOP_INACTIVE_APV+0xf7/frame 0xfffffe0857e3b840
>> vinactive() at vinactive+0xc6/frame 0xfffffe0857e3b890
>> vputx() at vputx+0x27a/frame 0xfffffe0857e3b8f0
>> kern_unlinkat() at kern_unlinkat+0x243/frame 0xfffffe0857e3bae0
>> amd64_syscall() at amd64_syscall+0x2c6/frame 0xfffffe0857e3bbf0
>> Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe0857e3bbf0
>> --- syscall (10, FreeBSD ELF64, sys_unlink), rip = 0x80095425a, rsp = 0x7fffffffe988, rbp = 0x7fffffffea20 ---
>>
>> Any hints solving this problem are welcome.
Per the suggestion at the top, I recommend trying your test without
gjournal present to see if the bug goes away. If that is true, then
we can focus our attention on a possible race during rm in the gjournal
code.
Kirk McKusick