fsync: giving up on dirty on ufs partitions running vfs_write_suspend()
Andreas Longwitz
longwitz at incore.de
Fri Sep 22 10:02:22 UTC 2017
Konstantin Belousov wrote:
> On Thu, Sep 21, 2017 at 10:58:46AM +0200, Andreas Longwitz wrote:
>> Konstantin Belousov wrote:
>>> On Sat, Sep 16, 2017 at 01:44:44PM +0200, Andreas Longwitz wrote:
>>>> Ok, I understand your thoughts about the "big loop" and I agree. On the
>>>> other side, it is not easy to measure the progress of the dirty buffers,
>>>> because these buffers are created by another process at the same time as
>>>> we loop in vop_stdfsync(). I can explain from my tests, where I use the
>>>> following loop on a gjournaled partition:
>>>>
>>>> while true; do
>>>>   cp -p bigfile bigfile.tmp
>>>>   rm bigfile
>>>>   mv bigfile.tmp bigfile
>>>> done
>>>>
>>>> When g_journal_switcher starts vfs_write_suspend() immediately after the
>>>> rm command has started to do its "rm stuff" (ufs_inactive, ffs_truncate,
>>>> ffs_indirtrunc at different levels, ffs_blkfree, ...), then we must loop
>>>> (that means wait) in vop_stdfsync() until the rm process has finished
>>>> its work. A lot of locking overhead is needed for coordination.
>>>> Returning from bufobj_wwait() we always see one leftover dirty buffer
>>>> (very seldom two), which is not optimal. Therefore I have tried the
>>>> following patch (instead of bumping maxretry):
>>>>
>>>> --- vfs_default.c.orig 2016-10-24 12:26:57.000000000 +0200
>>>> +++ vfs_default.c 2017-09-15 12:30:44.792274000 +0200
>>>> @@ -688,6 +688,8 @@
>>>> bremfree(bp);
>>>> bawrite(bp);
>>>> }
>>>> + if (maxretry < 1000)
>>>> + DELAY(waitns);
>>>> BO_LOCK(bo);
>>>> goto loop2;
>>>> }
>>>>
>>>> with different values for waitns. If I run the test loop 5000 times on my
>>>> test server, the problem is triggered always round about 10 times. The
>>>> results from several runs are given in the following table:
>>>>
>>>> waitns     max time   max loops
>>>> -------------------------------
>>>> no DELAY   0.5 sec    8650  (maxretry = 100000)
>>>> 1000       0.2 sec    24
>>>> 10000      0.8 sec    3
>>>> 100000     7.2 sec    3
>>>>
>>>> "time" means the time spent in vop_stdfsync(), measured from entry to
>>>> return by a dtrace script. "loops" means the number of times "--maxretry"
>>>> is executed. I am not sure if DELAY() is the best way to wait or if
>>>> waiting has other drawbacks. Anyway, with DELAY() it does not take more
>>>> than five iterations to finish.
>>> This is not explicitly stated in your message, but I suppose that the
>>> vop_stdfsync() is called due to VOP_FSYNC(devvp, MNT_SUSPEND) call in
>>> ffs_sync(). Am I right?
>> Yes, the stack trace given by dtrace script looks always like this:
>> 4 22140 vop_stdfsync:entry
>> kernel`devfs_fsync+0x26
>> kernel`VOP_FSYNC_APV+0xa7
>> kernel`ffs_sync+0x3bb
>> kernel`vfs_write_suspend+0x1cd
>> geom_journal.ko`g_journal_switcher+0x9a4
>> kernel`fork_exit+0x9a
>> kernel`0xffffffff8095502e
>>
>>
>>> If yes, then the solution is most likely to continue looping in
>>> vop_stdfsync() until there are no dirty buffers or the mount point's
>>> mnt_secondary_writes counter is zero. The pause trick you tried might
>>> still be useful, e.g. after some threshold of performed loop
>>> iterations.
>> I have checked your proposal and found that indeed the
>> mnt_secondary_writes counter goes to zero when the dirty buffers have
>> reached zero. During the loop the mnt_secondary_writes counter is
>> always equal to one, so there is nothing like the countdown that Kirk
>> wanted to see.
> This is because mnt_secondary_writes counts the number of threads which
> entered the vn_start_secondary_write() block and potentially can issue
> a write dirtying a buffer. In principle, some writer may start the
> secondary write block again even if the counter is zero, but practically
> some primary writer must make a modification for secondary writers to
> have work.
>
> I.e., the change would not cover the problem well enough to claim it is
> completely solved, but for the current UFS code I doubt that the issue
> can be triggered.
>
>> A dtrace output (with DELAY of 1ms in the loop) for the biggest
>> loop count on a three day test is this:
>>
>> 18 32865 kern_unlinkat:entry path=bigfile, tid=101201,
>> tid=101201, execname=rm
>> 18 12747 ufs_remove:entry gj=mirror/gmbkp4p5.journal,
>> inum=11155630, blocks=22301568, size=11415525660
>> 18 12748 ufs_remove:return returncode=0, inum=11155630,
>> blocks=22301568
>> 18 18902 ffs_truncate:entry gj=mirror/gmbkp4p5.journal,
>> inum=11155630, size=11415525660, mnt_flag=0x12001040,
>> mnt_kern_flag=0x40006142, blocks=22301568
>> 6 33304 vfs_write_suspend:entry gj=mirror/gmbkp4p5.journal,
>> mnt_kern_flag=0x40006142, tid=100181
>> 6 22140 vop_stdfsync:entry mounted on /home, waitfor=1,
>> numoutput=0, clean=10, dirty=6, secondary_writes=1
>> 10 28117 bufobj_wwait:return calls to bufobj_wait = 1,
>> dirtycnt=2, secondary_writes=1
>> 10 28117 bufobj_wwait:return calls to bufobj_wait = 2,
>> dirtycnt=1, secondary_writes=1
>> 10 28117 bufobj_wwait:return calls to bufobj_wait = 3,
>> dirtycnt=1, secondary_writes=1
>> 10 28117 bufobj_wwait:return calls to bufobj_wait = 4,
>> dirtycnt=3, secondary_writes=1
>> 10 28117 bufobj_wwait:return calls to bufobj_wait = 5,
>> dirtycnt=2, secondary_writes=1
>> 6 28117 bufobj_wwait:return calls to bufobj_wait = 6,
>> dirtycnt=3, secondary_writes=1
>> 6 28117 bufobj_wwait:return calls to bufobj_wait = 7,
>> dirtycnt=3, secondary_writes=1
>> 6 28117 bufobj_wwait:return calls to bufobj_wait = 8,
>> dirtycnt=3, secondary_writes=1
>> 6 28117 bufobj_wwait:return calls to bufobj_wait = 9,
>> dirtycnt=3, secondary_writes=1
>> 6 28117 bufobj_wwait:return calls to bufobj_wait = 10,
>> dirtycnt=2, secondary_writes=1
>> 6 28117 bufobj_wwait:return calls to bufobj_wait = 11,
>> dirtycnt=2, secondary_writes=1
>> 6 28117 bufobj_wwait:return calls to bufobj_wait = 12,
>> dirtycnt=3, secondary_writes=1
>> 6 28117 bufobj_wwait:return calls to bufobj_wait = 13,
>> dirtycnt=3, secondary_writes=1
>> 6 28117 bufobj_wwait:return calls to bufobj_wait = 14,
>> dirtycnt=3, secondary_writes=1
>> 6 28117 bufobj_wwait:return calls to bufobj_wait = 15,
>> dirtycnt=4, secondary_writes=1
>> 6 28117 bufobj_wwait:return calls to bufobj_wait = 16,
>> dirtycnt=3, secondary_writes=1
>> 6 28117 bufobj_wwait:return calls to bufobj_wait = 17,
>> dirtycnt=3, secondary_writes=1
>> 2 18903 ffs_truncate:return returncode=0, inum=11155630, blocks=0
>> 2 32866 kern_unlinkat:return returncode=0, errno=0, number
>> io's: 791/791
>> 6 28117 bufobj_wwait:return calls to bufobj_wait = 18,
>> dirtycnt=3, secondary_writes=0
>> 6 28117 bufobj_wwait:return calls to bufobj_wait = 19,
>> dirtycnt=0, secondary_writes=0
>> 6 22141 vop_stdfsync:return returncode=0, pid=26, tid=100181,
>> spent 240373850 nsecs
>>
>> So the time spent in vop_stdfsync() is 0.24 sec in the worst case I
>> found using DELAY with 1 ms. I would prefer this solution. My first
>> approach (simply bumping maxretry from 1000 to 100000) is also ok, but
>> the maximum time spent then rises to 0.5 sec. Perhaps you would like
>> something like
>>
>> if (maxretry < 1000 && maxretry % 10 == 0)
>>         DELAY(waitns);
>>
>> That is also ok but does not make a noteworthy difference. The main
>> argument remains: we have to wait until there are no dirty buffers left.
>>
>> For me the problem with the "giving up on dirty" is solved.
> Will you provide the patch?
Patch against HEAD:
--- vfs_default.c.orig 2017-09-22 11:56:26.950084000 +0200
+++ vfs_default.c 2017-09-22 11:58:33.211196000 +0200
@@ -690,6 +690,8 @@
bremfree(bp);
bawrite(bp);
}
+ if (maxretry < 1000)
+ DELAY(1000); /* 1 ms */
BO_LOCK(bo);
goto loop2;
}
--
Andreas Longwitz