fsync: giving up on dirty on ufs partitions running vfs_write_suspend()
Andreas Longwitz
longwitz at incore.de
Thu Sep 21 08:59:00 UTC 2017
Konstantin Belousov wrote:
> On Sat, Sep 16, 2017 at 01:44:44PM +0200, Andreas Longwitz wrote:
>> Ok, I understand your thoughts about the "big loop" and I agree. On the
>> other hand, it is not easy to measure the progress of the dirty buffers,
>> because these buffers are created by another process at the same time we
>> loop in vop_stdfsync(). I can explain from my tests, where I use the
>> following loop on a gjournaled partition:
>>
>> while true; do
>> cp -p bigfile bigfile.tmp
>> rm bigfile
>> mv bigfile.tmp bigfile
>> done
>>
>> When g_journal_switcher starts vfs_write_suspend() immediately after the
>> rm command has begun its work (ufs_inactive, ffs_truncate,
>> ffs_indirtrunc at different levels, ffs_blkfree, ...), then we must loop
>> (that means wait) in vop_stdfsync() until the rm process has finished.
>> A lot of locking overhead is needed for coordination. Returning from
>> bufobj_wwait() we always see one leftover dirty buffer (very seldom
>> two), which is not optimal. Therefore I have tried the following patch
>> (instead of bumping maxretry):
>>
>> --- vfs_default.c.orig	2016-10-24 12:26:57.000000000 +0200
>> +++ vfs_default.c	2017-09-15 12:30:44.792274000 +0200
>> @@ -688,6 +688,8 @@
>>  			bremfree(bp);
>>  			bawrite(bp);
>>  		}
>> +		if (maxretry < 1000)
>> +			DELAY(waitns);
>>  		BO_LOCK(bo);
>>  		goto loop2;
>>  	}
>>
>> with different values for waitns. If I run the test loop 5000 times on
>> my test server, the problem is triggered roughly 10 times per run. The
>> results from several runs are given in the following table:
>>
>> waitns     max time   max loops
>> -------------------------------
>> no DELAY   0.5 sec    8650 (maxretry = 100000)
>> 1000       0.2 sec    24
>> 10000      0.8 sec    3
>> 100000     7.2 sec    3
>>
>> "time" means the time spent in vop_stdfsync(), measured from entry to
>> return by a dtrace script. "loops" means the number of times
>> "--maxretry" is executed. I am not sure whether DELAY() is the best way
>> to wait, or whether waiting has other drawbacks. Anyway, with DELAY()
>> it does not take more than five iterations to finish.
> This is not explicitly stated in your message, but I suppose that
> vop_stdfsync() is called due to the VOP_FSYNC(devvp, MNT_SUSPEND) call
> in ffs_sync(). Am I right?
Yes, the stack trace given by the dtrace script always looks like this:
4 22140 vop_stdfsync:entry
kernel`devfs_fsync+0x26
kernel`VOP_FSYNC_APV+0xa7
kernel`ffs_sync+0x3bb
kernel`vfs_write_suspend+0x1cd
geom_journal.ko`g_journal_switcher+0x9a4
kernel`fork_exit+0x9a
kernel`0xffffffff8095502e
> If yes, then the solution is most likely to continue looping in
> vop_stdfsync() until there are no dirty buffers or the mount point's
> mnt_secondary_writes counter is zero. The pause trick you tried might
> still be useful, e.g. after some threshold of performed loop
> iterations.
I have checked your proposal and found that the mnt_secondary_writes
counter indeed goes to zero when the dirty buffers have reached zero.
During the loop the mnt_secondary_writes counter is always equal to one,
so there is nothing like a countdown, which is what Kirk wanted to see.
A dtrace output (with a DELAY of 1 ms in the loop) for the biggest loop
count in a three-day test is this:
18 32865 kern_unlinkat:entry path=bigfile, tid=101201,
tid=101201, execname=rm
18 12747 ufs_remove:entry gj=mirror/gmbkp4p5.journal,
inum=11155630, blocks=22301568, size=11415525660
18 12748 ufs_remove:return returncode=0, inum=11155630,
blocks=22301568
18 18902 ffs_truncate:entry gj=mirror/gmbkp4p5.journal,
inum=11155630, size=11415525660, mnt_flag=0x12001040,
mnt_kern_flag=0x40006142, blocks=22301568
6 33304 vfs_write_suspend:entry gj=mirror/gmbkp4p5.journal,
mnt_kern_flag=0x40006142, tid=100181
6 22140 vop_stdfsync:entry mounted on /home, waitfor=1,
numoutput=0, clean=10, dirty=6, secondary_writes=1
10 28117 bufobj_wwait:return calls to bufobj_wait = 1,
dirtycnt=2, secondary_writes=1
10 28117 bufobj_wwait:return calls to bufobj_wait = 2,
dirtycnt=1, secondary_writes=1
10 28117 bufobj_wwait:return calls to bufobj_wait = 3,
dirtycnt=1, secondary_writes=1
10 28117 bufobj_wwait:return calls to bufobj_wait = 4,
dirtycnt=3, secondary_writes=1
10 28117 bufobj_wwait:return calls to bufobj_wait = 5,
dirtycnt=2, secondary_writes=1
6 28117 bufobj_wwait:return calls to bufobj_wait = 6,
dirtycnt=3, secondary_writes=1
6 28117 bufobj_wwait:return calls to bufobj_wait = 7,
dirtycnt=3, secondary_writes=1
6 28117 bufobj_wwait:return calls to bufobj_wait = 8,
dirtycnt=3, secondary_writes=1
6 28117 bufobj_wwait:return calls to bufobj_wait = 9,
dirtycnt=3, secondary_writes=1
6 28117 bufobj_wwait:return calls to bufobj_wait = 10,
dirtycnt=2, secondary_writes=1
6 28117 bufobj_wwait:return calls to bufobj_wait = 11,
dirtycnt=2, secondary_writes=1
6 28117 bufobj_wwait:return calls to bufobj_wait = 12,
dirtycnt=3, secondary_writes=1
6 28117 bufobj_wwait:return calls to bufobj_wait = 13,
dirtycnt=3, secondary_writes=1
6 28117 bufobj_wwait:return calls to bufobj_wait = 14,
dirtycnt=3, secondary_writes=1
6 28117 bufobj_wwait:return calls to bufobj_wait = 15,
dirtycnt=4, secondary_writes=1
6 28117 bufobj_wwait:return calls to bufobj_wait = 16,
dirtycnt=3, secondary_writes=1
6 28117 bufobj_wwait:return calls to bufobj_wait = 17,
dirtycnt=3, secondary_writes=1
2 18903 ffs_truncate:return returncode=0, inum=11155630, blocks=0
2 32866 kern_unlinkat:return returncode=0, errno=0, number
io's: 791/791
6 28117 bufobj_wwait:return calls to bufobj_wait = 18,
dirtycnt=3, secondary_writes=0
6 28117 bufobj_wwait:return calls to bufobj_wait = 19,
dirtycnt=0, secondary_writes=0
6 22141 vop_stdfsync:return returncode=0, pid=26, tid=100181,
spent 240373850 nsecs
So the time spent in vop_stdfsync() is 0.24 sec in the worst case I
found using DELAY with 1 ms. I would prefer this solution. My first
approach (simply bumping maxretry from 1000 to 100000) is also ok, but
the maximum time spent then rises to 0.5 sec. Perhaps you would prefer
something like

    if (maxretry < 1000 && maxretry % 10 == 0)
            DELAY(waitns);

That is also ok, but it does not make a noteworthy difference. The main
argument remains: we have to wait until there are no dirty buffers left.
For me the problem with "giving up on dirty" is solved.
--
Andreas Longwitz