fsync: giving up on dirty on ufs partitions running vfs_write_suspend()

Don Lewis truckman at FreeBSD.org
Fri Sep 22 22:20:26 UTC 2017


On 22 Sep, Andreas Longwitz wrote:
> Konstantin Belousov schrieb:
>> On Thu, Sep 21, 2017 at 10:58:46AM +0200, Andreas Longwitz wrote:
>>> Konstantin Belousov wrote:
>>>> On Sat, Sep 16, 2017 at 01:44:44PM +0200, Andreas Longwitz wrote:
>>>>> Ok, I understand your thoughts about the "big loop" and I agree. On the
>>>>> other side it is not easy to measure the progress of the dirty buffers,
>>>>> because these buffers are created by another process at the same time we
>>>>> loop in vop_stdfsync(). I can explain from my tests, where I use the
>>>>> following loop on a gjournaled partition:
>>>>>
>>>>>    while true; do
>>>>>       cp -p bigfile bigfile.tmp
>>>>>       rm bigfile
>>>>>       mv bigfile.tmp bigfile
>>>>>    done
>>>>>
>>>>> When g_journal_switcher starts vfs_write_suspend() immediately after the
>>>>> rm command has started to do its "rm stuff" (ufs_inactive, ffs_truncate,
>>>>> ffs_indirtrunc at different levels, ffs_blkfree, ...), then we must loop
>>>>> (that means wait) in vop_stdfsync() until the rm process has finished
>>>>> its work. A lot of locking overhead is needed for coordination.
>>>>> Returning from bufobj_wwait() we always see one remaining dirty buffer (very
>>>>> seldom two), which is not optimal. Therefore I have tried the following
>>>>> patch (instead of bumping maxretry):
>>>>>
>>>>> --- vfs_default.c.orig  2016-10-24 12:26:57.000000000 +0200
>>>>> +++ vfs_default.c       2017-09-15 12:30:44.792274000 +0200
>>>>> @@ -688,6 +688,8 @@
>>>>>                         bremfree(bp);
>>>>>                         bawrite(bp);
>>>>>                 }
>>>>> +               if (maxretry < 1000)
>>>>> +                       DELAY(waitns);
>>>>>                 BO_LOCK(bo);
>>>>>                 goto loop2;
>>>>>         }
>>>>>
>>>>> with different values for waitns. If I run the test loop 5000 times on my
>>>>> test server, the problem is triggered roughly 10 times per run. The
>>>>> results from several runs are given in the following table:
>>>>>
>>>>>     waitns    max time   max loops
>>>>>     -------------------------------
>>>>>   no DELAY     0.5 sec    8650  (maxretry = 100000)
>>>>>       1000     0.2 sec      24
>>>>>      10000     0.8 sec       3
>>>>>     100000     7.2 sec       3
>>>>>
>>>>> "time" means the time spent in vop_stdfsync(), measured from entry to
>>>>> return by a dtrace script. "loops" means the number of times "--maxretry"
>>>>> is executed. I am not sure if DELAY() is the best way to wait or if waiting
>>>>> has other drawbacks. Anyway, with DELAY() it does not take more than five
>>>>> iterations to finish.
>>>> This is not explicitly stated in your message, but I suppose that the
>>>> vop_stdfsync() is called due to VOP_FSYNC(devvp, MNT_SUSPEND) call in
>>>> ffs_sync().  Am I right ?
>>> Yes, the stack trace given by dtrace script looks always like this:
>>>  4  22140               vop_stdfsync:entry
>>>               kernel`devfs_fsync+0x26
>>>               kernel`VOP_FSYNC_APV+0xa7
>>>               kernel`ffs_sync+0x3bb
>>>               kernel`vfs_write_suspend+0x1cd
>>>               geom_journal.ko`g_journal_switcher+0x9a4
>>>               kernel`fork_exit+0x9a
>>>               kernel`0xffffffff8095502e
>>>
>>>
>>>> If yes, then the solution is most likely to continue looping in
>>>> vop_stdfsync() until there are no dirty buffers or the mount point's
>>>> mnt_secondary_writes counter is zero. The pause trick you tried might
>>>> still be useful, e.g. after some threshold of performed loop
>>>> iterations.
>>> I have checked your proposal and found that indeed the
>>> mnt_secondary_writes counter goes to zero when the dirty count has reached
>>> zero. During the loop the mnt_secondary_writes counter is always equal to
>>> one, so there is nothing like a countdown, and that is what Kirk wanted to
>>> see.
>> This is because mnt_secondary_writes counts the number of threads which have
>> entered the vn_start_secondary_write() block and can potentially issue a
>> write dirtying a buffer.  In principle, some writer may start the secondary
>> write block again even if the counter is zero, but practically some
>> primary writer must make a modification for secondary writers to have
>> work.
>> 
>> I.e., the change would not cover the problem enough to claim it completely
>> solved, but for the current UFS code I doubt that the issue can be triggered.
>> 
>>> A dtrace output (with DELAY of 1ms in the loop) for the biggest
>>> loop count on a three day test is this:
>>>
>>>  18  32865      kern_unlinkat:entry    path=bigfile, tid=101201,
>>> tid=101201, execname=rm
>>>  18  12747         ufs_remove:entry    gj=mirror/gmbkp4p5.journal,
>>> inum=11155630, blocks=22301568, size=11415525660
>>>  18  12748        ufs_remove:return    returncode=0, inum=11155630,
>>> blocks=22301568
>>>  18  18902       ffs_truncate:entry    gj=mirror/gmbkp4p5.journal,
>>> inum=11155630, size=11415525660, mnt_flag=0x12001040,
>>> mnt_kern_flag=0x40006142, blocks=22301568
>>>   6  33304  vfs_write_suspend:entry    gj=mirror/gmbkp4p5.journal,
>>> mnt_kern_flag=0x40006142, tid=100181
>>>   6  22140       vop_stdfsync:entry    mounted on /home, waitfor=1,
>>> numoutput=0, clean=10, dirty=6, secondary_writes=1
>>>  10  28117      bufobj_wwait:return    calls to bufobj_wait = 1,
>>> dirtycnt=2, secondary_writes=1
>>>  10  28117      bufobj_wwait:return    calls to bufobj_wait = 2,
>>> dirtycnt=1, secondary_writes=1
>>>  10  28117      bufobj_wwait:return    calls to bufobj_wait = 3,
>>> dirtycnt=1, secondary_writes=1
>>>  10  28117      bufobj_wwait:return    calls to bufobj_wait = 4,
>>> dirtycnt=3, secondary_writes=1
>>>  10  28117      bufobj_wwait:return    calls to bufobj_wait = 5,
>>> dirtycnt=2, secondary_writes=1
>>>   6  28117      bufobj_wwait:return    calls to bufobj_wait = 6,
>>> dirtycnt=3, secondary_writes=1
>>>   6  28117      bufobj_wwait:return    calls to bufobj_wait = 7,
>>> dirtycnt=3, secondary_writes=1
>>>   6  28117      bufobj_wwait:return    calls to bufobj_wait = 8,
>>> dirtycnt=3, secondary_writes=1
>>>   6  28117      bufobj_wwait:return    calls to bufobj_wait = 9,
>>> dirtycnt=3, secondary_writes=1
>>>   6  28117      bufobj_wwait:return    calls to bufobj_wait = 10,
>>> dirtycnt=2, secondary_writes=1
>>>   6  28117      bufobj_wwait:return    calls to bufobj_wait = 11,
>>> dirtycnt=2, secondary_writes=1
>>>   6  28117      bufobj_wwait:return    calls to bufobj_wait = 12,
>>> dirtycnt=3, secondary_writes=1
>>>   6  28117      bufobj_wwait:return    calls to bufobj_wait = 13,
>>> dirtycnt=3, secondary_writes=1
>>>   6  28117      bufobj_wwait:return    calls to bufobj_wait = 14,
>>> dirtycnt=3, secondary_writes=1
>>>   6  28117      bufobj_wwait:return    calls to bufobj_wait = 15,
>>> dirtycnt=4, secondary_writes=1
>>>   6  28117      bufobj_wwait:return    calls to bufobj_wait = 16,
>>> dirtycnt=3, secondary_writes=1
>>>   6  28117      bufobj_wwait:return    calls to bufobj_wait = 17,
>>> dirtycnt=3, secondary_writes=1
>>>   2  18903      ffs_truncate:return    returncode=0, inum=11155630, blocks=0
>>>   2  32866     kern_unlinkat:return    returncode=0, errno=0, number
>>> io's: 791/791
>>>   6  28117      bufobj_wwait:return    calls to bufobj_wait = 18,
>>> dirtycnt=3, secondary_writes=0
>>>   6  28117      bufobj_wwait:return    calls to bufobj_wait = 19,
>>> dirtycnt=0, secondary_writes=0
>>>   6  22141      vop_stdfsync:return    returncode=0, pid=26, tid=100181,
>>> spent 240373850 nsecs
>>>
>>> So the time spent in vop_stdfsync() is 0.24 sec in the worst case I
>>> found using DELAY with 1 ms. I would prefer this solution. My first
>>> approach (simply bumping maxretry from 1000 to 100000) is also ok, but the
>>> maximum time spent would rise to 0.5 sec. Perhaps you would like something
>>> like
>>>
>>>    if (maxretry < 1000 && maxretry % 10 == 0)
>>>           DELAY(waitns);
>>>
>>> That is also ok but does not make a noteworthy difference. The main
>>> argument remains: we have to wait until there are no dirty buffers left.
>>>
>>> For me the problem with the "giving up on dirty" is solved.
>> Will you provide the patch ?
> 
> Patch against HEAD:
> --- vfs_default.c.orig  2017-09-22 11:56:26.950084000 +0200
> +++ vfs_default.c       2017-09-22 11:58:33.211196000 +0200
> @@ -690,6 +690,8 @@
>                         bremfree(bp);
>                         bawrite(bp);
>                 }
> +               if (maxretry < 1000)
> +                       DELAY(1000); /* 1 ms */
>                 BO_LOCK(bo);
>                 goto loop2;
>         }

Do you need to use a busy loop here, or can you yield the CPU by using
something like pause(9)?


