fsync: giving up on dirty on ufs partitions running vfs_write_suspend()
Andreas Longwitz
longwitz at incore.de
Thu Sep 21 08:59:00 UTC 2017
Konstantin Belousov wrote:
> On Sat, Sep 16, 2017 at 01:44:44PM +0200, Andreas Longwitz wrote:
>> Ok, I understand your thoughts about the "big loop" and I agree. On the
>> other hand, it is not easy to measure the progress of the dirty buffers,
>> because these buffers are created by another process at the same time we
>> loop in vop_stdfsync(). I can explain from my tests, where I use the
>> following loop on a gjournaled partition:
>>
>> while true; do
>> cp -p bigfile bigfile.tmp
>> rm bigfile
>> mv bigfile.tmp bigfile
>> done
>>
>> When g_journal_switcher starts vfs_write_suspend() immediately after the
>> rm command has begun its work (ufs_inactive, ffs_truncate,
>> ffs_indirtrunc at different levels, ffs_blkfree, ...), then we must loop
>> (that means wait) in vop_stdfsync() until the rm process has finished.
>> A lot of locking overhead is needed for coordination. Returning from
>> bufobj_wwait() we always see one leftover dirty buffer (very seldom
>> two), which is not optimal. Therefore I have tried the following patch
>> (instead of bumping maxretry):
>>
>> --- vfs_default.c.orig	2016-10-24 12:26:57.000000000 +0200
>> +++ vfs_default.c	2017-09-15 12:30:44.792274000 +0200
>> @@ -688,6 +688,8 @@
>>  			bremfree(bp);
>>  			bawrite(bp);
>>  		}
>> +		if (maxretry < 1000)
>> +			DELAY(waitns);
>>  		BO_LOCK(bo);
>>  		goto loop2;
>>  	}
>>
>> with different values for waitns. If I run the test loop 5000 times on
>> my test server, the problem is triggered roughly 10 times per run. The
>> results from several runs are given in the following table:
>>
>> waitns     max time   max loops
>> -------------------------------
>> no DELAY   0.5 sec    8650 (maxretry = 100000)
>> 1000       0.2 sec    24
>> 10000      0.8 sec    3
>> 100000     7.2 sec    3
>>
>> "time" means the time spent in vop_stdfsync(), measured from entry to
>> return by a dtrace script. "loops" means the number of times
>> "--maxretry" is executed. I am not sure whether DELAY() is the best way
>> to wait, or whether waiting has other drawbacks. Anyway, with DELAY()
>> it does not take more than five iterations to finish.
> This is not explicitly stated in your message, but I suppose that
> vop_stdfsync() is called due to the VOP_FSYNC(devvp, MNT_SUSPEND) call
> in ffs_sync(). Am I right?
Yes, the stack trace given by the dtrace script always looks like this:
4 22140 vop_stdfsync:entry
kernel`devfs_fsync+0x26
kernel`VOP_FSYNC_APV+0xa7
kernel`ffs_sync+0x3bb
kernel`vfs_write_suspend+0x1cd
geom_journal.ko`g_journal_switcher+0x9a4
kernel`fork_exit+0x9a
kernel`0xffffffff8095502e
> If yes, then the solution is most likely to continue looping in
> vop_stdfsync() until there are no dirty buffers or the mount point's
> mnt_secondary_writes counter is zero. The pause trick you tried might
> still be useful, e.g. after some threshold of performed loop
> iterations.
I have checked your proposal and found that the mnt_secondary_writes
counter indeed goes to zero when the dirty buffers have reached zero.
During the loop the mnt_secondary_writes counter is always equal to one,
so there is nothing like a countdown, which is what Kirk wanted to see.
A dtrace output (with a DELAY of 1 ms in the loop) for the biggest loop
count in a three-day test is this:
18 32865 kern_unlinkat:entry path=bigfile, tid=101201,
tid=101201, execname=rm
18 12747 ufs_remove:entry gj=mirror/gmbkp4p5.journal,
inum=11155630, blocks=22301568, size=11415525660
18 12748 ufs_remove:return returncode=0, inum=11155630,
blocks=22301568
18 18902 ffs_truncate:entry gj=mirror/gmbkp4p5.journal,
inum=11155630, size=11415525660, mnt_flag=0x12001040,
mnt_kern_flag=0x40006142, blocks=22301568
6 33304 vfs_write_suspend:entry gj=mirror/gmbkp4p5.journal,
mnt_kern_flag=0x40006142, tid=100181
6 22140 vop_stdfsync:entry mounted on /home, waitfor=1,
numoutput=0, clean=10, dirty=6, secondary_writes=1
10 28117 bufobj_wwait:return calls to bufobj_wait = 1,
dirtycnt=2, secondary_writes=1
10 28117 bufobj_wwait:return calls to bufobj_wait = 2,
dirtycnt=1, secondary_writes=1
10 28117 bufobj_wwait:return calls to bufobj_wait = 3,
dirtycnt=1, secondary_writes=1
10 28117 bufobj_wwait:return calls to bufobj_wait = 4,
dirtycnt=3, secondary_writes=1
10 28117 bufobj_wwait:return calls to bufobj_wait = 5,
dirtycnt=2, secondary_writes=1
6 28117 bufobj_wwait:return calls to bufobj_wait = 6,
dirtycnt=3, secondary_writes=1
6 28117 bufobj_wwait:return calls to bufobj_wait = 7,
dirtycnt=3, secondary_writes=1
6 28117 bufobj_wwait:return calls to bufobj_wait = 8,
dirtycnt=3, secondary_writes=1
6 28117 bufobj_wwait:return calls to bufobj_wait = 9,
dirtycnt=3, secondary_writes=1
6 28117 bufobj_wwait:return calls to bufobj_wait = 10,
dirtycnt=2, secondary_writes=1
6 28117 bufobj_wwait:return calls to bufobj_wait = 11,
dirtycnt=2, secondary_writes=1
6 28117 bufobj_wwait:return calls to bufobj_wait = 12,
dirtycnt=3, secondary_writes=1
6 28117 bufobj_wwait:return calls to bufobj_wait = 13,
dirtycnt=3, secondary_writes=1
6 28117 bufobj_wwait:return calls to bufobj_wait = 14,
dirtycnt=3, secondary_writes=1
6 28117 bufobj_wwait:return calls to bufobj_wait = 15,
dirtycnt=4, secondary_writes=1
6 28117 bufobj_wwait:return calls to bufobj_wait = 16,
dirtycnt=3, secondary_writes=1
6 28117 bufobj_wwait:return calls to bufobj_wait = 17,
dirtycnt=3, secondary_writes=1
2 18903 ffs_truncate:return returncode=0, inum=11155630, blocks=0
2 32866 kern_unlinkat:return returncode=0, errno=0, number
io's: 791/791
6 28117 bufobj_wwait:return calls to bufobj_wait = 18,
dirtycnt=3, secondary_writes=0
6 28117 bufobj_wwait:return calls to bufobj_wait = 19,
dirtycnt=0, secondary_writes=0
6 22141 vop_stdfsync:return returncode=0, pid=26, tid=100181,
spent 240373850 nsecs
So the time spent in vop_stdfsync() is 0.24 sec in the worst case I
found using DELAY with 1 ms. I would prefer this solution. My first
approach (simply bumping maxretry from 1000 to 100000) is also ok, but
the maximum time spent then rises to 0.5 sec. Perhaps you would prefer
something like

    if (maxretry < 1000 && maxretry % 10 == 0)
            DELAY(waitns);

That is also ok, but it does not make a noteworthy difference. The main
argument remains: we have to wait until there are no dirty buffers left.
For me the problem with "giving up on dirty" is solved.
--
Andreas Longwitz