fsync: giving up on dirty on ufs partitions running vfs_write_suspend()

Andreas Longwitz longwitz at incore.de
Sat Sep 16 11:50:05 UTC 2017


Hello Kirk,

>> Second I found that the "dirty" situation during vfs_write_suspend()
>> only occurs when a big file (more than 10G on a partition of 116G) is
>> removed. If vfs_write_suspend() is called immediately after "rm
>> bigfile", then in vop_stdfsync() 1000 tries (maxretry) are done to wait
>> for the "rm bigfile" to complete. Because a lot of bitmap writes must be
>> done, the value 1000 is not sufficient on my servers. I have increased
>> maxretry and in the worst case I saw 8650 tries to complete without
>> "dirty". In this case the time spent in vop_stdfsync() was about 0,5
>> seconds. The following patch solves the "dirty problem" for me:
>>
>> --- vfs_default.c.orig  2016-10-24 12:26:57.000000000 +0200
>> +++ vfs_default.c       2017-09-08 12:49:18.059970000 +0200
>> @@ -644,7 +644,7 @@
>>         struct bufobj *bo;
>>         struct buf *nbp;
>>         int error = 0;
>> -       int maxretry = 1000;     /* large, arbitrarily chosen */
>> +       int maxretry = 100000;   /* large, arbitrarily chosen */
>>
>>         bo = &vp->v_bufobj;
>>         BO_LOCK(bo);
> 
> This message has plagued me for years. It started out as a panic,
> then got changed to a printf because I could not get rid of it. I
> was never able to figure out why it should take more than five
> iterations to finish, but obviously it takes more. The 1000 number
> was picked because that just seemed insanely large and I did not
> want to iterate forever. I have no problem with bumping up the
> iteration count if there is some way to figure out that each iteration
> is making forward progress (so we know that we are not in an infinite
> loop). Can you come up with a scheme that can measure forward progress?
> I would much prefer that to just making this number ever bigger.
> 
> 	Kirk McKusick

Ok, I understand your thoughts about the "big loop" and I agree. On the
other hand it is not easy to measure the progress on the dirty buffers,
because these buffers are created by another process at the same time we
loop in vop_stdfsync(). I can explain this from my tests, where I use
the following loop on a gjournaled partition (a rough sketch of a
possible progress check follows the loop):

   while true; do
      cp -p bigfile bigfile.tmp
      rm bigfile
      mv bigfile.tmp bigfile
   done
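
One way to make the forward-progress question concrete: remember the
dirty count of the previous pass and charge a retry against maxretry
only when that count did not shrink. The following is just an untested
sketch against the retry logic at the end of vop_stdfsync(); the local
variable prevcnt is new, and as said above the concurrent writer can
re-dirty buffers, so a non-shrinking count does not prove that we are
stuck:

        int prevcnt = -1;       /* dirty count seen on the previous pass */
        ...
        if (bo->bo_dirty.bv_cnt > 0) {
                ...
                if (prevcnt < 0 || bo->bo_dirty.bv_cnt < prevcnt) {
                        /* Something was flushed since the last pass. */
                        prevcnt = bo->bo_dirty.bv_cnt;
                        goto loop1;
                }
                prevcnt = bo->bo_dirty.bv_cnt;
                if (error == 0 && --maxretry >= 0)
                        goto loop1;
                error = EAGAIN;
        }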

When g_journal_switcher starts vfs_write_suspend() immediately after the
rm command has begun its "rm stuff" (ufs_inactive, ffs_truncate,
ffs_indirtrunc at different levels, ffs_blkfree, ...), then we must loop
(that means wait) in vop_stdfsync() until the rm process has finished
its work. A lot of locking overhead is needed for coordination. On
return from bufobj_wwait() we always see one remaining dirty buffer
(very seldom two), which is not optimal. Therefore I have tried the
following patch (instead of bumping maxretry):

--- vfs_default.c.orig  2016-10-24 12:26:57.000000000 +0200
+++ vfs_default.c       2017-09-15 12:30:44.792274000 +0200
@@ -688,6 +688,8 @@
                        bremfree(bp);
                        bawrite(bp);
                }
+               if (maxretry < 1000)
+                       DELAY(waitns);
                BO_LOCK(bo);
                goto loop2;
        }

with different values for waitns. If I run the test loop 5000 times on
my test server, the problem is triggered roughly 10 times per run. The
results from several runs are given in the following table:

    waitns    max time   max loops
    -------------------------------
  no DELAY     0.5 sec    8650  (maxretry = 100000)
      1000     0.2 sec      24
     10000     0.8 sec       3
    100000     7.2 sec       3

"time" means spent time in vop_stdfsync() measured from entry to return
by a dtrace script. "loops" means the number of times "--maxretry" is
executed. I am not sure if DELAY() is the best way to wait or if waiting
has other drawbacks. Anyway with DELAY() it does not take more than five
iterazions to finish.
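
A possible alternative to DELAY(), which busy-waits for the given number
of microseconds, would be pause(9), which sleeps instead of spinning but
rounds the wait up to at least one clock tick. An untested variant of
the hunk above (the wmesg string "stdfsy" is arbitrary):

                if (maxretry < 1000) {
                        /*
                         * Untested sketch: sleep for one tick instead
                         * of spinning in DELAY().  The bufobj lock is
                         * not held here, so sleeping is allowed.
                         */
                        pause("stdfsy", 1);
                }
                BO_LOCK(bo);
                goto loop2;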

-- 
Andreas Longwitz
