directory listing hangs in "ufs" state
Alan Cox
alc at rice.edu
Fri Dec 23 07:56:03 UTC 2011
On 12/22/2011 03:48, Kostik Belousov wrote:
> On Wed, Dec 21, 2011 at 09:03:02PM +0400, Andrey Zonov wrote:
>> On 15.12.2011 17:01, Kostik Belousov wrote:
>>> On Thu, Dec 15, 2011 at 03:51:02PM +0400, Andrey Zonov wrote:
>>>> On Thu, Dec 15, 2011 at 12:42 AM, Jeremy Chadwick
>>>> <freebsd at jdc.parodius.com>wrote:
>>>>
>>>>> On Wed, Dec 14, 2011 at 11:47:10PM +0400, Andrey Zonov wrote:
>>>>>> On 14.12.2011 22:22, Jeremy Chadwick wrote:
>>>>>>> On Wed, Dec 14, 2011 at 10:11:47PM +0400, Andrey Zonov wrote:
>>>>>>>> Hi Jeremy,
>>>>>>>>
>>>>>>>> This is not hardware problem, I've already checked that. I also ran
>>>>>>>> fsck today and got no errors.
>>>>>>>>
>>>>>>>> After some more exploration of how mongodb works, I found that when
>>>>>>>> the listing hangs, one of the mongodb threads is in the "biowr" state
>>>>>>>> for a long time. According to the ktrace output, it periodically
>>>>>>>> calls msync(MS_SYNC).
>>>>>>>>
>>>>>>>> If I remove the msync() calls from mongodb, how often will the data
>>>>>>>> be synced by the OS?
>>>>>>>>
>>>>>>>> --
>>>>>>>> Andrey Zonov
>>>>>>>>
>>>>>>>> On 14.12.2011 2:15, Jeremy Chadwick wrote:
>>>>>>>>> On Wed, Dec 14, 2011 at 01:11:19AM +0400, Andrey Zonov wrote:
>>>>>>>>>> Have you any ideas what is going on? or how to catch the problem?
>>>>>>>>> Assuming this isn't a file on the root filesystem, try booting the
>>>>>>>>> machine in single-user mode and using "fsck -f" on the filesystem in
>>>>>>>>> question.
>>>>>>>>>
>>>>>>>>> Can you verify there's no problems with the disk this file lives on
>>>>>>>>> as
>>>>>>>>> well (smartctl -a /dev/disk)? I'm doubting this is the problem, but
>>>>>>>>> thought I'd mention it.
>>>>>>> I have no real answer, I'm sorry. msync(2) indicates it's effectively
>>>>>>> deprecated (see BUGS). It looks like this is effectively an
>>>>>>> mmap version of fsync(2).
>>>>>> I replaced msync(2) with fsync(2). Unfortunately, it is not obvious
>>>>>> from the man pages that I can do this. Anyway, thanks.
>>>>> Sorry, that wasn't what I was implying. Let me try to explain
>>>>> differently.
>>>>>
>>>>> msync(2) looks, to me, like an mmap-specific version of fsync(2). Based
>>>>> on the man page, it seems that with msync() you can guarantee flushing
>>>>> of certain pages within an mmap()'d region to disk, whereas fsync()
>>>>> flushes **all** dirty buffers/pages of the file to disk.
>>>>>
>>>>> One would need to look at the code to mongodb to find out what it's
>>>>> actually doing with msync(). That is to say, if it's doing something
>>>>> like this (I probably have the semantics wrong -- I've never spent much
>>>>> time with mmap()):
>>>>>
>>>>> fd = open("/some/file", O_RDWR);
>>>>> ptr = mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
>>>>> ret = msync(ptr, 65536, MS_SYNC);
>>>>> /* or alternatively, with length 0 to cover the whole region:
>>>>> ret = msync(ptr, 0, MS_SYNC);
>>>>> */
>>>>>
>>>>> Then this, to me, would be mostly the equivalent to:
>>>>>
>>>>> fd = open("/some/file", O_RDWR);
>>>>> ret = fsync(fd);
>>>>>
>>>>> Otherwise, if it's calling msync() only on a sub-range within the
>>>>> region ptr points to, then that may be more efficient (fewer pages to
>>>>> flush).
>>>>>
>>>> They call msync() for the whole file. So, there will not be any
>>>> difference.
>>>>
>>>>
>>>>> The mmap() arguments -- specifically flags (see the man page) -- also
>>>>> play a role here. The one that catches my attention is MAP_NOSYNC. So
>>>>> you may need to look at the mongodb code to figure out what its mmap()
>>>>> call is.
>>>>>
>>>>> One might wonder why they don't just use open() with O_SYNC. I imagine
>>>>> that has to do with, again, performance; possibly they don't want all
>>>>> I/O synchronous, and would rather flush certain pages in the mmap'd
>>>>> region to disk as needed. I see the legitimacy in that approach (vs.
>>>>> just using O_SYNC).
>>>>>
>>>>> There's really no easy way for me to tell you which is more efficient,
>>>>> better, blah blah without spending a lot of time with a benchmarking
>>>>> program that tests all of this, *plus* an entire system (world) built
>>>>> with profiling.
>>>>>
>>>> I ran mongodb with fsync() for two hours and got the following:
>>>> STARTED INBLK OUBLK MAJFLT MINFLT
>>>> Thu Dec 15 10:34:52 2011 3 192744 314 3080182
>>>>
>>>> This is output of `ps -o lstart,inblock,oublock,majflt,minflt -U mongodb'.
>>>>
>>>> Then I ran it with default msync():
>>>> STARTED INBLK OUBLK MAJFLT MINFLT
>>>> Thu Dec 15 12:34:53 2011 0 7241555 79 5401945
>>>>
>>>> There are also two graphs of disk busy time [1] [2].
>>>>
>>>> The difference is significant, a factor of 37! That is what I expected
>>>> to get.
>>>>
>>>> In the comments for vm_object_page_clean() I found this:
>>>>
>>>> * When stuffing pages asynchronously, allow clustering. XXX we need a
>>>> * synchronous clustering mode implementation.
>>>>
>>>> It means to me that msync(MS_SYNC) flushes each page to disk in a
>>>> separate I/O transaction. If we multiply 4K by 37 we get ~150K; that is
>>>> the size of a single clustered transaction in my experiment.
>>>>
>>>> +alc@, kib@
>>>>
>>>> Am I right? Is there any plan to implement this?
>>> The current buffer clustering code can only do async writes. In fact, I
>>> am not quite sure what would constitute sync clustering, because the
>>> ability to delay a write is important for being able to cluster at all.
>>>
>>> Also, I am not sure that the lack of clustering is the biggest problem.
>>> IMO, the fact that each write is synchronous is the first problem there.
>>> It would be quite some work to add tracking of the issued writes to
>>> vm_object_page_clean() and down the stack, especially due to the custom
>>> page write vops in several fses.
>>>
>>> The only guarantee that POSIX requires from msync(MS_SYNC) is that the
>>> writes are finished when the syscall returns, not that the writes are
>>> done synchronously. Below is a hack which should help if the msync()ed
>>> region contains the mapping of the whole file, since it is then possible
>>> to schedule all the writes asynchronously and fsync() the file
>>> afterwards. It will cause an unneeded metadata update, but I think it
>>> would still be much faster.
>>>
>>>
>>> diff --git a/sys/vm/vm_object.c b/sys/vm/vm_object.c
>>> index 250b769..a9de554 100644
>>> --- a/sys/vm/vm_object.c
>>> +++ b/sys/vm/vm_object.c
>>> @@ -938,7 +938,7 @@ vm_object_sync(vm_object_t object, vm_ooffset_t offset, vm_size_t size,
>>> vm_object_t backing_object;
>>> struct vnode *vp;
>>> struct mount *mp;
>>> - int flags;
>>> + int flags, fsync_after;
>>>
>>> if (object == NULL)
>>> return;
>>> @@ -971,11 +971,26 @@ vm_object_sync(vm_object_t object, vm_ooffset_t offset, vm_size_t size,
>>> (void) vn_start_write(vp, &mp, V_WAIT);
>>> vfslocked = VFS_LOCK_GIANT(vp->v_mount);
>>> vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
>>> - flags = (syncio || invalidate) ? OBJPC_SYNC : 0;
>>> - flags |= invalidate ? OBJPC_INVAL : 0;
>>> + if (syncio && !invalidate && offset == 0 &&
>>> + OFF_TO_IDX(size) == object->size) {
>>> + /*
>>> + * If syncing the whole mapping of the file,
>>> + * it is faster to schedule all the writes in
>>> + * async mode, also allowing the clustering,
>>> + * and then wait for i/o to complete.
>>> + */
>>> + flags = 0;
>>> + fsync_after = TRUE;
>>> + } else {
>>> + flags = (syncio || invalidate) ? OBJPC_SYNC : 0;
>>> + flags |= invalidate ? (OBJPC_SYNC | OBJPC_INVAL) : 0;
>>> + fsync_after = FALSE;
>>> + }
>>> VM_OBJECT_LOCK(object);
>>> vm_object_page_clean(object, offset, offset + size, flags);
>>> VM_OBJECT_UNLOCK(object);
>>> + if (fsync_after)
>>> + (void) VOP_FSYNC(vp, MNT_WAIT, curthread);
>>> VOP_UNLOCK(vp, 0);
>>> VFS_UNLOCK_GIANT(vfslocked);
>>> vn_finished_write(mp);
>> Thanks, this patch works. Performance is the same as of using fsync().
>>
>> Actually, Linux uses fsync() inside of msync() if MS_SYNC is set.
>> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=blob;f=mm/msync.c;h=632df4527c0122062d9332a0d483835274ed62f6;hb=HEAD
>>
> I see, indeed Linux fully fsyncs the whole file even if only a single page
> of it appears to be (non-shadowed) mmapped into the msync(MS_SYNC) region.
> I am not sure that we should follow this behaviour.
>
> Alan, do you agree with the patch above ?
Yes, it's ok.
Alan
More information about the freebsd-stable mailing list