directory listing hangs in "ufs" state
Alan Cox
alc at rice.edu
Fri Dec 23 07:56:03 UTC 2011
On 12/22/2011 03:48, Kostik Belousov wrote:
> On Wed, Dec 21, 2011 at 09:03:02PM +0400, Andrey Zonov wrote:
>> On 15.12.2011 17:01, Kostik Belousov wrote:
>>> On Thu, Dec 15, 2011 at 03:51:02PM +0400, Andrey Zonov wrote:
>>>> On Thu, Dec 15, 2011 at 12:42 AM, Jeremy Chadwick
>>>> <freebsd at jdc.parodius.com>wrote:
>>>>
>>>>> On Wed, Dec 14, 2011 at 11:47:10PM +0400, Andrey Zonov wrote:
>>>>>> On 14.12.2011 22:22, Jeremy Chadwick wrote:
>>>>>>> On Wed, Dec 14, 2011 at 10:11:47PM +0400, Andrey Zonov wrote:
>>>>>>>> Hi Jeremy,
>>>>>>>>
>>>>>>>> This is not hardware problem, I've already checked that. I also ran
>>>>>>>> fsck today and got no errors.
>>>>>>>>
>>>>>>>> After some more exploration of how mongodb works, I found that when
>>>>>>>> the listing hangs, one of the mongodb threads is in the "biowr" state
>>>>>>>> for a long time. According to the ktrace output, it periodically
>>>>>>>> calls msync(MS_SYNC).
>>>>>>>>
>>>>>>>> If I remove the msync() calls from mongodb, how often will the data
>>>>>>>> be synced by the OS?
>>>>>>>>
>>>>>>>> --
>>>>>>>> Andrey Zonov
>>>>>>>>
>>>>>>>> On 14.12.2011 2:15, Jeremy Chadwick wrote:
>>>>>>>>> On Wed, Dec 14, 2011 at 01:11:19AM +0400, Andrey Zonov wrote:
>>>>>>>>>> Have you any ideas what is going on? or how to catch the problem?
>>>>>>>>> Assuming this isn't a file on the root filesystem, try booting the
>>>>>>>>> machine in single-user mode and using "fsck -f" on the filesystem in
>>>>>>>>> question.
>>>>>>>>>
>>>>>>>>> Can you verify there's no problems with the disk this file lives on
>>>>>>>>> as
>>>>>>>>> well (smartctl -a /dev/disk)? I'm doubting this is the problem, but
>>>>>>>>> thought I'd mention it.
>>>>>>> I have no real answer, I'm sorry. msync(2) indicates it's effectively
>>>>>>> deprecated (see BUGS). It looks like this is effectively an
>>>>>>> mmap version of fsync(2).
>>>>>> I replaced msync(2) with fsync(2). Unfortunately, it is not obvious
>>>>>> from the man pages that I can do this. Anyway, thanks.
>>>>> Sorry, that wasn't what I was implying. Let me try to explain
>>>>> differently.
>>>>>
>>>>> msync(2) looks, to me, like an mmap-specific version of fsync(2). Based
>>>>> on the man page, it seems that with msync() you can guarantee flushing
>>>>> of certain pages within an mmap()'d region to disk, whereas fsync()
>>>>> flushes **all** dirty buffers/pages of the file to disk.
>>>>>
>>>>> One would need to look at the code to mongodb to find out what it's
>>>>> actually doing with msync(). That is to say, if it's doing something
>>>>> like this (I probably have the semantics wrong -- I've never spent much
>>>>> time with mmap()):
>>>>>
>>>>> fd = open("/some/file", O_RDWR);
>>>>> ptr = mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
>>>>> ret = msync(ptr, 65536, MS_SYNC);
>>>>> /* or alternatively, with length 0 to cover the whole region:
>>>>> ret = msync(ptr, 0, MS_SYNC);
>>>>> */
>>>>>
>>>>> Then this, to me, would be mostly the equivalent to:
>>>>>
>>>>> fd = open("/some/file", O_RDWR);
>>>>> ret = fsync(fd);
>>>>>
>>>>> Otherwise, if it's calling msync() only on a sub-range within the
>>>>> region ptr points to, then that may be more efficient (fewer pages to
>>>>> flush).
>>>>>
>>>> They call msync() for the whole file. So, there will not be any
>>>> difference.
>>>>
>>>>
>>>>> The mmap() arguments -- specifically flags (see the man page) -- also
>>>>> play a role here. The one that catches my attention is MAP_NOSYNC. So
>>>>> you may need to look at the mongodb code to figure out what its mmap()
>>>>> call is.
>>>>>
>>>>> One might wonder why they don't just use open() with O_SYNC. I imagine
>>>>> that has to do with, again, performance; possibly they don't want all
>>>>> I/O synchronous, and would rather flush certain pages in the mmap'd
>>>>> region to disk as needed. I see the legitimacy in that approach (vs.
>>>>> just using O_SYNC).
>>>>>
>>>>> There's really no easy way for me to tell you which is more efficient,
>>>>> better, blah blah without spending a lot of time with a benchmarking
>>>>> program that tests all of this, *plus* an entire system (world) built
>>>>> with profiling.
>>>>>
>>>> I ran mongodb with fsync() for two hours and got the following:
>>>> STARTED INBLK OUBLK MAJFLT MINFLT
>>>> Thu Dec 15 10:34:52 2011 3 192744 314 3080182
>>>>
>>>> This is output of `ps -o lstart,inblock,oublock,majflt,minflt -U mongodb'.
>>>>
>>>> Then I ran it with default msync():
>>>> STARTED INBLK OUBLK MAJFLT MINFLT
>>>> Thu Dec 15 12:34:53 2011 0 7241555 79 5401945
>>>>
>>>> There are also two graphs of disk busy time [1] [2].
>>>>
>>>> The difference is significant, a factor of 37! That is what I expected
>>>> to get.
>>>>
>>>> In the comments for vm_object_page_clean() I found this:
>>>>
>>>> * When stuffing pages asynchronously, allow clustering. XXX we need a
>>>> * synchronous clustering mode implementation.
>>>>
>>>> It means to me that msync(MS_SYNC) flushes each page to disk in a
>>>> separate I/O transaction. If we multiply 4K by 37 we get ~150K; that is
>>>> the size of a single clustered transaction in my experiment.
>>>>
>>>> +alc@, kib@
>>>>
>>>> Am I right? Is there any plan to implement this?
>>> The current buffer clustering code can only do async writes. In fact, I
>>> am not quite sure what would constitute sync clustering, because the
>>> ability to delay a write is important for being able to cluster at all.
>>>
>>> Also, I am not sure that the lack of clustering is the biggest problem.
>>> IMO, the fact that each write is synchronous is the first problem there.
>>> It would be quite some work to add tracking of the issued writes to
>>> vm_object_page_clean() and down the stack, especially due to the custom
>>> page write vops in several fses.
>>>
>>> The only guarantee that POSIX requires from msync(MS_SYNC) is that the
>>> writes are finished when the syscall returns, not that the writes are
>>> done synchronously. Below is a hack which should help if the msync()ed
>>> region contains the mapping of the whole file, since it is then possible
>>> to schedule all the writes asynchronously and fsync() the file
>>> afterwards. It will cause an unneeded metadata update, but I think it
>>> would still be much faster.
>>>
>>>
>>> diff --git a/sys/vm/vm_object.c b/sys/vm/vm_object.c
>>> index 250b769..a9de554 100644
>>> --- a/sys/vm/vm_object.c
>>> +++ b/sys/vm/vm_object.c
>>> @@ -938,7 +938,7 @@ vm_object_sync(vm_object_t object, vm_ooffset_t offset, vm_size_t size,
>>> vm_object_t backing_object;
>>> struct vnode *vp;
>>> struct mount *mp;
>>> - int flags;
>>> + int flags, fsync_after;
>>>
>>> if (object == NULL)
>>> return;
>>> @@ -971,11 +971,26 @@ vm_object_sync(vm_object_t object, vm_ooffset_t offset, vm_size_t size,
>>> (void) vn_start_write(vp, &mp, V_WAIT);
>>> vfslocked = VFS_LOCK_GIANT(vp->v_mount);
>>> vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
>>> - flags = (syncio || invalidate) ? OBJPC_SYNC : 0;
>>> - flags |= invalidate ? OBJPC_INVAL : 0;
>>> + if (syncio && !invalidate && offset == 0 &&
>>> + OFF_TO_IDX(size) == object->size) {
>>> + /*
>>> + * If syncing the whole mapping of the file,
>>> + * it is faster to schedule all the writes in
>>> + * async mode, also allowing the clustering,
>>> + * and then wait for i/o to complete.
>>> + */
>>> + flags = 0;
>>> + fsync_after = TRUE;
>>> + } else {
>>> + flags = (syncio || invalidate) ? OBJPC_SYNC : 0;
>>> + flags |= invalidate ? (OBJPC_SYNC | OBJPC_INVAL) : 0;
>>> + fsync_after = FALSE;
>>> + }
>>> VM_OBJECT_LOCK(object);
>>> vm_object_page_clean(object, offset, offset + size, flags);
>>> VM_OBJECT_UNLOCK(object);
>>> + if (fsync_after)
>>> + (void) VOP_FSYNC(vp, MNT_WAIT, curthread);
>>> VOP_UNLOCK(vp, 0);
>>> VFS_UNLOCK_GIANT(vfslocked);
>>> vn_finished_write(mp);
>> Thanks, this patch works. Performance is the same as of using fsync().
>>
>> Actually, Linux uses fsync() inside of msync() if MS_SYNC is set.
>> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=blob;f=mm/msync.c;h=632df4527c0122062d9332a0d483835274ed62f6;hb=HEAD
>>
> I see, indeed Linux fully fsyncs the whole file even if only a single page
> of it appears to be (non-shadowed) mmapped into the msync(MS_SYNC) region.
> I am not sure that we should follow this behaviour.
>
> Alan, do you agree with the patch above ?
Yes, it's ok.
Alan
More information about the freebsd-stable mailing list