directory listing hangs in "ufs" state
Andrey Zonov
andrey at zonov.org
Thu Dec 15 11:51:05 UTC 2011
On Thu, Dec 15, 2011 at 12:42 AM, Jeremy Chadwick
<freebsd at jdc.parodius.com>wrote:
> On Wed, Dec 14, 2011 at 11:47:10PM +0400, Andrey Zonov wrote:
> > On 14.12.2011 22:22, Jeremy Chadwick wrote:
> > >On Wed, Dec 14, 2011 at 10:11:47PM +0400, Andrey Zonov wrote:
> > >>Hi Jeremy,
> > >>
> > >>This is not hardware problem, I've already checked that. I also ran
> > >>fsck today and got no errors.
> > >>
> > >>After some more exploration of how mongodb works, I found that then
> > >>listing hangs, one of mongodb thread is in "biowr" state for a long
> > >>time. It periodically calls msync(MS_SYNC) accordingly to ktrace
> > >>out.
> > >>
> > >>If I'll remove msync() calls from mongodb, how often data will be
> > >>sync by OS?
> > >>
> > >>--
> > >>Andrey Zonov
> > >>
> > >>On 14.12.2011 2:15, Jeremy Chadwick wrote:
> > >>>On Wed, Dec 14, 2011 at 01:11:19AM +0400, Andrey Zonov wrote:
> > >>>>
> > >>>>Have you any ideas what is going on? or how to catch the problem?
> > >>>
> > >>>Assuming this isn't a file on the root filesystem, try booting the
> > >>>machine in single-user mode and using "fsck -f" on the filesystem in
> > >>>question.
> > >>>
> > >>>Can you verify there's no problems with the disk this file lives on as
> > >>>well (smartctl -a /dev/disk)? I'm doubting this is the problem, but
> > >>>thought I'd mention it.
> > >
> > >I have no real answer, I'm sorry. msync(2) indicates it's effectively
> > >deprecated (see BUGS). It looks like this is effectively a mmap-version
> > >of fsync(2).
> >
> > I replaced msync(2) with fsync(2). Unfortunately, from man pages it
> > is not obvious that I can do this. Anyway, thanks.
>
> Sorry, that wasn't what I was implying. Let me try to explain
> differently.
>
> msync(2) looks, to me, like an mmap-specific version of fsync(2). Based
> on the man page, it seems that the with msync() you can effectively
> guaranteed flushing of certain pages within an mmap()'d region to disk.
> fsync() would flush **all** buffers/internal pages to be flushed to
> disk.
>
> One would need to look at the code to mongodb to find out what it's
> actually doing with msync(). That is to say, if it's doing something
> like this (I probably have the semantics wrong -- I've never spent much
> time with mmap()):
>
> fd = open("/some/file", O_RDWR);
> ptr = mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
> ret = msync(ptr, 65536, MS_SYNC);
> /* or alternatively, this:
> ret = msync(ptr, NULL, MS_SYNC);
> */
>
> Then this, to me, would be mostly the equivalent to:
>
> fd = fopen("/some/file", "r+");
> ret = fsync(fd);
>
> Otherwise, if it's calling msync() only on an address/location within
> the region ptr points to, then that may be more efficient (less pages to
> flush).
>
They call msync() for the whole file. So, there will not be any difference.
> The mmap() arguments -- specifically flags (see man page) -- also play
> a role here. The one that catches my attention is MAP_NOSYNC. So you
> may need to look at the mongodb code to figure out what it's mmap()
> call is.
>
> One might wonder why they don't just use open() with the O_SYNC. I
> imagine that has to do with, again, performance; possibly the don't want
> all I/O synchronous, and would rather flush certain pages in the mmap'd
> region to disk as needed. I see the legitimacy in that approach (vs.
> just using O_SYNC).
>
> There's really no easy way for me to tell you which is more efficient,
> better, blah blah without spending a lot of time with a benchmarking
> program that tests all of this, *plus* an entire system (world) built
> with profiling.
>
I ran for two hours mongodb with fsync() and got the following:
STARTED INBLK OUBLK MAJFLT MINFLT
Thu Dec 15 10:34:52 2011 3 192744 314 3080182
This is output of `ps -o lstart,inblock,oublock,majflt,minflt -U mongodb'.
Then I ran it with default msync():
STARTED INBLK OUBLK MAJFLT MINFLT
Thu Dec 15 12:34:53 2011 0 7241555 79 5401945
There are also two graphics of disk business [1] [2].
The difference is significant, in 37 times! That what I expected to get.
In commentaries for vm_object_page_clean() I found this:
* When stuffing pages asynchronously, allow clustering. XXX we need a
* synchronous clustering mode implementation.
It means for me that msync(MS_SYNC) flush every page on disk in single IO
transaction. If we multiply 4K and 37 we get 150K. This number is size of
the single transaction in my experience.
+alc@, kib@
Am I right? Is there any plan to implement this?
> All of this would really fall into the hands of the mongodb people to
> figure out, if you ask me. But I should note that mmap() on BSD behaves
> and performs very differently than on, say, Linux; so if the authors
> wrote what they did intended for Linux systems, I wouldn't be too
> surprised. :-)
>
https://jira.mongodb.org/browse/SERVER-663
> > >I'm extremely confused by this problem. What you're describing above
> is
> > >that the process is "stuck in biowr state for a long time", but what you
> > >stated originally was that the process was "stuck in ufs state for a
> > >few minutes":
> >
> > Listing of the directory with mongodb files by ls(1) stuck in "ufs"
> > state when one of mongodb's thread in "biowr" state. It looks like
> > system holds global lock of the file which is msync(2)-ed and can't
> > immediately return from lstat(2) call.
>
> Thanks for the clarification -- yes this helps. To some degree it makes
> sense, some piece of the filesystem or VFS layer is blocking
> intentionally. How to figure out what layer I do not know. Kernel
> folks familiar with this aspect would need to chime in here.
>
> > >>I've got STABLE-8 (r221983) with mongodb-1.8.1 installed on it. A
> > >>couple days ago I observed that listing of mongodb directory stuck in
> > >>a few minutes in "ufs" state.
> > >
> > >Can we narrow down what we're talking about here? Does the process
> > >actually deadlock? Or are you concerned about performance implications?
> > >
> > >I know nothing about this "mongodb" software, but the reason it's
> > >calling msync() is because it wants to try and ensure that the data it
> > >changed in an mmap()-mapped page to be reflected (fully written) on the
> > >disk. This behaviour is fairly common within database software, but
> > >"how often" the software chooses to do this is entirely a design
> > >implementation choice by the authors.
> > >
> > >Meaning: if mongodb is either 1) continually calling msync(), or 2)
> > >waiting for too long a period of time before calling msync(),
> > >performance within the process will suffer. #1 could result in overall
> > >bad performance, while #2 could result in a process that's spending a
> > >lot of time doing I/O (flushing to disk) and therefore appears
> > >"deadlocked" when in fact the kernel/subsystems are doing exactly what
> > >they were told to do.
> > >
> > >Removing the msync() call could result in inconsistent data (possibly
> > >non-recoverable) if the mongodb software crashes or if some other piece
> > >(thread or child? Not sure) expects to open a new fd on that file which
> > >has mmap()'d data.
> >
> > Yes, I clearly understand this. I think of any system tuning
> > instead, but nothing arose in my head.
>
> Nor I. I think this is more of a userland/application thing than a
> kernel thing, but there is a love-and-hate relationship between userland
> and kernel when it comes to the above syscalls and framework.
>
> Wish I could be of more help -- sorry. :-(
>
Your help was very important. Thanks!
>
> --
> | Jeremy Chadwick jdc at parodius.com |
> | Parodius Networking http://www.parodius.com/ |
> | UNIX Systems Administrator Mountain View, CA, US |
> | Making life hard for others since 1977. PGP 4BD6C0CB |
>
>
[1] http://zonov.org/mongodb/155647.png
[2] http://zonov.org/mongodb/155657.png
--
Andrey Zonov
More information about the freebsd-stable
mailing list