mmap() incoherency on hi I/O load (FS is zfs)

Wed Jul 4 10:00:41 UTC 2012

  --- Original message ---
 From: "Konstantin Belousov" <kostikbel at gmail.com>
 To: "Pavlo" <devgs at ukr.net>
 Date: 4 July 2012, 12:06:44
 Subject: Re: mmap() incoherency on hi I/O load (FS is zfs)

> On Wed, Jul 04, 2012 at 11:07:36AM +0300, Pavlo wrote:
> > 
> > 
> > 
> > --- Original message ---
> > From: "Pavlo" <devgs at ukr.net>
> > To: freebsd-fs at freebsd.org
> > Date: 14 June 2012, 13:30:20
> > Subject: mmap() incoherency on hi I/O load (FS is zfs)
> > 
> > 
> > > There is a case where parts of files that are mapped and then
> > > modified get corrupted. By corrupted I mean that some data is OK (the
> > > data written using write()/pwrite()) but some looks like it never existed.
> > > It seems to have lived in buffers for some time: several processes
> > > simultaneously (access was synchronised, of course) used the shared pages
> > > and reported its existence. But after some time they (the processes)
> > > screamed that it was now lost. Only the part of the data written with
> > > pwrite() was there. Everything that was written via mmap() was zero.
> > >
> > > As I said, it occurs under high I/O load, when 4+ background
> > > processes are indexing a huge amount of data. I also want to note that it
> > > never occurred in the life of our project while we used mmap() under the
> > > same I/O stress conditions when the mapping covered a whole file or just
> > > a part (the header) starting from the beginning of the file. The first
> > > time we mapped individual pages, just to save RAM, this popped up.
> > >
> > > The workaround for this problem is to call msync() before any munmap(). But the man page says:
> > >
> > >
> > 
> > The msync() system call is usually not needed since BSD implements a
> > coherent file system buffer cache.  However, it may be used to associate
> > dirty VM pages with file system buffers and thus cause them to be flushed
> > to physical media sooner rather than later.
> > > 
> > > Any thoughts? Thanks.
> > > 
> > > 
> > 
> > So I tracked the issue to the place where it occurs. When I commit data to
> > a file using mmap() and pwrite() side by side, sometimes the 'newest data'
> > is overwritten by 'elder data'. The 'elder data' can be something written
> > either with mmap() or with pwrite(). It never happens when I use
> > exclusively mmap() or exclusively pwrite(). This issue also reproduces on
> > UFS. I think there is a problem keeping mmapped pages and the FS cache
> > in sync.
> I am curious how you label data as newer and elder.

I have a list header like:

struct XXX
{
    uint32_t alloc_size;
    uint32_t list_size;
    node_t   node[1];
};

First I initialise it with pwrite(), setting for example alloc_size to 10 and everything else to 0.

Then I add elements through mmap().

1. Workers log that the elements exist...
2. Workers log that the elements exist...
... same thing for a few seconds.
X. One of the workers complains that the list is empty.

Then I inspect the core file and see that the list looks as if it had just been initialised with pwrite(), i.e. alloc_size equals 10 and everything else is 0.
It is hard to reproduce because it happens only under really high I/O load, and out of tens of thousands of such files only a couple get corrupted.
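
To make the sequence above concrete, here is a minimal single-process sketch of it: initialise the header with pwrite(), add one element through a shared mapping, unmap, and read the header back with pread(). The name check_list_coherency() and the definition of node_t are assumptions for illustration only (the real node_t is not shown here), and a single idle process like this is not expected to trigger the problem; it only shows what the workers compare.

```c
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

typedef uint32_t node_t;    /* assumption: stand-in for the real node type */

struct list_hdr {
    uint32_t alloc_size;
    uint32_t list_size;
    node_t   node[1];
};

/*
 * Returns 0 if pread() sees what was stored through the mapping,
 * 1 if it sees stale data, -1 if a system call failed.
 */
int check_list_coherency(const char *path)
{
    size_t pagesz = (size_t)sysconf(_SC_PAGESIZE);
    struct list_hdr init, seen;
    struct list_hdr *h;
    int rc = -1;
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);

    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)pagesz) != 0)
        goto out;

    /* Step 1: initialise with pwrite(): alloc_size = 10, the rest zero. */
    memset(&init, 0, sizeof(init));
    init.alloc_size = 10;
    if (pwrite(fd, &init, sizeof(init), 0) != (ssize_t)sizeof(init))
        goto out;

    /* Step 2: add one element through a shared mapping of the first page. */
    h = mmap(NULL, pagesz, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (h == MAP_FAILED)
        goto out;
    h->node[0] = 42;
    h->list_size = 1;
    if (munmap(h, pagesz) != 0)
        goto out;

    /* Step 3: read the header back through the file descriptor. */
    if (pread(fd, &seen, sizeof(seen), 0) != (ssize_t)sizeof(seen))
        goto out;
    rc = (seen.alloc_size == 10 && seen.list_size == 1 &&
          seen.node[0] == 42) ? 0 : 1;
out:
    close(fd);
    unlink(path);
    return rc;
}
```

A return value of 1 would correspond to the corrupted state described above, where the header looks freshly initialised and the element added through the mapping is gone.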

> 
> I do admit the possibility of a race in the ZFS double-copy implementation of
> the mmap/cache coherency, but I am somewhat skeptical about the same
> possibility for UFS. What you are saying might indicate that we lose
> modified/dirty bits for the page, but that would cause much more fireworks
> than just an occasional race with write.
> 
> What version of the system? Does the machine swap?

Okay, after msync() helped but did not fix the issue (it only reduced the occurrence), I did the following:
I tracked modifications of the mmapped pages using mprotect(). At the end of a session, before munmap(), I saved the modified pages, then called munmap(), and then wrote those pages back to disk.

Later a worker accessed those pages again with mmap(), modified them, and for some parts of those pages did a read() instead of accessing them via the mapping. What read() returned was the data committed in the previous session with write(), not the data that had just been modified by the same process via mmap(). We reproduced this again and again on UFS on FreeBSD, and only under high I/O load. We could never reproduce it on Linux (ext4).
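
The check that fails can be reduced to something like the sketch below: store through a live shared mapping and read the same bytes back with pread(), then store with pwrite() and read back through the mapping. The function name count_mismatches() and the file layout (two 32-bit slots at the start of the first page) are assumptions made for this example. On its own, a single process running this loop is unlikely to show anything; it is only the skeleton that workers would run while the machine is under heavy I/O load.

```c
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Alternates stores through a shared mapping and through pwrite(),
 * checking each store through the opposite path.
 * Returns the number of stale reads seen, or -1 on setup failure.
 */
long count_mismatches(const char *path, uint32_t rounds)
{
    size_t pagesz = (size_t)sysconf(_SC_PAGESIZE);
    volatile uint32_t *map;
    uint32_t i, seen;
    long bad = 0;
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);

    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)pagesz) != 0) {
        close(fd);
        unlink(path);
        return -1;
    }
    map = mmap(NULL, pagesz, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        unlink(path);
        return -1;
    }

    for (i = 1; i <= rounds; i++) {
        /* Slot 0: store via the mapping, read back via pread(). */
        map[0] = i;
        if (pread(fd, &seen, sizeof(seen), 0) != (ssize_t)sizeof(seen) ||
            seen != i)
            bad++;

        /* Slot 1: store via pwrite(), read back via the mapping. */
        if (pwrite(fd, &i, sizeof(i), (off_t)sizeof(i)) !=
            (ssize_t)sizeof(i) || map[1] != i)
            bad++;
    }

    munmap((void *)map, pagesz);
    close(fd);
    unlink(path);
    return bad;
}
```

With a coherent buffer cache the function should return 0 for any number of rounds; any positive result means one path returned data older than what the other path had just stored.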

> 
> > 
> > I will try to make a test that reliably reproduces the issue.
> Yes, isolated test case is the best route forward. It would either show
> a bug or demonstrate a misunderstanding on your part.

I am trying, but it's really hard to make an example that reproduces this issue.

Thanks for the reply.