mmap() incoherency on high I/O load (FS is ZFS)

Wed Jul 4 10:00:29 UTC 2012

> On Wed, Jul 04, 2012 at 12:25:55PM +0300, Pavlo wrote:
> > 
> >   
> > > On Wed, Jul 04, 2012 at 11:07:36AM +0300, Pavlo wrote:
> > > > 
> > > > 
> > > > > There's a case where some parts of files that are mapped and then
> > > > modified get corrupted. By corrupted I mean some data is OK (the part that
> > > > was written using write()/pwrite()) but some looks like it never existed.
> > > > It seems it sat in buffers for some time, while several processes simultaneously
> > > > (access was synchronised, of course) used the shared pages and reported its
> > > > existence. But after some time the processes complained that it was now
> > > > lost. Only the part of the data written with pwrite() was there. Everything that
> > > > was written via mmap() is zero.
> > > > >
> > > > > So, as I said, it occurs under high I/O load, when 4+ background
> > > > processes are indexing a huge amount of data. I also want to note that it
> > > > never occurred in the life of our project while we used mmap() under the
> > > > same I/O stress conditions, when mapping was done for a whole file or just
> > > > a part (the header) starting from the beginning of a file. This popped up the
> > > > first time we mapped individual pages, just to save RAM.
> > > > >
> > > > > The workaround for this problem is msync() before any munmap(). But the man page says:
> > > > >
> > > > >
> > > > 
> > > > The msync() system call is usually not needed since BSD implements a
> > > > coherent file system buffer cache.  However, it may be used to associate
> > > > dirty VM pages with file system buffers and thus cause them to be flushed
> > > > to physical media sooner rather than later.
> > > > > 
> > > > > Any thoughts? Thanks.
> > > > > 
> > > > > 
> > > > 
> > > > So I tracked the issue to the place where it occurs. When I commit data to
> > > > a file using mmap() and pwrite() side by side, sometimes the 'newest data' is
> > > > overwritten by 'elder data'. The 'elder data' can be
> > > > something written with either mmap() or pwrite(). It never happens when
> > > > I use exclusively mmap() or exclusively pwrite(). This issue reproduces on
> > > > UFS as well. I think there is a problem keeping mmapped pages and the FS cache
> > > > synced.
> > > I am curious how you label data as newer or elder.
> > 
> > I have list header like:
> > 
> > struct XXX
> > {
> >     uint32_t alloc_size;
> >     uint32_t list_size;
> >     node_t   node[1];
> > };
> > 
> > First I init it with pwrite(), setting for example alloc_size to 10 and everything else to 0.
> > 
> > Then I add elements via mmap().
> > 
> > 1. Workers log that the elements exist...
> > 2. Workers log that the elements exist...
> > ... same thing for a few seconds.
> > X. One of the workers complains that the list is empty.
> > 
> > Then I inspect the core file and see that the list looks as if it had just been initialised with pwrite(), i.e. alloc_size equals 10 and everything else is 0.
> > It is hard to reproduce because it happens only under really high I/O load, and out of tens of thousands of such files only a couple get corrupted.
> > 
> > > 
> > > I do admit the possibility of a race in the ZFS double-copy implementation of
> > > mmap/cache coherency, but I am somewhat skeptical about the same possibility
> > > for UFS. What you are saying might indicate that we lose modified/dirty bits
> > > for the page, but that would cause much more fireworks than just an occasional
> > > race with write.
> > > 
> > > What version of the system? Does the machine swap?
> You just ignored these ^^^^^^^^^^^^ questions.

Sorry, I forgot to answer. I did in the next reply, but I'll repeat it anyway:

uname -a
FreeBSD zfs1.dev.ukr.net 8.2-STABLE FreeBSD 8.2-STABLE #7: Wed Aug  3 11:41:58 EEST 2011     root at dev.ukr.net:/usr/obj/usr/src/sys/DEV  i386

Swap is turned off. For known reasons.

Also, maybe I confused you by mixing different cases. The list header problem _does_not_reproduce_on_UFS_, only on ZFS.

> 
> > 
> > Okay, after msync() helped but didn't fix the issue (it just reduced the occurrence), I did the next thing:
> > I tracked modification of mmapped pages using mprotect(). At the end of a session, before munmap(), I saved the modified pages, then called munmap(), then wrote those pages back to disk.
> > 
> > Later a worker accessed those pages again with mmap(), modified them, and for some parts of those pages did read() instead of accessing them via mmap(). What read() returned was the data committed in the previous session with write(), not the data that had just been modified by the same process via mmap(). We reproduced this again and again on UFS on FreeBSD, and only under high I/O load. We could never reproduce it on Linux (ext4), though.
> > 
> So you are saying that the following sequence:
> 1. write at offset X
> 2. write into the shared mapping of the same file at offset X
> 3. read at offset X
> performed by a single thread can return the data from point (1) instead of
> the data from point (2)?
> 
> Knowing how write is implemented for UFS, I find this quite impossible.
> 
> If the actions are executed in the different processes/threads, say
> process 1 executes (1, 2) and process 2 executes (3), or process 1
> executes (1), and process 2 executes (2, 3), then my first guess would
> be a lack of proper synchronization between actions. This would indeed
> make possible exactly the outcome I described.

This was tested _ONLY_ on UFS. 

Process 1:

1. Write at offset X with mmap();
2. Commit that page again after munmap() with write().

Later, process 2:

1. Read at offset X with mmap();
2. Write at offset X with mmap();
3. Read at offset X with read() and see data written by process 1 in (2).

All operations are guarded by a lock. It never reproduces on Linux.
When I remove step (2) for process 1, it never reproduces on UFS but does on ZFS (as I wrote before).
Of course, these may be my mistakes. But the same things done exclusively via mmap() or exclusively via read/write never break the file.
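
For reference, the order of calls in the two step lists above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the real test case: everything runs sequentially in one process (the real setup uses two processes under a lock and heavy background I/O), pread()/pwrite() stand in for read()/write(), X is offset 0, and the helper name check_mmap_read_coherency is made up for this sketch. It is not expected to reproduce the problem on an idle machine.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the original report).
 * Returns 0 if pread() sees what was just stored through the shared
 * mapping, 1 if stale data is observed, -1 on a setup error.
 */
int check_mmap_read_coherency(const char *path)
{
    long pagesz = sysconf(_SC_PAGESIZE);
    assert(pagesz > 0);
    size_t len = (size_t)pagesz;
    char *buf = NULL;
    char *map;
    ssize_t n;
    int rc = -1;

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    buf = malloc(len);
    if (buf == NULL || ftruncate(fd, (off_t)len) != 0)
        goto out;

    /* Process 1, step 1: store through a shared mapping, then unmap. */
    map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        goto out;
    memset(map, 'A', len);
    if (munmap(map, len) != 0)
        goto out;

    /* Process 1, step 2: commit the same page again with pwrite(). */
    memset(buf, 'A', len);
    if (pwrite(fd, buf, len, 0) != (ssize_t)len)
        goto out;

    /* Process 2, steps 1-2: read, then modify, through a new mapping. */
    map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        goto out;
    if (map[0] != 'A') {        /* the mapping itself shows stale data */
        rc = 1;
        munmap(map, len);
        goto out;
    }
    memset(map, 'B', len);

    /* Process 2, step 3: read the same offset back with pread(). */
    n = pread(fd, buf, len, 0);
    if (n == (ssize_t)len)
        rc = (memcmp(buf, map, len) == 0) ? 0 : 1;
    munmap(map, len);

out:
    free(buf);
    close(fd);
    unlink(path);
    return rc;
}
```

Calling check_mmap_read_coherency() on a scratch file and checking for a nonzero result is enough to exercise the sequence; a real reproducer would presumably need to split the two halves into separate processes and run under the same I/O load.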

> > > 
> > > > 
> > > > I will try to make a test that reliably reproduces the issue.
> > > Yes, an isolated test case is the best route forward. It would either show
> > > a bug or demonstrate a misunderstanding on your part.
> > 
> > I am trying, but it's really hard to make an example that reproduces this issue.
> This seems to be the only way forward, at least for you.
> And do answer the version/swap question.
> 

Roger that. Thanks for the reply.